linkcheck.robotparser2

Robots.txt parser.

The robots.txt Exclusion Protocol is implemented as specified in https://www.robotstxt.org/norobots-rfc.txt

Classes

Entry()

An entry has one or more user-agents and zero or more rulelines.

RobotFileParser(session[, url, auth, timeout])

This class provides a set of methods to read, parse and answer questions about a single robots.txt file.

RuleLine(path, allowance)

A rule line is a single "Allow:" (allowance==1) or "Disallow:" (allowance==0) followed by a path.

class linkcheck.robotparser2.RobotFileParser(session, url='', auth=None, timeout=None)[source]

Bases: object

This class provides a set of methods to read, parse and answer questions about a single robots.txt file.

Initialize internal entry lists and store the given url and credentials.

can_fetch(useragent, url)[source]

Using the parsed robots.txt, decide whether useragent can fetch url.

Returns:

True if agent can fetch url, else False

Return type:

bool
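
A minimal usage sketch of constructing a parser, reading a robots.txt file and querying it. The requests.Session argument, the example host and the agent name are assumptions used for illustration only:

    import requests
    from linkcheck.robotparser2 import RobotFileParser

    # Assumption: the session argument accepts a requests.Session-like object.
    session = requests.Session()
    rp = RobotFileParser(session, url="https://example.com/robots.txt")
    rp.read()  # fetch and parse the robots.txt file

    # True if the "LinkChecker" agent may fetch the given URL.
    allowed = rp.can_fetch("LinkChecker", "https://example.com/some/page.html")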

get_crawldelay(useragent)[source]

Look for a configured crawl delay.

Returns:

crawl delay in seconds, or zero if no crawl delay is configured

Return type:

integer >= 0
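
A sketch of honoring a configured crawl delay before the next request to the same host; rp is a parser that has already read a robots.txt file, as in the sketch above:

    import time

    delay = rp.get_crawldelay("LinkChecker")
    if delay > 0:
        time.sleep(delay)  # wait before issuing the next request to this host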

modified()[source]

Set the time the robots.txt file was last fetched to the current time.

mtime()[source]

Returns the time the robots.txt file was last fetched.

This is useful for long-running web spiders that need to check for new robots.txt files periodically.

Returns:

time of the last fetch, in time.time() format

Return type:

number
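
A sketch of the periodic re-fetch pattern mentioned above; the one-hour threshold is an arbitrary example value, not part of the API:

    import time

    MAX_AGE = 3600  # seconds; example threshold only
    if time.time() - rp.mtime() > MAX_AGE:
        rp.read()      # fetch the robots.txt file again
        rp.modified()  # record the fetch time, in case read() does not do so itself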

parse(lines)[source]

Parse the input lines from a robots.txt file. A user-agent: line need not be preceded by one or more blank lines.

Returns:

None
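
A sketch of feeding pre-fetched robots.txt content directly to the parser; whether the lines need trailing newlines is an assumption not covered by this documentation:

    lines = [
        "User-agent: *",
        "Disallow: /private/",
        "Crawl-delay: 5",
    ]
    rp.parse(lines)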

read()[source]

Read the robots.txt URL and feed its contents to the parser.

set_url(url)[source]

Set the URL referring to a robots.txt file.
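
A sketch of reusing one parser instance for a different robots.txt file; the example URL is a placeholder:

    rp.set_url("https://example.org/robots.txt")
    rp.read()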