linkcheck.robotparser2

Robots.txt parser.

The robots.txt Exclusion Protocol is implemented as specified in https://www.robotstxt.org/norobots-rfc.txt

Classes

Entry()

An entry has one or more user-agents and zero or more rulelines.

RobotFileParser(session[, url, auth, timeout])

This class provides a set of methods to read, parse and answer questions about a single robots.txt file.

RuleLine(path, allowance)

A rule line is a single "Allow:" (allowance==1) or "Disallow:" (allowance==0) followed by a path.

class linkcheck.robotparser2.RobotFileParser(session, url='', auth=None, timeout=None)[source]

Bases: object

This class provides a set of methods to read, parse and answer questions about a single robots.txt file.

Initialize internal entry lists and store the given url and credentials.

can_fetch(useragent, url)[source]

Using the parsed robots.txt, decide whether useragent can fetch url.

Returns:

True if agent can fetch url, else False

Return type:

bool
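
A minimal usage sketch of constructing a parser, reading a robots.txt file and querying it. The requests.Session argument, the example host and the agent name are assumptions used for illustration only:

    import requests
    from linkcheck.robotparser2 import RobotFileParser

    # Assumption: the session argument accepts a requests.Session-like object.
    session = requests.Session()
    rp = RobotFileParser(session, url="https://example.com/robots.txt")
    rp.read()  # fetch and parse the robots.txt file

    # True if the "LinkChecker" agent may fetch the given URL.
    allowed = rp.can_fetch("LinkChecker", "https://example.com/some/page.html")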

get_crawldelay(useragent)[source]

Look for a configured crawl delay.

Returns:

crawl delay in seconds, or zero if no crawl delay is configured

Return type:

integer >= 0
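
A sketch of honoring a configured crawl delay before the next request to the same host; rp is a parser that has already read a robots.txt file, as in the sketch above:

    import time

    delay = rp.get_crawldelay("LinkChecker")
    if delay > 0:
        time.sleep(delay)  # wait before issuing the next request to this host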

modified()[source]

Set the time the robots.txt file was last fetched to the current time.

mtime()[source]

Returns the time the robots.txt file was last fetched.

This is useful for long-running web spiders that need to check for new robots.txt files periodically.

Returns:

time of the last fetch, in time.time() format

Return type:

number
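
A sketch of the periodic re-fetch pattern mentioned above; the one-hour threshold is an arbitrary example value, not part of the API:

    import time

    MAX_AGE = 3600  # seconds; example threshold only
    if time.time() - rp.mtime() > MAX_AGE:
        rp.read()      # fetch the robots.txt file again
        rp.modified()  # record the fetch time, in case read() does not do so itself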

parse(lines)[source]

Parse the input lines from a robots.txt file. A user-agent: line need not be preceded by one or more blank lines.

Returns:

None
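
A sketch of feeding pre-fetched robots.txt content directly to the parser; whether the lines need trailing newlines is an assumption not covered by this documentation:

    lines = [
        "User-agent: *",
        "Disallow: /private/",
        "Crawl-delay: 5",
    ]
    rp.parse(lines)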

read()[source]

Read the robots.txt URL and feed its contents to the parser.

set_url(url)[source]

Set the URL referring to a robots.txt file.
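
A sketch of reusing one parser instance for a different robots.txt file; the example URL is a placeholder:

    rp.set_url("https://example.org/robots.txt")
    rp.read()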