linkcheck.robotparser2
Robots.txt parser.
The robots.txt Exclusion Protocol is implemented as specified in https://www.robotstxt.org/norobots-rfc.txt
Classes

| Entry | An entry has one or more user-agents and zero or more rulelines. |
| RobotFileParser | This class provides a set of methods to read, parse and answer questions about a single robots.txt file. |
| RuleLine | A rule line is a single "Allow:" (allowance==1) or "Disallow:" (allowance==0) followed by a path. |
- class linkcheck.robotparser2.RobotFileParser(session, url='', auth=None, timeout=None)[source]
Bases: object
This class provides a set of methods to read, parse and answer questions about a single robots.txt file.
Initialize internal entry lists and store the given url and credentials.
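A minimal construction sketch, under stated assumptions: that session is a requests.Session, that url is the robots.txt URL of the target site, and that a read() method fetches and parses the file (suggested by the "read, parse and answer questions" description above but not documented here). The URL and timeout value are placeholders.

    import requests

    from linkcheck.robotparser2 import RobotFileParser

    # Assumed setup: a requests.Session for HTTP access and the site's robots.txt URL.
    session = requests.Session()
    rp = RobotFileParser(session, url="https://example.com/robots.txt", timeout=30)
    rp.read()  # assumed fetch/parse step, not part of the signatures documented here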
- can_fetch(useragent, url)[source]
Using the parsed robots.txt, decide whether useragent can fetch url.
- Returns:
True if agent can fetch url, else False
- Return type:
bool
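A hedged usage sketch for can_fetch(): the setup mirrors the constructor example above, and the user agent string and target URL are placeholders, not values from this documentation.

    import requests

    from linkcheck.robotparser2 import RobotFileParser

    session = requests.Session()
    rp = RobotFileParser(session, url="https://example.com/robots.txt")
    rp.read()  # assumed fetch/parse step (see the constructor sketch above)

    # Ask whether this user agent may request a specific page.
    if rp.can_fetch("MyCrawler/1.0", "https://example.com/private/page.html"):
        print("allowed by robots.txt")
    else:
        print("disallowed by robots.txt")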
- get_crawldelay(useragent)[source]
Look for a configured crawl delay.
- Returns:
crawl delay in seconds, or zero if no delay is configured
- Return type:
integer >= 0
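A sketch of honoring a site-configured Crawl-delay with get_crawldelay(); as above, the session, URL, user agent string, and read() call are assumptions or placeholders.

    import time

    import requests

    from linkcheck.robotparser2 import RobotFileParser

    session = requests.Session()
    rp = RobotFileParser(session, url="https://example.com/robots.txt")
    rp.read()  # assumed fetch/parse step

    # Zero means no crawl delay is configured for this user agent.
    delay = rp.get_crawldelay("MyCrawler/1.0")
    if delay > 0:
        time.sleep(delay)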
- mtime()[source]
Returns the time the robots.txt file was last fetched.
This is useful for long-running web spiders that need to check for new robots.txt files periodically.
- Returns:
time of the last fetch, in time.time() format
- Return type:
number
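A sketch of the staleness check mtime() is meant for: a long-running spider compares the stored fetch time against the current time and re-reads the file when it is too old. The MAX_AGE policy, the URL, and the read() call are illustrative assumptions.

    import time

    import requests

    from linkcheck.robotparser2 import RobotFileParser

    MAX_AGE = 24 * 60 * 60  # illustrative policy: re-fetch robots.txt once per day

    session = requests.Session()
    rp = RobotFileParser(session, url="https://example.com/robots.txt")
    rp.read()  # assumed fetch/parse step

    # Re-read the robots.txt file when the last fetch has grown stale.
    if time.time() - rp.mtime() > MAX_AGE:
        rp.read()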