linkcheck.checker.urlbase
Base URL handler.
Functions
- Wrapper for url.url_norm() that converts UnicodeError into LinkCheckerError.
- If url is relative, join parent and url.
Classes
- CompactUrlData: Store selected UrlData attributes in slots to minimize memory usage.
- UrlBase: A URL with additional information such as validity.
- class linkcheck.checker.urlbase.CompactUrlData(wired_url_data)[source]
Bases: object
Store selected UrlData attributes in slots to minimize memory usage.
Set all attributes according to the dictionary wired_url_data.
- base_ref
- base_url
- cache_url
- checktime
- column
- content_type
- dltime
- domain
- extern
- info
- level
- line
- modified
- name
- page
- parent_url
- result
- size
- title
- url
- valid
- warnings
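The slot-based storage described above can be illustrated with a small stand-alone class. This is a minimal sketch, not the library's code: the class name `CompactRecord` and the reduced set of four slots are hypothetical, chosen only to show how attributes are copied from a dictionary into `__slots__`.

```python
class CompactRecord:
    """Hypothetical stand-in for a slots-based record (not the real class)."""

    # Declaring __slots__ avoids a per-instance __dict__, which saves memory
    # when many records are kept around.
    __slots__ = ("url", "valid", "result", "warnings")

    def __init__(self, wired):
        # Copy only the declared slots out of the dictionary.
        for key in self.__slots__:
            setattr(self, key, wired[key])


record = CompactRecord(
    {
        "url": "http://example.com/",
        "valid": True,
        "result": "200 OK",
        "warnings": [],
    }
)
print(record.url, record.valid)
```

Because the instance has no `__dict__`, assigning an attribute that is not listed in `__slots__` raises AttributeError.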
- class linkcheck.checker.urlbase.UrlBase(base_url, recursion_level, aggregate, parent_url=None, base_ref=None, line=-1, column=-1, page=-1, name='', url_encoding=None, extern=None)[source]
Bases: object
A URL with additional information such as validity.
Initialize check data, and store given variables.
- Parameters:
base_url – unquoted and possibly unnormed url
recursion_level – recursion level at which the base url is checked
aggregate – aggregate instance
parent_url – quoted and normed url of parent or None
base_ref – quoted and normed url of <base href=""> or None
line – line number of url in parent content
column – column number of url in parent content
page – page number of url in parent content
name – name of url or empty
url_encoding – encoding of URL or None
extern – None or (is_extern, is_strict)
- add_url(url, line=0, column=0, page=0, name='', base=None, parent=None)[source]
Add new URL to queue.
- build_url()[source]
Construct self.url and self.urlparts out of the given base url information self.base_url, self.parent_url and self.base_ref.
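As a rough illustration of how a URL could be built from these three inputs, the sketch below uses only the standard library. The helper name `resolve` is hypothetical, and the precedence shown (base_ref before parent_url) is an assumption for the example, not something this page states.

```python
from urllib.parse import urljoin, urlsplit


def resolve(base_url, parent_url=None, base_ref=None):
    """Hypothetical helper: make base_url absolute using the available context."""
    # Assumption: an HTML <base href> wins over the parent document URL.
    if base_ref:
        return urljoin(base_ref, base_url)
    if parent_url:
        return urljoin(parent_url, base_url)
    return base_url


url = resolve("img/logo.png", parent_url="http://example.com/docs/index.html")
urlparts = urlsplit(url)
print(url)
print(urlparts.netloc, urlparts.path)
```

An already absolute base_url passes through `urljoin` unchanged, so the same helper covers both relative and absolute links.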
- build_url_parts(url)[source]
Set userinfo, host, port and anchor from url and return urlparts. Also checks for obfuscated IP addresses.
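The pieces named here can be pulled out of a URL with `urllib.parse.urlsplit`. The following is a sketch under stated assumptions: `split_parts` and `decode_numeric_host` are hypothetical helpers, and treating an all-decimal host as an obfuscated IPv4 address is only one possible heuristic, not necessarily the check this method performs.

```python
import ipaddress
from urllib.parse import urlsplit


def split_parts(url):
    """Hypothetical helper: extract userinfo, host, port and anchor."""
    parts = urlsplit(url)
    # Everything before the last "@" in the network location is userinfo.
    userinfo = parts.netloc.rpartition("@")[0]
    return {
        "userinfo": userinfo,
        "host": parts.hostname,
        "port": parts.port,
        "anchor": parts.fragment,
    }


def decode_numeric_host(host):
    """Return the dotted IPv4 form if host is a plain decimal integer, else None."""
    if host and host.isdecimal() and int(host) < 2**32:
        return str(ipaddress.ip_address(int(host)))
    return None


info = split_parts("http://user:secret@example.com:8080/path?q=1#top")
print(info)
print(decode_numeric_host(urlsplit("http://3232235777/").hostname))
```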
- check_connection()[source]
The basic connection check uses urlopen to initialize a connection object.
- check_syntax()[source]
Called before self.check(), this function inspects the url syntax. Success enables further checking; failure immediately logs this url. Syntax checks must not use any network resources.
- get_intern_pattern(url=None)[source]
Get pattern for intern URL matching.
- Parameters:
url (unicode or None) – the URL to set intern pattern for, else self.url
- Returns:
non-empty regex pattern or None
- Return type:
String or None
- get_title()[source]
Return the title of the page the URL refers to. By default this is the filename or the URL.
- get_user_password()[source]
Get tuple (user, password) from configured authentication. Both user and password can be None.
- init(base_ref, base_url, parent_url, recursion_level, aggregate, line, column, page, name, url_encoding, extern)[source]
Initialize internal data.
- read_content_chunk()[source]
Read one chunk of content from this URL. Precondition: url_connection is an opened URL.
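Chunked reading of this kind can be sketched with any file-like object. In the example below an in-memory `io.BytesIO` stands in for the opened connection, and the generator `read_chunks` is a hypothetical helper; the chunk size mirrors the ReadChunkBytes value of 16384 listed on this page.

```python
import io

READ_CHUNK_BYTES = 16384  # same value as ReadChunkBytes on this page


def read_chunks(stream, chunk_size=READ_CHUNK_BYTES):
    """Hypothetical helper: yield successive chunks until the stream is empty."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# An in-memory stream stands in for an opened URL connection.
stream = io.BytesIO(b"x" * 40000)
sizes = [len(chunk) for chunk in read_chunks(stream)]
print(sizes)
```

Reading in fixed-size chunks keeps memory use bounded and lets a caller stop early, for instance once a size limit is exceeded.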
- set_extern(url)[source]
Match URL against extern and intern link patterns. If no pattern matches, the URL is extern. Sets self.extern to a tuple (bool, bool) with content (is_extern, is_strict).
- Returns:
None
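The matching rule can be sketched as a pure function that returns the (is_extern, is_strict) tuple instead of storing it. The function name `classify`, the shape of the pattern lists, the order in which extern and intern patterns are tried, and the is_strict value used for the fallback are all assumptions made for this example.

```python
import re


def classify(url, extern_patterns, intern_patterns):
    """Hypothetical helper returning (is_extern, is_strict) for a URL.

    extern_patterns: list of (compiled regex, is_strict) pairs
    intern_patterns: list of compiled regexes
    """
    for pattern, is_strict in extern_patterns:
        if pattern.search(url):
            return (True, is_strict)
    for pattern in intern_patterns:
        if pattern.search(url):
            return (False, False)
    # No pattern matched: the URL counts as extern.
    return (True, False)


extern = [(re.compile(r"^https?://ads\."), True)]
intern = [re.compile(r"^https?://(www\.)?example\.com/")]

print(classify("http://www.example.com/about", extern, intern))
print(classify("http://ads.example.net/banner", extern, intern))
print(classify("http://unrelated.org/", extern, intern))
```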
- to_wire_dict()[source]
Return a simplified transport object for logging and caching.
The transport object must contain these attributes:
url_data.valid (bool) – Indicates if URL is valid
url_data.result (unicode) – Result string
url_data.warnings (list of (tag, warning message) tuples) – List of tagged warnings for this URL.
url_data.name (unicode or None) – Name of URL (e.g. filename or link name)
url_data.parent_url (unicode or None) – Parent URL
url_data.base_ref (unicode) – HTML base reference URL of parent
url_data.url (unicode) – Fully qualified URL.
url_data.domain (unicode) – URL domain part.
url_data.checktime (int) – Number of seconds needed to check this link, default: zero.
url_data.dltime (int) – Number of seconds needed to download URL content, default: -1
url_data.size (int) – Size of downloaded URL content, default: -1
url_data.info (list of unicode) – Additional information about this URL.
url_data.line (int) – Line number of this URL in the parent document, or None
url_data.column (int) – Column number of this URL in the parent document, or None
url_data.page (int) – Page number of this URL in the parent document, or -1
url_data.cache_url (unicode) – Cache url for this URL.
url_data.content_type (unicode) – MIME content type for URL content.
url_data.level (int) – Recursion level until reaching this URL from start URL
url_data.last_modified (datetime) – Last modification date of retrieved page (or None).
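To make the shape of such a transport object concrete, here is a sketch that builds a plain dictionary carrying a few of the attributes listed above. The helper `make_wire_dict` is hypothetical; only the defaults for checktime, dltime, size and page are taken from the list, and the empty warnings and info lists are an assumption for the example.

```python
# Defaults taken from the attribute list above.
WIRE_DEFAULTS = {"checktime": 0, "dltime": -1, "size": -1, "page": -1}


def make_wire_dict(url, valid, result, **extra):
    """Hypothetical helper: build a simplified transport dictionary."""
    wired = dict(WIRE_DEFAULTS)
    # Assumption: warnings and info start out as empty lists.
    wired.update(url=url, valid=valid, result=result, warnings=[], info=[])
    # Caller-supplied values override the defaults.
    wired.update(extra)
    return wired


wired = make_wire_dict("http://example.com/", True, "200 OK", size=1024, level=1)
print(wired["size"], wired["dltime"], wired["level"])
```

A flat dictionary of simple values like this is easy to pass to loggers or to store in a cache without keeping the full URL object alive.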
- ContentMimetypes = {'application/msword': 'word', 'application/pdf': 'pdf', 'application/vnd.adobe.flash.movie': 'swf', 'application/x-httpd-php': 'html', 'application/x-pdf': 'pdf', 'application/x-plist+safari': 'safari', 'application/x-shockwave-flash': 'swf', 'application/xhtml+xml': 'html', 'application/xml+sitemap': 'sitemap', 'application/xml+sitemapindex': 'sitemapindex', 'text/css': 'css', 'text/html': 'html', 'text/plain+chromium': 'chromium', 'text/plain+linkchecker': 'text', 'text/plain+opera': 'opera', 'text/vnd.wap.wml': 'wml'}
- ReadChunkBytes = 16384
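A table like ContentMimetypes lends itself to a simple lookup from a Content-Type header value to a short content kind. The sketch below copies three entries from the mapping above; the helper `kind_for` and the way header parameters are stripped are assumptions for illustration, not a description of how the class uses the table.

```python
# A small excerpt of the ContentMimetypes mapping shown above.
CONTENT_MIMETYPES = {
    "text/html": "html",
    "application/pdf": "pdf",
    "text/css": "css",
}


def kind_for(content_type):
    """Hypothetical helper: map a Content-Type header value to a content kind."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return CONTENT_MIMETYPES.get(mime)


print(kind_for("text/html; charset=utf-8"))
print(kind_for("image/png"))
```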