linkcheck.checker.urlbase

Base URL handler.

Functions

`url_norm`(url, encoding)	Wrapper for url.url_norm() that converts UnicodeError into LinkCheckerError.
`urljoin`(parent, url)	If url is relative, join parent and url.

Classes

`CompactUrlData`(wired_url_data)	Store selected UrlData attributes in slots to minimize memory usage.
`UrlBase`(base_url, recursion_level, aggregate)	A URL with additional information such as validity.

class linkcheck.checker.urlbase.CompactUrlData(wired_url_data)[source]

Bases: object

Store selected UrlData attributes in slots to minimize memory usage.

Set all attributes according to the dictionary wired_url_data

base_ref

base_url

cache_url

checktime

column

content_type

dltime

domain

extern

info

level

line

modified

name

page

parent_url

result

size

title

url

valid

warnings
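
The slot-based storage described above can be illustrated with a minimal sketch. `CompactRecord` and its three fields are hypothetical stand-ins chosen for brevity; this is not the real class, whose attributes are the ones listed above.

```python
class CompactRecord:
    """Hypothetical stand-in for a slot-based record (not the real CompactUrlData)."""

    # __slots__ removes the per-instance __dict__, which saves memory when
    # many instances are alive at the same time.
    __slots__ = ("url", "valid", "result")

    def __init__(self, wired_data):
        # Copy every declared slot from the dictionary.
        for name in self.__slots__:
            setattr(self, name, wired_data[name])


record = CompactRecord(
    {"url": "http://example.com/", "valid": True, "result": "200 OK"}
)
```

Because the instance has no `__dict__`, assigning an attribute that is not declared in `__slots__` raises `AttributeError`.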

class linkcheck.checker.urlbase.UrlBase(base_url, recursion_level, aggregate, parent_url=None, base_ref=None, line=-1, column=-1, page=-1, name='', url_encoding=None, extern=None)[source]

Bases: object

A URL with additional information such as validity.

Initialize check data, and store given variables.

Parameters:

base_url – unquoted and possibly unnormed url
recursion_level – recursion level at which the base url is checked
aggregate – aggregate instance
parent_url – quoted and normed url of parent or None
base_ref – quoted and normed url of <base href=""> or None
line – line number of url in parent content
column – column number of url in parent content
page – page number of url in parent content
name – name of url or empty
url_encoding – encoding of URL or None
extern – None or (is_extern, is_strict)

add_info(s)[source]: Add an info string.

add_intern_pattern(url=None)[source]: Add intern URL regex to config.

add_size_info()[source]: Set size of URL content (if any). Should be overridden in subclasses.

add_url(url, line=0, column=0, page=0, name='', base=None, parent=None)[source]: Add new URL to queue.

add_warning(s, tag=None)[source]: Add a warning string.

allows_recursion()[source]: Return True iff we can recurse into the url’s content.

allows_simple_recursion()[source]: Check recursion level and extern status.

build_url()[source]: Construct self.url and self.urlparts out of the given base url information self.base_url, self.parent_url and self.base_ref.

build_url_parts(url)[source]: Set userinfo, host, port and anchor from url and return urlparts. Also checks for obfuscated IP addresses.

can_get_content()[source]: Indicate whether get_content() can be called for this URL.

check()[source]: Main check function for checking this URL.

check_connection()[source]: The basic connection check uses urlopen to initialize a connection object.

check_content()[source]: Check content of URL. Returns True if content can be parsed, else False.

check_obfuscated_ip()[source]: Warn if the host of this URL is an obfuscated IP address.

check_syntax()[source]: Called before self.check(), this function inspects the url syntax. Success enables further checking, failure immediately logs this url. Syntax checks must not use any network resources.

check_url_warnings()[source]: Check URL name and length.

close_connection()[source]: Close an opened url connection.

content_allows_robots()[source]: Returns True; robots.txt is only checked on HTTP links.

download_content()[source]

get_content(encoding=None)[source]

get_intern_pattern(url=None)[source]

Get pattern for intern URL matching.

Parameters: url (unicode or None) – the URL to set intern pattern for, else self.url
Returns: non-empty regex pattern or None
Return type: String or None

get_raw_content()[source]

get_soup()[source]

get_title()[source]: Return title of the page the URL refers to. By default this is the filename or the URL.

get_user_password()[source]: Get tuple (user, password) from configured authentication. Both user and password can be None.

handle_exception()[source]: An exception occurred. Log it and set the cache flag.

init(base_ref, base_url, parent_url, recursion_level, aggregate, line, column, page, name, url_encoding, extern)[source]: Initialize internal data.

is_content_type_parseable()[source]: Return True iff the content type of this url is parseable.

is_css()[source]: Return True iff content of this url is CSS stylesheet.

is_directory()[source]: Return True if current URL represents a directory.

is_file()[source]: Return True for file:// URLs.

is_html()[source]: Return True iff content of this url is HTML formatted.

is_http()[source]: Return True for http:// or https:// URLs.

is_local()[source]: Return True for local (i.e. file://) URLs.

is_parseable()[source]: Return True iff content of this url is parseable.

local_check()[source]: Local check function can be overridden in subclasses.

read_content()[source]: Return data for this URL. Can be overridden in subclasses.

read_content_chunk()[source]: Read one chunk of content from this URL. Precondition: url_connection is an opened URL.

reset()[source]: Reset all variables to default values.

serialized(sep='\n')[source]: Return serialized url check data as unicode string.

set_cache_url()[source]: Set the URL to be used for caching.

set_content_type()[source]: Set content MIME type. Should be overridden in subclasses.

set_extern(url)[source]

Match URL against extern and intern link patterns. If no pattern matches, the URL is extern. Sets self.extern to a tuple (bool, bool) with content (is_extern, is_strict).

Returns: None
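
The matching order described above can be sketched roughly as follows. This is an illustration, not the library's code: `classify_extern`, the shape of the two pattern lists, and the strictness chosen for the fallback case are all assumptions.

```python
import re


def classify_extern(url, extern_patterns, intern_patterns):
    """Return (is_extern, is_strict) for url.

    extern_patterns: list of (compiled_regex, is_strict) pairs (assumed shape).
    intern_patterns: list of compiled regexes (assumed shape).
    """
    for pattern, is_strict in extern_patterns:
        if pattern.search(url):
            return (True, is_strict)
    for pattern in intern_patterns:
        if pattern.search(url):
            return (False, False)
    # No pattern matched: the URL is extern. Using is_strict=False here is
    # an assumption made for this sketch.
    return (True, False)
```

A caller would store the returned tuple instead of receiving it, since set_extern itself returns None.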

set_result(msg, valid=True, overwrite=False)[source]: Set result string and validity.

should_ignore_warning(tag)[source]: Return True if a warning should be ignored.

to_wire()[source]: Return compact UrlData object with information from to_wire_dict().

to_wire_dict()[source]

Return a simplified transport object for logging and caching.

The transport object must contain these attributes:

url_data.valid (bool) – Indicates if URL is valid.
url_data.result (unicode) – Result string.
url_data.warnings (list of tuples (tag, warning message)) – List of tagged warnings for this URL.
url_data.name (unicode string or None) – Name of URL (e.g. filename or link name).
url_data.parent_url (unicode or None) – Parent URL.
url_data.base_ref (unicode) – HTML base reference URL of parent.
url_data.url (unicode) – Fully qualified URL.
url_data.domain (unicode) – URL domain part.
url_data.checktime (int) – Number of seconds needed to check this link, default: zero.
url_data.dltime (int) – Number of seconds needed to download URL content, default: -1.
url_data.size (int) – Size of downloaded URL content, default: -1.
url_data.info (list of unicode) – Additional information about this URL.
url_data.line (int) – Line number of this URL at parent document, or None.
url_data.column (int) – Column number of this URL at parent document, or None.
url_data.page (int) – Page number of this URL at parent document, or -1.
url_data.cache_url (unicode) – Cache url for this URL.
url_data.content_type (unicode) – MIME content type for URL content.
url_data.level (int) – Recursion level until reaching this URL from start URL.
url_data.last_modified (datetime) – Last modification date of retrieved page (or None).

ContentMimetypes = {'application/msword': 'word', 'application/pdf': 'pdf', 'application/vnd.adobe.flash.movie': 'swf', 'application/x-httpd-php': 'html', 'application/x-pdf': 'pdf', 'application/x-plist+safari': 'safari', 'application/x-shockwave-flash': 'swf', 'application/xhtml+xml': 'html', 'application/xml+sitemap': 'sitemap', 'application/xml+sitemapindex': 'sitemapindex', 'text/css': 'css', 'text/html': 'html', 'text/plain+chromium': 'chromium', 'text/plain+linkchecker': 'text', 'text/plain+opera': 'opera', 'text/vnd.wap.wml': 'wml'}
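
How the class consults this table is not stated on this page. As a hedged illustration only, a lookup from a Content-Type header value to one of the short type names might look like the sketch below; `CONTENT_MIMETYPES` is a subset of the table above and `parser_key` is a hypothetical helper.

```python
# A subset of the ContentMimetypes table shown above.
CONTENT_MIMETYPES = {
    "application/pdf": "pdf",
    "application/xhtml+xml": "html",
    "text/css": "css",
    "text/html": "html",
}


def parser_key(content_type):
    """Return the short type name for a MIME type, or None if it is unknown."""
    # Drop parameters such as "; charset=utf-8" and normalise case
    # before the dictionary lookup.
    mime = content_type.split(";", 1)[0].strip().lower()
    return CONTENT_MIMETYPES.get(mime)
```

A return value of None would then mean that no entry exists for that content type.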

ReadChunkBytes = 16384
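
Reading in fixed-size chunks, as read_content_chunk() suggests, can be sketched as below. This is a generic pattern under stated assumptions: `read_in_chunks` and its `max_bytes` limit are hypothetical, and an in-memory stream stands in for an opened URL connection.

```python
import io

READ_CHUNK_BYTES = 16384  # same value as ReadChunkBytes above


def read_in_chunks(stream, max_bytes=None):
    """Read stream chunk by chunk until EOF; fail once max_bytes is exceeded."""
    buf = bytearray()
    while True:
        chunk = stream.read(READ_CHUNK_BYTES)
        if not chunk:
            break  # an empty read signals end of stream
        buf += chunk
        if max_bytes is not None and len(buf) > max_bytes:
            raise ValueError("content larger than max_bytes")
    return bytes(buf)


data = read_in_chunks(io.BytesIO(b"x" * 40000))
```

Checking the limit after each chunk means an oversized download can be abandoned early instead of being read in full.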

linkcheck.checker.urlbase.url_norm(url, encoding)[source]: Wrapper for url.url_norm() that converts UnicodeError into LinkCheckerError.
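
The exception conversion this wrapper performs follows a common pattern, sketched here with hypothetical names only: `CheckerError` stands in for LinkCheckerError and `normalize` for the wrapped normaliser, neither of which is reproduced from the library.

```python
class CheckerError(Exception):
    """Hypothetical application error, standing in for LinkCheckerError."""


def normalize(url):
    """Hypothetical normaliser that accepts only pure-ASCII URLs."""
    url.encode("ascii")  # raises UnicodeEncodeError, a UnicodeError subclass
    return url.strip()


def normalize_or_fail(url):
    """Call normalize() and convert UnicodeError into CheckerError."""
    try:
        return normalize(url)
    except UnicodeError as exc:
        # Chain the original exception so the cause stays visible.
        raise CheckerError(f"could not normalise URL {url!r}: {exc}") from exc
```

Callers then only need to handle one application-level exception type.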

linkcheck.checker.urlbase.urljoin(parent, url)[source]

If url is relative, join parent and url. Otherwise leave url unchanged.

Returns: joined url
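
The behaviour described for urljoin can be approximated with the standard library. This sketch is not the module's implementation: `join_if_relative` is a hypothetical name, and treating "has a scheme" as the test for an absolute URL is an assumption.

```python
from urllib.parse import urljoin as std_urljoin, urlsplit


def join_if_relative(parent, url):
    """Join url onto parent unless url already carries a scheme."""
    if urlsplit(url).scheme:
        return url  # absolute URL: leave unchanged
    return std_urljoin(parent, url)
```

For example, joining `c.html` onto `http://example.com/a/b.html` resolves against the parent's directory, while a URL that starts with `https://` is returned untouched.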