linkcheck.checker.urlbase

Base URL handler.

Functions

url_norm(url, encoding)

Wrapper for url.url_norm() that converts UnicodeError into LinkCheckerError.

urljoin(parent, url)

If url is relative, join parent and url.

Classes

CompactUrlData(wired_url_data)

Store selected UrlData attributes in slots to minimize memory usage.

UrlBase(base_url, recursion_level, aggregate)

A URL with additional information such as validity.

class linkcheck.checker.urlbase.CompactUrlData(wired_url_data)[source]

Bases: object

Store selected UrlData attributes in slots to minimize memory usage.

Set all attributes according to the dictionary wired_url_data.

base_ref
base_url
cache_url
checktime
column
content_type
dltime
domain
extern
info
level
line
modified
name
page
parent_url
result
size
title
url
valid
warnings
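
The slot-based storage described above can be sketched as follows. This is a minimal illustration, not the linkcheck implementation: the class name CompactRecord and its reduced attribute list are assumptions made for the example.

```python
class CompactRecord:
    """Hypothetical stand-in for a slots-based record (not the linkcheck class)."""

    # Declaring __slots__ removes the per-instance __dict__, which saves
    # memory when many records are kept alive at the same time.
    __slots__ = ("url", "valid", "result", "warnings")

    def __init__(self, wired_url_data):
        # Copy every declared attribute from the dictionary.
        for name in self.__slots__:
            setattr(self, name, wired_url_data[name])


record = CompactRecord(
    {"url": "http://example.com/", "valid": True, "result": "200 OK", "warnings": []}
)
print(record.url, record.valid)
```

Because the class defines __slots__, assigning an attribute that is not listed raises AttributeError, which also guards against misspelled field names.
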
class linkcheck.checker.urlbase.UrlBase(base_url, recursion_level, aggregate, parent_url=None, base_ref=None, line=-1, column=-1, page=-1, name='', url_encoding=None, extern=None)[source]

Bases: object

A URL with additional information such as validity.

Initialize check data, and store given variables.

Parameters
  • base_url – unquoted and possibly unnormed url

  • recursion_level – recursion level at which the base url is checked

  • aggregate – aggregate instance

  • parent_url – quoted and normed url of parent or None

  • base_ref – quoted and normed url of <base href=""> or None

  • line – line number of url in parent content

  • column – column number of url in parent content

  • page – page number of url in parent content

  • name – name of url or empty

  • url_encoding – encoding of URL or None

  • extern – None or (is_extern, is_strict)

add_info(s)[source]

Add an info string.

add_intern_pattern(url=None)[source]

Add intern URL regex to config.

add_size_info()[source]

Set size of URL content (if any). Should be overridden in subclasses.

add_url(url, line=0, column=0, page=0, name='', base=None)[source]

Add new URL to queue.

add_warning(s, tag=None)[source]

Add a warning string.

allows_recursion()[source]

Return True iff we can recurse into the url’s content.

allows_simple_recursion()[source]

Check recursion level and extern status.
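
A check on recursion level and extern status could look roughly like the sketch below. The function signature, the max_level parameter, and the convention that a negative limit means "unlimited" are assumptions for illustration; the extern tuple follows the (is_extern, is_strict) layout documented for the constructor.

```python
def allows_simple_recursion(recursion_level, max_level, extern):
    """Hypothetical helper: decide whether a URL's content may be followed.

    max_level < 0 is assumed to mean "no limit"; extern is a
    (is_extern, is_strict) tuple.
    """
    # Stop once the configured depth has been reached.
    if max_level >= 0 and recursion_level >= max_level:
        return False
    # Never recurse into URLs classified as external.
    is_extern, _is_strict = extern
    return not is_extern


print(allows_simple_recursion(0, 2, (False, False)))
```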

build_url()[source]

Construct self.url and self.urlparts out of the given base url information self.base_url, self.parent_url and self.base_ref.

build_url_parts(url)[source]

Set userinfo, host, port and anchor from url and return urlparts. Also checks for obfuscated IP addresses.
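
To illustrate which pieces are meant by userinfo, host, port and anchor, here is a small sketch built on the standard library's urllib.parse.urlsplit. The helper name split_url and the dictionary it returns are assumptions for the example and do not reflect how UrlBase stores these values.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Hypothetical helper: extract userinfo, host, port and anchor from a URL."""
    parts = urlsplit(url)
    userinfo = None
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo += ":" + parts.password
    return {
        "userinfo": userinfo,
        "host": parts.hostname,   # lowercased host name, or None
        "port": parts.port,       # int, or None if not given
        "anchor": parts.fragment, # text after '#', or '' if absent
    }


info = split_url("http://user:secret@example.com:8080/path?q=1#top")
print(info)
```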

can_get_content()[source]

Indicate whether get_content() can be called on this url.

check()[source]

Main check function for checking this URL.

check_connection()[source]

The basic connection check uses urlopen to initialize a connection object.

check_content()[source]

Check content of URL.

Returns

True if content can be parsed, else False

check_obfuscated_ip()[source]

Warn if host of this URL is an obfuscated IP address.
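
One common form of obfuscation writes an IPv4 address as a single integer, for example http://3232235777/ instead of http://192.168.1.1/. The sketch below detects only that form and is an assumption about what "obfuscated" covers; the helper name deobfuscate_host is hypothetical and the real check may recognize other encodings.

```python
import ipaddress


def deobfuscate_host(host):
    """Return the dotted IPv4 form if host is a bare integer, else None."""
    try:
        # Base 0 accepts decimal ("3232235777") and hex ("0xC0A80101") literals.
        number = int(host, 0)
    except ValueError:
        return None
    # Only values that fit in 32 bits can be IPv4 addresses.
    if not 0 <= number <= 0xFFFFFFFF:
        return None
    return str(ipaddress.IPv4Address(number))


print(deobfuscate_host("3232235777"))  # 192.168.1.1
```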

check_syntax()[source]

Called before self.check(), this function inspects the url syntax. Success enables further checking, failure immediately logs this url. Syntax checks must not use any network resources.

check_url_warnings()[source]

Check URL name and length.

close_connection()[source]

Close an opened url connection.

content_allows_robots()[source]

Return True, since robots.txt is only checked on HTTP links.

download_content()[source]
get_content(encoding=None)[source]
get_intern_pattern(url=None)[source]

Get pattern for intern URL matching.

Parameters

url (unicode or None) – the URL to set intern pattern for, else self.url

Returns

non-empty regex pattern or None

Return type

String or None

get_raw_content()[source]
get_soup()[source]
get_title()[source]

Return title of page the URL refers to. By default this is the filename or the URL.

get_user_password()[source]

Get tuple (user, password) from configured authentication. Both user and password can be None.

handle_exception()[source]

An exception occurred. Log it and set the cache flag.

init(base_ref, base_url, parent_url, recursion_level, aggregate, line, column, page, name, url_encoding, extern)[source]

Initialize internal data.

is_css()[source]

Return True iff content of this url is a CSS stylesheet.

is_directory()[source]

Return True if current URL represents a directory.

is_file()[source]

Return True for file:// URLs.

is_html()[source]

Return True iff content of this url is HTML formatted.

is_http()[source]

Return True for http:// or https:// URLs.

is_local()[source]

Return True for local (i.e. file://) URLs.

is_parseable()[source]

Return True iff content of this url is parseable.

local_check()[source]

Local check function can be overridden in subclasses.

read_content()[source]

Return data for this URL. Can be overridden in subclasses.

read_content_chunk()[source]

Read one chunk of content from this URL. Precondition: url_connection is an opened URL.
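
Reading content one chunk at a time can be sketched as below. The chunk size mirrors the documented ReadChunkBytes value of 16384; the function read_all, its max_bytes parameter, and the use of an in-memory stream in place of an opened URL connection are assumptions for the example.

```python
import io

READ_CHUNK_BYTES = 16384  # same value as the documented ReadChunkBytes


def read_all(stream, max_bytes=None):
    """Hypothetical helper: read a binary stream chunk by chunk."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(READ_CHUNK_BYTES)
        if not chunk:
            break  # an empty read signals end of stream
        total += len(chunk)
        if max_bytes is not None and total > max_bytes:
            raise ValueError("content exceeds size limit")
        chunks.append(chunk)
    return b"".join(chunks)


data = read_all(io.BytesIO(b"x" * 40000))
print(len(data))  # 40000
```

Reading in fixed-size chunks keeps memory use bounded per read and lets the caller abort early once a size limit is exceeded.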

reset()[source]

Reset all variables to default values.

serialized(sep='\n')[source]

Return serialized url check data as unicode string.

set_cache_url()[source]

Set the URL to be used for caching.

set_content_type()[source]

Set content MIME type. Should be overridden in subclasses.

set_extern(url)[source]

Match URL against extern and intern link patterns. If no pattern matches, the URL is extern. Sets self.extern to a tuple (bool, bool) with content (is_extern, is_strict).

Returns

None
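
The matching logic can be pictured with the following sketch. It returns the tuple instead of setting self.extern, and the function name classify, the shape of the pattern lists, and the order in which extern and intern patterns are tried are all assumptions for illustration.

```python
import re


def classify(url, extern_patterns, intern_patterns):
    """Hypothetical helper returning (is_extern, is_strict) for a URL.

    extern_patterns: list of (compiled_regex, is_strict) pairs
    intern_patterns: list of compiled regexes
    """
    for pattern, is_strict in extern_patterns:
        if pattern.search(url):
            return (True, is_strict)
    for pattern in intern_patterns:
        if pattern.search(url):
            return (False, False)
    # No pattern matched: treat the URL as extern, as documented above.
    return (True, False)


extern = [(re.compile(r"^https?://ads\."), True)]
intern = [(re.compile(r"^https?://example\.com/"))]
print(classify("http://example.com/page.html", extern, intern))  # (False, False)
```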

set_result(msg, valid=True, overwrite=False)[source]

Set result string and validity.

to_wire()[source]

Return compact UrlData object with information from to_wire_dict().

to_wire_dict()[source]

Return a simplified transport object for logging and caching.

The transport object must contain these attributes:

  • url_data.valid: bool Indicates if URL is valid

  • url_data.result: unicode Result string

  • url_data.warnings: list of tuples (tag, warning message) List of tagged warnings for this URL.

  • url_data.name: unicode string or None name of URL (e.g. filename or link name)

  • url_data.parent_url: unicode or None Parent URL

  • url_data.base_ref: unicode HTML base reference URL of parent

  • url_data.url: unicode Fully qualified URL.

  • url_data.domain: unicode URL domain part.

  • url_data.checktime: int Number of seconds needed to check this link, default: zero.

  • url_data.dltime: int Number of seconds needed to download URL content, default: -1

  • url_data.size: int Size of downloaded URL content, default: -1

  • url_data.info: list of unicode Additional information about this URL.

  • url_data.line: int Line number of this URL at parent document, or None

  • url_data.column: int Column number of this URL at parent document, or None

  • url_data.page: int Page number of this URL at parent document, or -1

  • url_data.cache_url: unicode Cache url for this URL.

  • url_data.content_type: unicode MIME content type for URL content.

  • url_data.level: int Recursion level until reaching this URL from start URL

  • url_data.last_modified: datetime Last modification date of retrieved page (or None).

ContentMimetypes = {'application/msword': 'word', 'application/pdf': 'pdf', 'application/x-httpd-php': 'html', 'application/x-pdf': 'pdf', 'application/x-plist+safari': 'safari', 'application/x-shockwave-flash': 'swf', 'application/xhtml+xml': 'html', 'application/xml+sitemap': 'sitemap', 'application/xml+sitemapindex': 'sitemapindex', 'text/css': 'css', 'text/html': 'html', 'text/plain+chromium': 'chromium', 'text/plain+linkchecker': 'text', 'text/plain+opera': 'opera', 'text/vnd.wap.wml': 'wml'}
ReadChunkBytes = 16384
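
A table such as ContentMimetypes is typically consulted with the MIME type taken from a Content-Type header. The sketch below uses a four-entry subset of the mapping shown above; the helper name content_kind and the header handling (dropping parameters, lowercasing) are assumptions rather than documented behavior.

```python
# Subset of the documented ContentMimetypes mapping.
CONTENT_MIMETYPES = {
    "text/html": "html",
    "application/xhtml+xml": "html",
    "text/css": "css",
    "application/pdf": "pdf",
}


def content_kind(content_type_header):
    """Hypothetical helper: map a Content-Type header value to a short kind name."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type_header.split(";", 1)[0].strip().lower()
    return CONTENT_MIMETYPES.get(mime)


print(content_kind("text/html; charset=utf-8"))  # html
```
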
linkcheck.checker.urlbase.url_norm(url, encoding)[source]

Wrapper for url.url_norm() that converts UnicodeError into LinkCheckerError.
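
The exception-translation pattern behind such a wrapper can be sketched as follows. Both the LinkCheckerError class and the normalize function here are stand-ins defined for the example; they are not the package's own implementations.

```python
class LinkCheckerError(Exception):
    """Hypothetical stand-in for the package's error type."""


def normalize(url, encoding):
    """Stand-in normalizer: may raise UnicodeError for unencodable input."""
    return url.strip().encode(encoding).decode(encoding)


def url_norm_sketch(url, encoding):
    """Call the normalizer and re-raise Unicode problems as LinkCheckerError."""
    try:
        return normalize(url, encoding)
    except UnicodeError as err:
        # Chain the original exception so the root cause stays visible.
        raise LinkCheckerError(f"invalid URL encoding: {err}") from err


print(url_norm_sketch(" http://example.com/ ", "ascii"))  # http://example.com/
```

Callers then need to handle only one error type, regardless of whether encoding or decoding failed.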

linkcheck.checker.urlbase.urljoin(parent, url)[source]

If url is relative, join parent and url. Otherwise leave url as-is.

Returns

joined url
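
The documented behavior can be approximated with the standard library, as in the sketch below. The helper name join_if_relative and the rule "a URL with a scheme counts as absolute" are assumptions; the module's own test for relative URLs may differ.

```python
from urllib.parse import urljoin as std_urljoin, urlsplit


def join_if_relative(parent, url):
    """Hypothetical helper: resolve url against parent only if url is relative."""
    if urlsplit(url).scheme:
        # A scheme such as "http" or "mailto" marks the URL as absolute.
        return url
    return std_urljoin(parent, url)


print(join_if_relative("http://example.com/a/b.html", "c.html"))
# http://example.com/a/c.html
```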