linkcheck.checker.httpurl

Handle http links.

Classes

HttpUrl(base_url, recursion_level, aggregate)

Url link with http scheme.

class linkcheck.checker.httpurl.HttpUrl(base_url, recursion_level, aggregate, parent_url=None, base_ref=None, line=-1, column=-1, page=-1, name='', url_encoding=None, extern=None)[source]

Bases: InternPatternUrl

Url link with http scheme.

Initialize check data, and store given variables.

Parameters:
  • base_url – unquoted and possibly unnormed url

  • recursion_level – recursion level at which the base url is checked

  • aggregate – aggregate instance

  • parent_url – quoted and normed url of parent or None

  • base_ref – quoted and normed url of <base href=""> or None

  • line – line number of url in parent content

  • column – column number of url in parent content

  • page – page number of url in parent content

  • name – name of url or empty

  • url_encoding – encoding of URL or None

  • extern – None or (is_extern, is_strict)

add_size_info()[source]

Get size of URL content from HTTP header.
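As a rough illustration of reading a size from response headers, the sketch below parses a Content-Length value from a plain dict. The function name size_from_headers and the -1 "unknown" sentinel are assumptions for this example, not part of the documented API.

```python
def size_from_headers(headers):
    """Return the Content-Length as an int, or -1 if missing or invalid.

    Hypothetical helper; `headers` is a plain dict of header names to values.
    """
    # Header names are case-insensitive, so normalize the keys first.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get("content-length")
    if value is None:
        return -1
    try:
        size = int(value)
    except ValueError:
        return -1
    # A negative length is not meaningful; treat it as unknown.
    return size if size >= 0 else -1
```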

allows_robots(url)[source]

Fetch and parse the robots.txt of the given url, then check whether LinkChecker is allowed to get the requested resource content.

Parameters:

url (string) – the url to be requested

Returns:

True if access is granted, otherwise False

Return type:

bool
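The kind of check described here can be sketched with the standard library's urllib.robotparser. This is only an illustration under stated assumptions: the helper name robots_allow, the sample rules, and the "LinkChecker" user agent string are made up for the example, and the rules are parsed from a list of lines rather than fetched over the network.

```python
from urllib.robotparser import RobotFileParser


def robots_allow(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url.

    Hypothetical helper; it does not perform any network access.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of robots.txt lines.
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Sample rules (assumed for illustration): everything under /private/ is off limits.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]
```

With these rules, robots_allow(rules, "LinkChecker", "http://example.com/index.html") grants access, while a URL under /private/ is denied.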

build_request()[source]

Build a prepared request object.

check_connection()[source]

Check a URL using the HTTP protocol.

The following excerpt from RFC 1945 describes the response code classes: The first digit of the Status-Code defines the class of response. The last two digits do not have any categorization role. There are 5 values for the first digit:

  • 1xx: Informational - Not used, but reserved for future use

  • 2xx: Success - The action was successfully received, understood, and accepted

  • 3xx: Redirection - Further action must be taken in order to complete the request

  • 4xx: Client Error - The request contains bad syntax or cannot be fulfilled

  • 5xx: Server Error - The server failed to fulfill an apparently valid request
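The classification above depends only on the first digit of the status code. A minimal sketch, assuming a hypothetical helper named status_class that is not part of this module:

```python
def status_class(status_code):
    """Map an HTTP status code to its response class by first digit.

    Hypothetical helper for illustration only.
    """
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    # Integer division by 100 isolates the first digit of a three-digit code.
    return classes.get(status_code // 100, "Unknown")
```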

check_response()[source]

Check final result and log it.

construct_auth()[source]

Construct HTTP Basic authentication credentials if there is user/password information available. Does not overwrite if credentials have already been constructed.
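For readers unfamiliar with HTTP Basic authentication, the sketch below shows how an Authorization header value is derived from a user and password, and how already constructed credentials can be left untouched. The function construct_basic_auth and its arguments are hypothetical; how this class actually stores credentials is not shown in this documentation.

```python
import base64


def construct_basic_auth(existing, user, password):
    """Return an HTTP Basic Authorization header value, or None.

    Hypothetical helper: `existing` stands for credentials built earlier.
    """
    # Do not overwrite credentials that have already been constructed.
    if existing is not None:
        return existing
    # Without user/password information there is nothing to construct.
    if user is None or password is None:
        return None
    raw = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"
```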

content_allows_robots()[source]

Return False if the content of this URL forbids robots to search for recursive links.
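One common way for content to forbid recursion is a robots meta tag with a nofollow directive. The sketch below detects that case with the standard library's html.parser; the names RobotsMetaParser and content_allows_following are assumptions for this example, and the actual method may apply different or additional rules.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical parser that records whether a robots meta tag forbids following links."""

    def __init__(self):
        super().__init__()
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        directives = {item.strip().lower() for item in content.split(",")}
        # "none" is shorthand for "noindex, nofollow".
        if "nofollow" in directives or "none" in directives:
            self.follow = False


def content_allows_following(html):
    """Return False if the HTML asks robots not to follow links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.follow
```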

follow_redirections(request)[source]

Follow all redirections of http response.

get_content()[source]

get_redirects(request)[source]

Return iterator of redirects for given request.

get_request_kwargs()[source]

Construct keyword parameters for Session.request() and Session.resolve_redirects().

get_robots_txt_url()[source]

Get the corresponding robots.txt URL for this URL.

Returns:

robots.txt URL

Return type:

string
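A robots.txt file lives at the root of a host, so its URL can be derived from the scheme and network location of any URL on that host. A minimal sketch using urllib.parse, with robots_txt_url as a hypothetical stand-in for the method:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(url):
    """Return the robots.txt URL for the host serving `url`.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Keep scheme and host (including port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```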

is_parseable()[source]

Check if content is parseable for recursion.

Returns:

True if content is parseable

Return type:

bool

is_redirect()[source]

Check if current response is a redirect.
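A redirect check typically combines a redirect status code with the presence of a Location header. The sketch below illustrates that idea; the set of status codes and the name looks_like_redirect are assumptions for this example, and the method itself may rely on the underlying HTTP library instead.

```python
# Status codes treated as redirects in this sketch (an assumption).
REDIRECT_STATUS_CODES = {301, 302, 303, 307, 308}


def looks_like_redirect(status_code, headers):
    """Return True if the response has a redirect status and a Location header.

    Hypothetical helper; `headers` is a plain dict of header names to values.
    """
    # Header names are case-insensitive.
    has_location = any(name.lower() == "location" for name in headers)
    return has_location and status_code in REDIRECT_STATUS_CODES
```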

Parse URLs in HTTP Link: headers.

read_content()[source]

Return data and data size for this URL. Can be overridden in subclasses.

reset()[source]

Initialize HTTP specific variables.

send_request(request)[source]

Send request and store response in self.url_connection.

set_content_type()[source]

Set MIME type from HTTP response headers.
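The MIME type is the part of the Content-Type header before any parameters such as charset. A minimal sketch, assuming a hypothetical helper mime_type_from_header and a fallback value chosen for this example:

```python
def mime_type_from_header(content_type, default="application/octet-stream"):
    """Extract the lowercase MIME type from a Content-Type header value.

    Hypothetical helper; the default value is an assumption of this sketch.
    """
    if not content_type:
        return default
    # Parameters such as "charset=UTF-8" follow the first semicolon.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime or default
```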

set_encoding(encoding)[source]

Set content encoding.