linkcheck.checker.httpurl
Handle http links.
Classes

HttpUrl | Url link with http scheme.
- class linkcheck.checker.httpurl.HttpUrl(base_url, recursion_level, aggregate, parent_url=None, base_ref=None, line=-1, column=-1, page=-1, name='', url_encoding=None, extern=None)[source]
Bases: InternPatternUrl
Url link with http scheme.
Initialize check data, and store given variables.
- Parameters:
base_url – unquoted and possibly unnormed url
recursion_level – on what check level lies the base url
aggregate – aggregate instance
parent_url – quoted and normed url of parent or None
base_ref – quoted and normed url of <base href=""> or None
line – line number of url in parent content
column – column number of url in parent content
page – page number of url in parent content
name – name of url or empty
url_encoding – encoding of URL or None
extern – None or (is_extern, is_strict)
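HttpUrl instances are usually not constructed by hand; here is a hedged sketch of obtaining one through the get_url_from() factory in linkcheck.checker, which selects HttpUrl for the http scheme. The configuration and director calls below reflect recent LinkChecker releases and may differ between versions::

    import linkcheck.checker
    import linkcheck.configuration
    import linkcheck.director

    # Build a default configuration and the aggregate driving a check run.
    config = linkcheck.configuration.Configuration()
    config.sanitize()
    aggregate = linkcheck.director.get_aggregate(config)

    # The factory picks HttpUrl for http URLs.
    url_data = linkcheck.checker.get_url_from(
        "http://example.com/",  # base_url: unquoted, possibly unnormed
        0,                      # recursion_level: 0 for a root URL
        aggregate,
    )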
- allows_robots(url)[source]
Fetch and parse the robots.txt of the given url. Checks whether LinkChecker is allowed to get the requested resource content.
- Parameters:
url (string) – the url to be requested
- Returns:
True if access is granted, otherwise False
- Return type:
bool
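For illustration, the same gate can be expressed with the standard library's urllib.robotparser; this sketches the concept, not LinkChecker's internal code::

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()  # fetch and parse robots.txt

    # True if a crawler identifying itself as "LinkChecker" may fetch
    # the requested resource, otherwise False.
    print(rp.can_fetch("LinkChecker", "http://example.com/private/page.html"))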
- check_connection()[source]
Check a URL with the HTTP protocol. Here is an excerpt from RFC 1945 describing the common response code classes: the first digit of the Status-Code defines the class of the response; the last two digits do not have any categorization role. There are five values for the first digit:
1xx: Informational - Not used, but reserved for future use
2xx: Success - The action was successfully received, understood, and accepted
3xx: Redirection - Further action must be taken in order to complete the request
4xx: Client Error - The request contains bad syntax or cannot be fulfilled
5xx: Server Error - The server failed to fulfill an apparently valid request
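A minimal sketch of this categorization (the mapping mirrors the RFC 1945 excerpt above, not LinkChecker's actual response handling)::

    STATUS_CLASSES = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }

    def classify(status_code):
        """Return the RFC 1945 response class for a status code."""
        return STATUS_CLASSES.get(status_code // 100, "Unknown")

    assert classify(200) == "Success"
    assert classify(301) == "Redirection"
    assert classify(404) == "Client Error"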
- construct_auth()[source]
Construct HTTP Basic authentication credentials if user/password information is available. Does not overwrite credentials that have already been constructed.
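Since get_request_kwargs() below targets requests.Session, HTTP Basic credentials amount to a (user, password) tuple on the session; a hedged sketch with hypothetical credentials::

    import requests

    session = requests.Session()
    if session.auth is None:  # do not overwrite existing credentials
        session.auth = ("user", "password")  # hypothetical values

    # Subsequent requests on this session send an Authorization header.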
- content_allows_robots()[source]
Return False if the content of this URL forbids robots from searching it for recursive links.
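For illustration, a regex-based sketch of such a meta robots check (an assumption for clarity, not LinkChecker's actual HTML parsing)::

    import re

    META_ROBOTS = re.compile(
        r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
        re.IGNORECASE,
    )

    def content_allows_robots(html):
        """Return False if a robots meta tag contains 'nofollow'."""
        match = META_ROBOTS.search(html)
        return not (match and "nofollow" in match.group(1).lower())

    assert content_allows_robots("<html><body>ok</body></html>")
    assert not content_allows_robots(
        '<meta name="robots" content="noindex, nofollow">'
    )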
- get_request_kwargs()[source]
Construct keyword parameters for Session.request() and Session.resolve_redirects().
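Session.resolve_redirects() identifies the session as a requests.Session; a hedged sketch of the kind of dictionary this method assembles (the exact keys LinkChecker sets may differ, but all of these are valid requests parameters)::

    def get_request_kwargs(timeout, verify_ssl=True, proxies=None):
        kwargs = {
            "timeout": timeout,        # connect/read timeout in seconds
            "allow_redirects": False,  # redirects are resolved explicitly
            "verify": verify_ssl,      # SSL certificate verification
        }
        if proxies:
            kwargs["proxies"] = proxies  # e.g. {"http": "http://proxy:3128"}
        return kwargs

    # Usage: response = session.request("GET", url, **get_request_kwargs(60))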
- get_robots_txt_url()[source]
Get the corresponding robots.txt URL for this URL.
- Returns:
robots.txt URL
- Return type:
string
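By convention robots.txt lives at the root of the URL's host; an illustrative sketch with the standard library (not LinkChecker's internal implementation)::

    from urllib.parse import urlsplit

    def get_robots_txt_url(url):
        """Return the robots.txt URL for the host of the given URL."""
        parts = urlsplit(url)
        return "%s://%s/robots.txt" % (parts.scheme, parts.netloc)

    assert (get_robots_txt_url("http://example.com/a/b.html?q=1")
            == "http://example.com/robots.txt")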