wrdrd.tools package

Submodules

wrdrd.tools.crawl module

class wrdrd.tools.crawl.CSS(loc, src)

Bases: tuple

loc

Alias for field number 0

src

Alias for field number 1
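
CSS and the similar classes below (CrawlRequest, Image, JS, KeywordFrequency, Link) are namedtuple-style records (Bases: tuple), so fields can be read by position or by name. A minimal sketch, with hypothetical values:

    from wrdrd.tools import crawl

    # Hypothetical values; CSS fields are (loc, src) per the aliases above.
    css = crawl.CSS(loc="https://example.org/", src="/static/style.css")
    assert css[0] == css.loc and css[1] == css.src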

class wrdrd.tools.crawl.CrawlRequest(src, url, datetime)

Bases: tuple

datetime

Alias for field number 2

src

Alias for field number 0

url

Alias for field number 1

class wrdrd.tools.crawl.Image(loc, src, alt, height, width, text)

Bases: tuple

alt

Alias for field number 2

height

Alias for field number 3

loc

Alias for field number 0

src

Alias for field number 1

text

Alias for field number 5

width

Alias for field number 4

class wrdrd.tools.crawl.JS(loc, src)

Bases: tuple

loc

Alias for field number 0

src

Alias for field number 1

class wrdrd.tools.crawl.KeywordFrequency(url, frequencies)

Bases: tuple

frequencies

Alias for field number 1

url

Alias for field number 0

class wrdrd.tools.crawl.Link(loc, href, name, target, text, parent_id)

Bases: tuple

href

Alias for field number 1

loc

Alias for field number 0

name

Alias for field number 2

parent_id

Alias for field number 5

target

Alias for field number 3

text

Alias for field number 4

class wrdrd.tools.crawl.ResultStore[source]

Bases: object

Result store interface

itervalues()[source]

Get an iterable over the values in self.db

Returns:

an iterable over the values in self.db

Return type:

iterable

values()

Get an iterable over the values in self.db

Returns:

an iterable over the values in self.db

Return type:

iterable

class wrdrd.tools.crawl.URLCrawlQueue[source]

Bases: object

Queue of CrawlRequest URLs to crawl and their states

DONE = 2
ERROR = 3
NEW = 0
OUT = 1

count()[source]

Get the count of URLCrawlQueue.NEW CrawlRequest objects

Returns:

count of URLCrawlQueue.NEW CrawlRequest objects

Return type:

int

done(item)[source]

Mark a CrawlRequest as URLCrawlQueue.DONE

error(item)[source]

Mark a CrawlRequest as URLCrawlQueue.ERROR

pop()[source]

Pop a CrawlRequest off the queue and mark it as URLCrawlQueue.OUT

Returns:

CrawlRequest

Return type:

CrawlRequest

push(item)[source]

Push a CrawlRequest onto the queue and mark it as URLCrawlQueue.NEW
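
A minimal usage sketch, assuming URLCrawlQueue() takes no constructor arguments and that CrawlRequest fields are (src, url, datetime) as documented above:

    from wrdrd.tools import crawl

    queue = crawl.URLCrawlQueue()
    req = crawl.CrawlRequest(src=None, url="https://example.org/",
                             datetime=crawl.current_datetime())
    queue.push(req)                    # state: NEW
    while queue.count():
        item = queue.pop()             # state: OUT
        try:
            pass                       # fetch and parse item.url here
            queue.done(item)           # state: DONE
        except Exception:
            queue.error(item)          # state: ERROR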

wrdrd.tools.crawl.build_networkx_graph(url, links, label=None)[source]

Build a networkx.DiGraph from an iterable of links from a given URL

Parameters:
  • url (str) – URL from which the given links are derived

  • links (iterable) – iterable of Link objects

  • label (str) – label/title for graph

Returns:

directed graph of links

Return type:

networkx.DiGraph
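
A sketch, assuming the Link namedtuple has the (loc, href, name, target, text, parent_id) fields documented above:

    from wrdrd.tools import crawl

    url = "https://example.org/"
    links = [crawl.Link(loc=url, href="https://example.org/about",
                        name=None, target=None, text="About", parent_id=None)]
    g = crawl.build_networkx_graph(url, links, label="example.org")
    print(g.number_of_nodes(), g.number_of_edges())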

wrdrd.tools.crawl.crawl_url(start_url, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Crawl pages starting at start_url

Parameters:
  • start_url (str) – URL to start crawling from

  • output (filelike) – file to .write() output to

Returns:

ResultStore of (URL, crawl_status_dict) pairs

Return type:

ResultStore
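
A minimal sketch; output defaults to stdout, and the return value supports the ResultStore iteration documented above:

    import sys
    from wrdrd.tools import crawl

    results = crawl.crawl_url("https://example.org/", output=sys.stdout)
    for status in results.values():    # crawl-status values from the store
        print(status)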

wrdrd.tools.crawl.current_datetime()[source]

Get the current datetime in ISO format

Returns:

current datetime in ISO format

Return type:

str

wrdrd.tools.crawl.expand_link(_src, _url)[source]

Expand a link given the containing document’s URI

Parameters:
  • _src (str) – containing document’s URI

  • _url (str) – link URI

Returns:

expanded URI

Return type:

str

wrdrd.tools.crawl.extract_css(url, bs)[source]

Find CSS <link> links in a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Yields:

CSS – a CSS

wrdrd.tools.crawl.extract_images(url, bs)[source]

Find <img> images in a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Yields:

Image – an Image

wrdrd.tools.crawl.extract_js(url, bs)[source]

Find JS <script> links in a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Yields:

JS – a JS
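
The extract_* functions all follow the same (url, bs) pattern; a sketch, assuming the page is fetched with requests and parsed with bs4:

    import bs4
    import requests
    from wrdrd.tools import crawl

    url = "https://example.org/"
    bs = bs4.BeautifulSoup(requests.get(url).text, "html.parser")
    for css in crawl.extract_css(url, bs):
        print(css.loc, css.src)
    for js in crawl.extract_js(url, bs):
        print(js.loc, js.src)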

wrdrd.tools.crawl.extract_keywords(url, bs)[source]

Extract keyword frequencies from a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Returns:

KeywordFrequency

Return type:

KeywordFrequency

wrdrd.tools.crawl.extract_links(url, bs)[source]

Find <a> links in a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Yields:

Link – a Link

wrdrd.tools.crawl.extract_words_from_bs(bs)[source]

Get just the text from an HTML page

Parameters:

bs (bs4.BeautifulSoup) – BeautifulSoup object

Returns:

newline-joined unicode string

Return type:

unicode

wrdrd.tools.crawl.frequency_table(counterdict, sort_by='count')[source]

Calculate and sort a frequency table from a collections.Counter dict

Parameters:
  • counterdict (dict) – a collections.Counter dict of (key, count) pairs

  • sort_by (str) – either 'count' or 'name'

Yields:

tuple – (%, count, key)
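
For example, with a collections.Counter of word counts:

    import collections
    from wrdrd.tools import crawl

    counts = collections.Counter(["web", "web", "web", "crawl"])
    for pct, count, key in crawl.frequency_table(counts, sort_by='count'):
        print(pct, count, key)    # e.g. 75.0, 3, 'web' (exact format may vary)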

wrdrd.tools.crawl.get_stop_words()[source]

Get English stop words from NLTK, with a few modifications

Returns:

dictionary of stop words

Return type:

dict

wrdrd.tools.crawl.get_text_from_bs(bs)[source]

Get text from a BeautifulSoup object

Parameters:

bs (bs4.BeautifulSoup) – BeautifulSoup object

Returns:

space-joined unicode string with newlines replaced by spaces

Return type:

unicode
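
A short sketch combining strip_script_styles_from_bs (documented below) with get_text_from_bs:

    import bs4
    from wrdrd.tools import crawl

    html = "<html><body><p>Hello, crawl.</p><script>var x;</script></body></html>"
    bs = bs4.BeautifulSoup(html, "html.parser")
    bs = crawl.strip_script_styles_from_bs(bs)   # drop <script>/<style> first
    print(crawl.get_text_from_bs(bs))            # space-joined text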

wrdrd.tools.crawl.get_unicode_stdout(stdout=None, errors='replace', **kwargs)[source]

Wrap stdout as a UTF-8 unicode writer

Parameters:
  • stdout (filelike) – sys.stdout

  • errors (str) – codec error handling scheme (e.g. 'replace')

  • kwargs (dict) – additional keyword arguments for the writer

Returns:

output to .write() to

Return type:

filelike
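
A minimal sketch; presumably sys.stdout is wrapped when stdout is None:

    from wrdrd.tools import crawl

    out = crawl.get_unicode_stdout(errors='replace')
    out.write(u"r\u00e9sum\u00e9\n")    # non-ASCII text written safely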

wrdrd.tools.crawl.iteritems(obj)[source]
wrdrd.tools.crawl.itervalues(obj)[source]

wrdrd.tools.crawl.main(*args)[source]

wrdrd.tools.crawl main method: parse arguments and run commands

Parameters:

args (list) – list of commandline arguments

Returns:

nonzero returncode on error

Return type:

int

wrdrd.tools.crawl.print_frequency_table(frequencies, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Print a formatted ASCII frequency table

Parameters:
  • frequencies (iterable) – iterable of (%, count, word) tuples

  • output (filelike) – output to print() to

wrdrd.tools.crawl.same_netloc(url1, url2)[source]

Check whether two URIs have the same netloc

Parameters:
  • url1 (str) – first URI

  • url2 (str) – second URI

Returns:

True if both URIs have the same netloc

Return type:

bool
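
The comparison can be reproduced with the standard library; a sketch of the underlying check (the _same_netloc helper is hypothetical, for illustration only):

    from urllib.parse import urlsplit

    def _same_netloc(url1, url2):
        # netloc is host[:port], e.g. 'example.org:8080'
        return urlsplit(url1).netloc == urlsplit(url2).netloc

    assert _same_netloc("https://example.org/a", "https://example.org/b")
    assert not _same_netloc("https://example.org/", "https://www.example.org/")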

wrdrd.tools.crawl.strip_fragment(url)[source]

Strip the #fragment portion from a URI

Parameters:

url (str) – URI to strip #fragment from

Returns:

stripped URI (/ if otherwise empty)

Return type:

str
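
A sketch of the documented behavior using urllib.parse.urldefrag (illustrative, not the actual implementation):

    from urllib.parse import urldefrag

    def _strip_fragment(url):
        stripped = urldefrag(url)[0]
        return stripped or "/"    # documented: '/' if otherwise empty

    assert _strip_fragment("https://example.org/page#section-2") == "https://example.org/page"
    assert _strip_fragment("#top") == "/"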

wrdrd.tools.crawl.strip_script_styles_from_bs(bs)[source]

Strip <script> and <style> tags from a BeautifulSoup object

Parameters:

bs (bs4.BeautifulSoup) – BeautifulSoup object

Returns:

BeautifulSoup object with tags removed

Return type:

bs4.BeautifulSoup

wrdrd.tools.crawl.sum_counters(iterable)[source]

Sum the counts of an iterable

Parameters:

iterable (iterable) – iterable of collections.Counter dicts

Returns:

dict of (key, count) pairs

Return type:

defaultdict
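
A sketch of the documented behavior (the _sum_counters helper is illustrative):

    import collections

    def _sum_counters(counters):
        totals = collections.defaultdict(int)
        for counter in counters:
            for key, count in counter.items():
                totals[key] += count
        return totals

    c1 = collections.Counter(["seo", "seo", "dns"])
    c2 = collections.Counter(["dns"])
    assert _sum_counters([c1, c2]) == {"seo": 2, "dns": 2}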

wrdrd.tools.crawl.to_a_search_engine(url)[source]

Get a list of words from a URL (i.e., tokenize a page roughly as a classic search engine would)

Parameters:

url (str) – URL to HTTP GET with requests.get

Returns:

iterable of tokens

Return type:

iterable

wrdrd.tools.crawl.tokenize(text)[source]

Tokenize the given text with textblob.tokenizers.word_tokenize

Parameters:

text (str) – text to tokenize

Returns:

tokens

Return type:

iterable
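
For example:

    from wrdrd.tools import crawl

    tokens = list(crawl.tokenize("Crawl the web, then count the words."))
    print(tokens)    # word tokens, per textblob.tokenizers.word_tokenize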

wrdrd.tools.crawl.word_frequencies(url, keywords, stopwords=None)[source]

Get frequencies (counts) for a set of (non-stopword) keywords

Parameters:
  • url (str) – URL from which keywords were derived

  • keywords (iterable) – iterable of keywords

  • stopwords (dict) – stop words to exclude (default: None)

Returns:

KeywordFrequency

Return type:

KeywordFrequency
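
A minimal sketch; stop words ('the' below) are presumably excluded by default:

    from wrdrd.tools import crawl

    kw = crawl.word_frequencies("https://example.org/",
                                ["crawl", "crawl", "the", "web"])
    print(kw.url, kw.frequencies)    # KeywordFrequency(url, frequencies)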

wrdrd.tools.crawl.wrdcrawler(url, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Fetch and generate a report from the given URL

Parameters:
  • url (str) – URL to fetch

  • output (filelike) – output to print() to

Returns:

output

Return type:

filelike

wrdrd.tools.crawl.write_nxgraph_to_dot(g, output)[source]

Write a networkx graph as DOT to the specified output

Parameters:
  • g (networkx.Graph) – graph to write as DOT

  • output (filelike) – output to write to

wrdrd.tools.crawl.write_nxgraph_to_json(g, output)[source]

Write a networkx graph as JSON to the specified output

Parameters:
  • g (networkx.Graph) – graph to write as JSON

  • output (filelike) – output to write to
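
A sketch writing a small graph in both formats (DOT output may require an optional pydot/pygraphviz dependency):

    import sys
    import networkx as nx
    from wrdrd.tools import crawl

    g = nx.DiGraph()
    g.add_edge("https://example.org/", "https://example.org/about")
    crawl.write_nxgraph_to_dot(g, sys.stdout)        # DOT text
    with open("graph.json", "w") as f:
        crawl.write_nxgraph_to_json(g, f)            # JSON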

wrdrd.tools.domain module

wrdrd.tools.domain.check_google_dkim(domain, prefix='google')[source]

Check a Google DKIM DNS TXT record

Parameters:
  • domain (str) – DNS domain name

  • prefix (str) – DKIM s= selector (‘DKIM prefix’)

Returns:

0 if OK, 1 on error

Return type:

int

Note

This check function only finds “v=DKIM1” TXT records; it defaults to the standard google selector prefix and does not validate DKIM signatures.
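
A sketch of the underlying DNS lookup (assumes the dig binary is installed; per the check_google_domain docs below, Google-style DKIM selectors live at <prefix>._domainkey.<domain>):

    import subprocess

    out = subprocess.check_output(
        ["dig", "+short", "TXT", "google._domainkey.example.org"], text=True)
    print("v=DKIM1" in out)    # the check looks for a "v=DKIM1" TXT record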

wrdrd.tools.domain.check_google_dmarc(domain)[source]

Check a Google DMARC DNS TXT record

Parameters:

domain (str) – DNS domain name

Returns:

0 if OK, 1 on error

Return type:

int

wrdrd.tools.domain.check_google_domain(domain, dkim_prefix='google')[source]

Check DNS MX, SPF, DMARC, and DKIM records for a Google Apps domain

Parameters:
  • domain (str) – DNS domain

  • dkim_prefix (str) – DKIM prefix (<prefix>._domainkey)

Returns:

nonzero returncode on failure (sum of returncodes)

Return type:

int
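
For example:

    from wrdrd.tools import domain

    # Nonzero means one or more of the MX/SPF/DMARC/DKIM checks failed.
    retcode = domain.check_google_domain("example.org", dkim_prefix="google")
    print(retcode)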

wrdrd.tools.domain.check_google_mx(domain)[source]

Check Google MX DNS records

Parameters:

domain (str) – DNS domain name

Returns:

0 if OK, 1 on error

Return type:

int

wrdrd.tools.domain.check_google_spf(domain)[source]

Check a Google SPF DNS TXT record

Parameters:

domain (str) – DNS domain name

Returns:

0 if OK, 1 on error

Return type:

int

wrdrd.tools.domain.dig_all(domain)[source]

Get all DNS records with dig

Parameters:

domain (str) – DNS domain

Returns:

dig output

Return type:

str

wrdrd.tools.domain.dig_dnskey(zone)[source]

Get DNSSEC DNS records with dig

Parameters:

zone (str) – DNS zone

Returns:

dig output

Return type:

str

wrdrd.tools.domain.dig_mx(domain)[source]

Get MX DNS records with dig

Parameters:

domain (str) – DNS domain

Returns:

dig output

Return type:

str

wrdrd.tools.domain.dig_ns(domain)[source]

Get DNS NS records with dig

Parameters:

domain (str) – DNS domain

Returns:

dig output

Return type:

str

wrdrd.tools.domain.dig_spf(domain)[source]

Get SPF DNS TXT records with dig

Parameters:

domain (str) – DNS domain

Returns:

dig output

Return type:

str

wrdrd.tools.domain.dig_txt(domain)[source]

Get DNS TXT records with dig

Parameters:

domain (str) – DNS domain

Returns:

dig output

Return type:

str

wrdrd.tools.domain.domain_tools(domain)[source]

Get whois and DNS information for a domain.

Parameters:

domain (str) – DNS domain name

Returns:

nonzero returncode on failure (sum of returncodes)

Return type:

int

wrdrd.tools.domain.main(*args)[source]

wrdrd.tools.domain main method

Parameters:

args (list) – commandline arguments

Returns:

nonzero returncode on failure (sum of returncodes)

Return type:

int

wrdrd.tools.domain.nslookup(domain, nameserver='')[source]

Resolve a domain name to an IP address with nslookup

Parameters:
  • domain (str) – DNS domain

  • nameserver (str) – DNS domain name server to query (default: '')

Returns:

nslookup output

Return type:

str

wrdrd.tools.domain.whois(domain)[source]

Get whois information with whois

Parameters:

domain (str) – DNS domain

Returns:

whois output

Return type:

str

wrdrd.tools.stripsinglehtml module

class wrdrd.tools.stripsinglehtml.Test_stripsinglehtml(methodName='runTest')[source]

Bases: TestCase

test_stripsinglehtml()[source]

wrdrd.tools.stripsinglehtml.main(*args)[source]

wrdrd.tools.stripsinglehtml main method: print unicode stripsinglehtml output to stdout.

Parameters:

args (list) – list of commandline arguments

Returns:

zero

Return type:

int

wrdrd.tools.stripsinglehtml.stripsinglehtml(path='index.html')[source]

Strip markup from Sphinx singlehtml files (rather than writing a Sphinx […]-er)

Parameters:

path (str) – path to a Sphinx singlehtml file

Returns:

the stripped HTML document, as a parsed BeautifulSoup object

Return type:

bs4.BeautifulSoup
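
A minimal usage sketch (the singlehtml path below is hypothetical):

    from wrdrd.tools import stripsinglehtml

    bs = stripsinglehtml.stripsinglehtml(path="_build/singlehtml/index.html")
    with open("stripped.html", "w", encoding="utf-8") as f:
        f.write(str(bs))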