wrdrd.tools package

Submodules

wrdrd.tools.crawl module

class wrdrd.tools.crawl.CSS(loc, src)

Bases: tuple

loc

Alias for field number 0

src

Alias for field number 1
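
These record types are `collections.namedtuple` subclasses; "Alias for field number N" is the docstring namedtuple generates for each field. A minimal sketch of an equivalent declaration (the URL values below are hypothetical):

```python
from collections import namedtuple

# Equivalent declaration of the CSS record type documented above
CSS = namedtuple('CSS', ['loc', 'src'])

# Fields are readable by name or by index, like any namedtuple
css = CSS(loc='http://example.com/', src='/static/style.css')
print(css.loc)   # field number 0
print(css[1])    # same value as css.src
```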

class wrdrd.tools.crawl.CrawlRequest(src, url, datetime)

Bases: tuple

datetime

Alias for field number 2

src

Alias for field number 0

url

Alias for field number 1

class wrdrd.tools.crawl.Image(loc, src, alt, height, width, text)

Bases: tuple

alt

Alias for field number 2

height

Alias for field number 3

loc

Alias for field number 0

src

Alias for field number 1

text

Alias for field number 5

width

Alias for field number 4

class wrdrd.tools.crawl.JS(loc, src)

Bases: tuple

loc

Alias for field number 0

src

Alias for field number 1

class wrdrd.tools.crawl.KeywordFrequency(url, frequencies)

Bases: tuple

frequencies

Alias for field number 1

url

Alias for field number 0

class wrdrd.tools.crawl.Link(loc, href, name, target, text, parent_id)

Bases: tuple

href

Alias for field number 1

loc

Alias for field number 0

name

Alias for field number 2

parent_id

Alias for field number 5

target

Alias for field number 3

text

Alias for field number 4

class wrdrd.tools.crawl.ResultStore[source]

Bases: object

Result store interface

itervalues()[source]

Get an iterable over the values in self.db

Returns:an iterable over the values in self.db
Return type:iterable
class wrdrd.tools.crawl.Test_wrdcrawler(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_build_networkx_graph()[source]
test_crawl_url()[source]
test_other()[source]
test_same_netloc()[source]
test_strip_fragment()[source]
test_sum_counters()[source]
test_tokenize()[source]
test_wrdcrawler()[source]
class wrdrd.tools.crawl.URLCrawlQueue[source]

Bases: object

Queue of CrawlRequest URLs to crawl and their states

DONE = 2
ERROR = 3
NEW = 0
OUT = 1
count()[source]

Get the count of URLCrawlQueue.NEW CrawlRequest objects

Returns:count of URLCrawlQueue.NEW CrawlRequest objects
Return type:int
done(item)[source]

Mark a CrawlRequest as URLCrawlQueue.DONE

error(item)[source]

Mark a CrawlRequest as URLCrawlQueue.ERROR

pop()[source]

Pop a CrawlRequest off the queue and mark it as URLCrawlQueue.OUT

Returns:CrawlRequest
Return type:CrawlRequest
push(item)[source]

Push a CrawlRequest onto the queue and mark it as URLCrawlQueue.NEW
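
The queue above pairs a FIFO of pending requests with a per-URL state code. A minimal stdlib-only sketch of that pattern (the class, method names, and state codes mirror the documentation; the internals are assumptions):

```python
from collections import deque

class URLCrawlQueue(object):
    # State codes from the documentation above
    NEW, OUT, DONE, ERROR = 0, 1, 2, 3

    def __init__(self):
        self._queue = deque()
        self._states = {}  # url -> state code

    def push(self, item):
        """Push a (src, url, datetime) CrawlRequest and mark it NEW"""
        self._queue.append(item)
        self._states[item[1]] = self.NEW

    def pop(self):
        """Pop the oldest CrawlRequest and mark it OUT"""
        item = self._queue.popleft()
        self._states[item[1]] = self.OUT
        return item

    def done(self, item):
        """Mark a CrawlRequest as DONE"""
        self._states[item[1]] = self.DONE

    def error(self, item):
        """Mark a CrawlRequest as ERROR"""
        self._states[item[1]] = self.ERROR

    def count(self):
        """Count of requests still marked NEW"""
        return sum(1 for s in self._states.values() if s == self.NEW)
```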

wrdrd.tools.crawl.build_networkx_graph(url, links, label=None)[source]

Build a networkx.DiGraph from an iterable of links from a given URL

Parameters:
  • url (str) – URL from which the given links are derived
  • links (iterable) – iterable of Link objects
  • label (str) – label/title for graph
Returns:

directed graph of links

Return type:

networkx.DiGraph

wrdrd.tools.crawl.crawl_url(start_url, output=<open file '<stdout>', mode 'w'>)[source]

Crawl pages starting at start_url

Parameters:
  • start_url (str) – URL to start crawling from
  • output (filelike) – file to .write() output to
Returns:

ResultStore of (URL, crawl_status_dict) pairs

Return type:

ResultStore

wrdrd.tools.crawl.current_datetime()[source]

Get the current datetime in ISO format

Returns:current datetime in ISO format
Return type:str
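
A likely stdlib equivalent (a sketch; the original's exact formatting may differ):

```python
import datetime

def current_datetime():
    # ISO 8601 format, e.g. '2014-07-01T12:34:56.789012'
    return datetime.datetime.now().isoformat()

print(current_datetime())
```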

wrdrd.tools.crawl.expand_link(_src, _url)[source]

Expand a link given the containing document’s URI

Parameters:
  • _src (str) – containing document’s URI
  • _url (str) – link URI
Returns:

expanded URI

Return type:

str

wrdrd.tools.crawl.extract_css(url, bs)[source]

Find CSS <link> links in a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object
  • bs (bs4.BeautifulSoup) – BeautifulSoup object
Yields:

CSS – a CSS

wrdrd.tools.crawl.extract_images(url, bs)[source]

Find <img> images in a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object
  • bs (bs4.BeautifulSoup) – BeautifulSoup object
Yields:

Image – an Image

wrdrd.tools.crawl.extract_js(url, bs)[source]

Find JS <script> links in a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object
  • bs (bs4.BeautifulSoup) – BeautifulSoup object
Yields:

JS – a JS

wrdrd.tools.crawl.extract_keywords(url, bs)[source]

Extract keyword frequencies from a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object
  • bs (bs4.BeautifulSoup) – BeautifulSoup object
Returns:

KeywordFrequency

Return type:

KeywordFrequency

wrdrd.tools.crawl.extract_links(url, bs)[source]

Find <a> links in a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object
  • bs (bs4.BeautifulSoup) – BeautifulSoup object
Yields:

Link – a Link
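
The extractors above walk a bs4.BeautifulSoup tree. As a dependency-free illustration of the same idea, here is an `<a href>` extractor built on the stdlib `html.parser` (a sketch, not the package's implementation, which uses BeautifulSoup):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attribute values from <a> tags"""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)

parser = LinkExtractor()
parser.feed('<p><a href="/about">About</a> <a href="#top">Top</a></p>')
print(parser.hrefs)  # ['/about', '#top']
```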

wrdrd.tools.crawl.extract_words_from_bs(bs)[source]

Get just the text from an HTML page

Parameters:bs (bs4.BeautifulSoup) – BeautifulSoup object
Returns:newline-joined unicode string
Return type:unicode
wrdrd.tools.crawl.frequency_table(counterdict, sort_by='count')[source]

Calculate and sort a frequency table from a collections.Counter dict

Parameters:
  • counterdict (dict) – a collections.Counter dict of (key, count) pairs
  • sort_by (str) – either count or name
Yields:

tuple – (%, count, key)
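
A sketch of that calculation (percentage of the total, count, key), sorted descending by count; the exact tie-breaking and rounding in the package may differ:

```python
from collections import Counter

def frequency_table(counterdict, sort_by='count'):
    """Yield (%, count, key) tuples from a Counter-like dict"""
    total = sum(counterdict.values())
    if sort_by == 'count':
        items = sorted(counterdict.items(), key=lambda kv: kv[1], reverse=True)
    else:
        items = sorted(counterdict.items())
    for key, count in items:
        yield (100.0 * count / total, count, key)

counts = Counter(['web', 'web', 'crawl'])
for pct, count, key in frequency_table(counts):
    print('%5.1f%% %3d %s' % (pct, count, key))
```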

wrdrd.tools.crawl.get_stop_words()[source]

Get English stop words from NLTK with a few modifications

Returns:dictionary of stop words
Return type:dict
wrdrd.tools.crawl.get_text_from_bs(bs)[source]

Get text from a BeautifulSoup object

Parameters:bs (bs4.BeautifulSoup) – BeautifulSoup object
Returns:space-joined unicode string with newlines replaced by spaces
Return type:unicode
wrdrd.tools.crawl.get_unicode_stdout(stdout=None, errors='replace', **kwargs)[source]

Wrap stdout as a utf-8 unicode writer

Parameters:
  • stdout (filelike) – sys.stdout
  • errors (str) – what to do with errors
  • kwargs (dict) – **kwargs
Returns:

output to .write() to

Return type:

filelike

wrdrd.tools.crawl.main(*args)[source]

wrdrd.tools.crawl main method: parse arguments and run commands

Parameters:args (list) – list of commandline arguments
Returns:nonzero returncode on error
Return type:int
wrdrd.tools.crawl.print_frequency_table(frequencies, output=<open file '<stdout>', mode 'w'>)[source]

Print a formatted ASCII frequency table

Parameters:
  • frequencies (iterable) – iterable of (%, count, word) tuples
  • output (filelike) – output to print() to
wrdrd.tools.crawl.same_netloc(url1, url2)[source]

Check whether two URIs have the same netloc

Parameters:
  • url1 (str) – first URI
  • url2 (str) – second URI
Returns:

True if both URIs have the same netloc

Return type:

bool
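
A sketch with `urllib.parse` (likely close to the actual implementation):

```python
from urllib.parse import urlparse

def same_netloc(url1, url2):
    # netloc is the host[:port] component of a URI
    return urlparse(url1).netloc == urlparse(url2).netloc

print(same_netloc('http://example.com/a', 'https://example.com/b'))  # True
print(same_netloc('http://example.com/', 'http://example.org/'))     # False
```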

wrdrd.tools.crawl.strip_fragment(url)[source]

Strip the #fragment portion from a URI

Parameters:url (str) – URI to strip #fragment from
Returns:stripped URI (/ if otherwise empty)
Return type:str
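
A sketch with `urllib.parse.urldefrag`, including the documented fall-back to / when nothing remains after stripping:

```python
from urllib.parse import urldefrag

def strip_fragment(url):
    """Strip the #fragment; return '/' if the result is empty"""
    stripped = urldefrag(url).url
    return stripped or '/'

print(strip_fragment('http://example.com/page#section'))  # http://example.com/page
print(strip_fragment('#top'))                             # /
```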
wrdrd.tools.crawl.strip_script_styles_from_bs(bs)[source]

Strip <script> and <style> tags from a BeautifulSoup object

Parameters:bs (bs4.BeautifulSoup) – BeautifulSoup object
Returns:BeautifulSoup object with tags removed
Return type:bs4.BeautifulSoup
wrdrd.tools.crawl.sum_counters(iterable)[source]

Sum the counts of an iterable

Parameters:iterable (iterable) – iterable of collections.Counter dicts
Returns:dict of (key, count) pairs
Return type:defaultdict
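
A sketch returning a `defaultdict` of summed counts, as documented:

```python
from collections import Counter, defaultdict

def sum_counters(iterable):
    """Sum an iterable of Counter dicts into one defaultdict"""
    totals = defaultdict(int)
    for counter in iterable:
        for key, count in counter.items():
            totals[key] += count
    return totals

totals = sum_counters([Counter('aab'), Counter('abc')])
print(dict(totals))  # {'a': 3, 'b': 2, 'c': 1}
```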
wrdrd.tools.crawl.to_a_search_engine(url)[source]

Get a list of words (i.e., as a classic search engine would)

Parameters:url (str) – URL to HTTP GET with requests.get
Returns:iterable of tokens
Return type:iterable
wrdrd.tools.crawl.tokenize(text)[source]

Tokenize the given text with textblob.tokenizers.word_tokenize

Parameters:text (str) – text to tokenize
Returns:tokens
Return type:iterable
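
textblob's `word_tokenize` wraps NLTK tokenizers; as a rough dependency-free stand-in (not the package's actual tokenizer), a regex split:

```python
import re

def tokenize(text):
    # Word characters plus simple contractions; crude
    # compared to textblob/NLTK tokenization
    return re.findall(r"\w+(?:'\w+)?", text)

print(tokenize("Don't crawl the web twice."))
```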
wrdrd.tools.crawl.word_frequencies(url, keywords)[source]

Get frequencies (counts) for a set of (non-stopword) keywords

Parameters:
  • url (str) – URL from which keywords were derived
  • keywords (iterable) – iterable of keywords
Returns:

KeywordFrequency

Return type:

KeywordFrequency

wrdrd.tools.crawl.wrdcrawler(url, output=<open file '<stdout>', mode 'w'>)[source]

Fetch and generate a report from the given URL

Parameters:
  • url (str) – URL to fetch
  • output (filelike) – output to print() to
Returns:

output

Return type:

filelike

wrdrd.tools.crawl.write_nxgraph_to_dot(g, output)[source]

Write a networkx graph as DOT to the specified output

Parameters:
  • g (networkx.Graph) – graph to write as DOT
  • output (filelike) – output to write to
wrdrd.tools.crawl.write_nxgraph_to_json(g, output)[source]

Write a networkx graph as JSON to the specified output

Parameters:
  • g (networkx.Graph) – graph to write as JSON
  • output (filelike) – output to write to
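
`write_nxgraph_to_dot` presumably delegates to networkx's DOT support. As a dependency-free illustration of the output format only, a directed edge list serialized to DOT by hand (not the package's implementation):

```python
def edges_to_dot(edges, name='g'):
    """Serialize (src, dst) pairs as a DOT digraph"""
    lines = ['digraph "%s" {' % name]
    for src, dst in edges:
        lines.append('  "%s" -> "%s";' % (src, dst))
    lines.append('}')
    return '\n'.join(lines)

dot = edges_to_dot([('http://example.com/', 'http://example.com/about')])
print(dot)
```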

wrdrd.tools.domain module

wrdrd.tools.domain.check_google_dkim(domain, prefix='google')[source]

Check a Google DKIM DNS TXT record

Parameters:
  • domain (str) – DNS domain name
  • prefix (str) – DKIM s= selector (‘DKIM prefix’)
Returns:

0 if OK, 1 on error

Return type:

int

Note

This check function only finds “v=DKIM1” TXT records; it defaults to the ‘google’ selector prefix and does not validate DKIM signatures.

wrdrd.tools.domain.check_google_dmarc(domain)[source]

Check a Google DMARC DNS TXT record

Parameters:domain (str) – DNS domain name
Returns:0 if OK, 1 on error
Return type:int
wrdrd.tools.domain.check_google_domain(domain, dkim_prefix='google')[source]

Check DNS MX, SPF, DMARC, and DKIM records for a Google Apps domain

Parameters:
  • domain (str) – DNS domain
  • dkim_prefix (str) – DKIM prefix (<prefix>._domainkey)
Returns:

nonzero returncode on failure (sum of returncodes)

Return type:

int

wrdrd.tools.domain.check_google_mx(domain)[source]

Check Google MX DNS records

Parameters:domain (str) – DNS domain name
Returns:0 if OK, 1 on error
Return type:int
wrdrd.tools.domain.check_google_spf(domain)[source]

Check a Google SPF DNS TXT record

Parameters:domain (str) – DNS domain name
Returns:0 if OK, 1 on error
Return type:int
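
The MX/SPF/DMARC checks above shell out to dig and scan the returned TXT records. The SPF check presumably looks for Google's documented include mechanism; a sketch of that string test (the heuristic is an assumption about the check's logic, and the record text is Google's published SPF include):

```python
def looks_like_google_spf(txt_record):
    """Heuristic: a Google Apps SPF TXT record includes _spf.google.com"""
    return (txt_record.startswith('v=spf1')
            and 'include:_spf.google.com' in txt_record)

print(looks_like_google_spf('v=spf1 include:_spf.google.com ~all'))  # True
print(looks_like_google_spf('v=spf1 -all'))                          # False
```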
wrdrd.tools.domain.dig_all(domain)[source]

Get all DNS records with dig

Parameters:domain (str) – DNS domain
Returns:dig output
Return type:str
wrdrd.tools.domain.dig_dnskey(zone)[source]

Get DNSSEC DNS records with dig

Parameters:zone (str) – DNS zone
Returns:dig output
Return type:str
wrdrd.tools.domain.dig_mx(domain)[source]

Get MX DNS records with dig

Parameters:domain (str) – DNS domain
Returns:dig output
Return type:str
wrdrd.tools.domain.dig_ns(domain)[source]

Get DNS NS records with dig

Parameters:domain (str) – DNS domain
Returns:dig output
Return type:str
wrdrd.tools.domain.dig_spf(domain)[source]

Get SPF DNS TXT records with dig

Parameters:domain (str) – DNS domain
Returns:dig output
Return type:str
wrdrd.tools.domain.dig_txt(domain)[source]

Get DNS TXT records with dig

Parameters:domain (str) – DNS domain
Returns:dig output
Return type:str
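
Each dig_* helper presumably builds and runs a dig command with subprocess; a sketch of the command construction only (the exact flags the package passes are assumptions):

```python
def dig_cmd(domain, rdtype='ANY'):
    """Build an argv list for dig; +noall +answer prints only the answer section"""
    return ['dig', '+noall', '+answer', domain, rdtype]

print(dig_cmd('example.com', 'TXT'))
# One would then run it with, e.g.:
#   subprocess.check_output(dig_cmd('example.com', 'TXT'))
```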
wrdrd.tools.domain.domain_tools(domain)[source]

Get whois and DNS information for a domain.

Parameters:domain (str) – DNS domain name
Returns:nonzero returncode on failure (sum of returncodes)
Return type:int
wrdrd.tools.domain.main(*args)[source]

wrdrd.tools.domain main method

Parameters:args (list) – commandline arguments
Returns:nonzero returncode on failure (sum of returncodes)
Return type:int
wrdrd.tools.domain.nslookup(domain, nameserver='')[source]

Get nslookup information with nslookup (resolve a domain name to an IP)

Parameters:
  • domain (str) – DNS domain
  • nameserver (str) – DNS domain name server to query (default: '')
Returns:

nslookup output

Return type:

str

wrdrd.tools.domain.whois(domain)[source]

Get whois information with whois

Parameters:domain (str) – DNS domain
Returns:whois output
Return type:str

wrdrd.tools.stripsinglehtml module

class wrdrd.tools.stripsinglehtml.Test_stripsinglehtml(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_stripsinglehtml()[source]
wrdrd.tools.stripsinglehtml.main(*args)[source]

wrdrd.tools.stripsinglehtml main method: print unicode stripsinglehtml output to stdout.

Parameters:args (list) – list of commandline arguments
Returns:zero
Return type:int
wrdrd.tools.stripsinglehtml.stripsinglehtml(path='index.html')[source]

Strip markup from Sphinx singlehtml files (rather than writing a Sphinx [...]-er)

Parameters:path (str) – path to a Sphinx singlehtml file
Returns:stripped HTML file
Return type:bs4.BeautifulSoup