wrdrd.tools package

Submodules

wrdrd.tools.crawl module

class wrdrd.tools.crawl.CSS(loc, src)

Bases: tuple

loc

Alias for field number 0

src

Alias for field number 1

class wrdrd.tools.crawl.CrawlRequest(src, url, datetime)

Bases: tuple

datetime

Alias for field number 2

src

Alias for field number 0

url

Alias for field number 1
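Both record types are plain tuples with named fields. A minimal sketch, assuming (per "Bases: tuple" and the "Alias for field number N" entries) that they are built with collections.namedtuple:

```python
from collections import namedtuple

# Sketches of two of the record types above; the "Alias for field number N"
# entries give the field order.
CSS = namedtuple('CSS', ['loc', 'src'])
CrawlRequest = namedtuple('CrawlRequest', ['src', 'url', 'datetime'])

req = CrawlRequest(src=None, url='https://example.com/',
                   datetime='2024-01-01T00:00:00')
req.url   # named access
req[1]    # positional access also works, since namedtuples are tuples
```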

class wrdrd.tools.crawl.Image(loc, src, alt, height, width, text)

Bases: tuple

alt

Alias for field number 2

height

Alias for field number 3

loc

Alias for field number 0

src

Alias for field number 1

text

Alias for field number 5

width

Alias for field number 4

class wrdrd.tools.crawl.JS(loc, src)

Bases: tuple

loc

Alias for field number 0

src

Alias for field number 1

class wrdrd.tools.crawl.KeywordFrequency(url, frequencies)

Bases: tuple

frequencies

Alias for field number 1

url

Alias for field number 0

class wrdrd.tools.crawl.Link(loc, href, name, target, text, parent_id)

Bases: tuple

href

Alias for field number 1

loc

Alias for field number 0

name

Alias for field number 2

parent_id

Alias for field number 5

target

Alias for field number 3

text

Alias for field number 4

class wrdrd.tools.crawl.ResultStore[source]

Bases: object

Result store interface

itervalues()[source]

Get an iterable over the values in self.db

Returns

an iterable over the values in self.db

Return type

iterable

values()

Get an iterable over the values in self.db

Returns

an iterable over the values in self.db

Return type

iterable

class wrdrd.tools.crawl.URLCrawlQueue[source]

Bases: object

Queue of CrawlRequest URLs to crawl and their states

DONE = 2
ERROR = 3
NEW = 0
OUT = 1
count()[source]

Get the count of URLCrawlQueue.NEW CrawlRequest objects

Returns

count of URLCrawlQueue.NEW CrawlRequest objects

Return type

int

done(item)[source]

Mark a CrawlRequest as URLCrawlQueue.DONE

error(item)[source]

Mark a CrawlRequest as URLCrawlQueue.ERROR

pop()[source]

Pop a CrawlRequest off the queue and mark it as URLCrawlQueue.OUT

Returns

CrawlRequest

Return type

CrawlRequest

push(item)[source]

Push a CrawlRequest onto the queue and mark it as URLCrawlQueue.NEW
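The push/pop/done/error lifecycle above can be sketched as a small state machine. This is a stdlib analogue built only from the documented NEW/OUT/DONE/ERROR constants, not the actual wrdrd implementation:

```python
from collections import deque

NEW, OUT, DONE, ERROR = 0, 1, 2, 3

class MiniCrawlQueue:
    """Minimal analogue of URLCrawlQueue (an assumption from its docs)."""
    def __init__(self):
        self._queue = deque()
        self._state = {}

    def push(self, item):
        # new requests enter in the NEW state
        self._state[item] = NEW
        self._queue.append(item)

    def pop(self):
        # popping marks the request OUT (in flight)
        item = self._queue.popleft()
        self._state[item] = OUT
        return item

    def done(self, item):
        self._state[item] = DONE

    def error(self, item):
        self._state[item] = ERROR

    def count(self):
        # count() is documented as the number of NEW requests
        return sum(1 for s in self._state.values() if s == NEW)
```

Usage: push URLs, pop one to crawl it, then mark it done or errored; count() tracks only what is still waiting.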

wrdrd.tools.crawl.build_networkx_graph(url, links, label=None)[source]

Build a networkx.DiGraph from an iterable of links from a given URL

Parameters
  • url (str) – URL from which the given links are derived

  • links (iterable) – iterable of Link objects

  • label (str) – label/title for graph

Returns

directed graph of links

Return type

networkx.DiGraph
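The graph construction amounts to adding one directed edge per outbound link. A plain-dict analogue (networkx.DiGraph replaced by an adjacency dict, and Link objects simplified to href strings for brevity):

```python
def build_link_graph(url, links):
    """Adjacency-dict sketch of the DiGraph build_networkx_graph returns."""
    graph = {url: set()}
    for href in links:
        graph[url].add(href)           # directed edge: url -> href
        graph.setdefault(href, set())  # ensure every node has an entry
    return graph

g = build_link_graph('https://example.com/',
                     ['https://example.com/a', 'https://example.com/b'])
```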

wrdrd.tools.crawl.crawl_url(start_url, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Crawl pages starting at start_url

Parameters
  • start_url (str) – URL to start crawling from

  • output (filelike) – file to .write() output to

Returns

ResultStore of (URL, crawl_status_dict) pairs

Return type

ResultStore

wrdrd.tools.crawl.current_datetime()[source]

Get the current datetime in ISO format

Returns

current datetime in ISO format

Return type

str
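The likely equivalent in stdlib terms (an assumption; the exact datetime call in the source is not shown here):

```python
import datetime

def current_datetime():
    # ISO 8601 timestamp, e.g. '2024-01-01T12:00:00.000000'
    return datetime.datetime.now().isoformat()
```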

wrdrd.tools.crawl.expand_link(_src, _url)[source]

Expand a link given the containing document’s URI

Parameters
  • _src (str) – containing document’s URI

  • _url (str) – link URI

Returns

expanded URI

Return type

str
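The documented behavior matches urllib.parse.urljoin (an assumption about the implementation, not confirmed by the source):

```python
from urllib.parse import urljoin

# Relative paths resolve against the containing document's directory:
urljoin('https://example.com/a/b.html', 'c.html')   # https://example.com/a/c.html
# Absolute paths resolve against the netloc:
urljoin('https://example.com/a/b.html', '/c.html')  # https://example.com/c.html
# Protocol-relative URIs keep the document's scheme:
urljoin('https://example.com/a/b.html', '//cdn.example.com/x.js')
```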

wrdrd.tools.crawl.extract_css(url, bs)[source]

Find CSS <link> links in a given BeautifulSoup object

Parameters
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Yields

CSS – a CSS

wrdrd.tools.crawl.extract_images(url, bs)[source]

Find <img> images in a given BeautifulSoup object

Parameters
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Yields

Image – an Image

wrdrd.tools.crawl.extract_js(url, bs)[source]

Find JS <script> links in a given BeautifulSoup object

Parameters
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Yields

JS – a JS
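The extract_css/extract_js pattern is tag-and-attribute matching over the parsed page. A stdlib sketch with html.parser standing in for BeautifulSoup:

```python
from html.parser import HTMLParser

class AssetExtractor(HTMLParser):
    """Collects stylesheet hrefs and script srcs, as extract_css/extract_js do."""
    def __init__(self):
        super().__init__()
        self.css, self.js = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'link' and attrs.get('rel') == 'stylesheet':
            self.css.append(attrs.get('href'))
        elif tag == 'script' and attrs.get('src'):
            # inline <script> blocks have no src and are skipped
            self.js.append(attrs.get('src'))

p = AssetExtractor()
p.feed('<link rel="stylesheet" href="a.css">'
       '<script src="b.js"></script><script>var x;</script>')
```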

wrdrd.tools.crawl.extract_keywords(url, bs)[source]

Extract keyword frequencies from a given BeautifulSoup object

Parameters
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Returns

KeywordFrequency

Return type

KeywordFrequency

wrdrd.tools.crawl.extract_links(url, bs)[source]

Find <a> links in a given BeautifulSoup object

Parameters
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Yields

Link – a Link

wrdrd.tools.crawl.extract_words_from_bs(bs)[source]

Get just the text from an HTML page

Parameters

bs (bs4.BeautifulSoup) – BeautifulSoup object

Returns

newline-joined unicode string

Return type

unicode

wrdrd.tools.crawl.frequency_table(counterdict, sort_by='count')[source]

Calculate and sort a frequency table from a collections.Counter dict

Parameters
  • counterdict (dict) – a collections.Counter dict of (key, count) pairs

  • sort_by (str) – either 'count' or 'name'

Yields

tuple – (%, count, key)
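A sketch of the computation: each row pairs a percentage of the total with the raw count, sorted by count (descending, which is assumed) or by key name:

```python
from collections import Counter

def frequency_table(counterdict, sort_by='count'):
    """Yield (%, count, key) rows from a Counter-style dict."""
    total = sum(counterdict.values())
    if sort_by == 'count':
        key = lambda kv: -kv[1]   # most frequent first
    else:
        key = lambda kv: kv[0]    # alphabetical by key
    for word, count in sorted(counterdict.items(), key=key):
        yield (100.0 * count / total, count, word)

rows = list(frequency_table(Counter(['a', 'a', 'b'])))
```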

wrdrd.tools.crawl.get_stop_words()[source]

Get English stop words from NLTK, with a few modifications

Returns

dictionary of stop words

Return type

dict

wrdrd.tools.crawl.get_text_from_bs(bs)[source]

Get text from a BeautifulSoup object

Parameters

bs (bs4.BeautifulSoup) – BeautifulSoup object

Returns

space-joined unicode string with newlines replaced by spaces

Return type

unicode

wrdrd.tools.crawl.get_unicode_stdout(stdout=None, errors='replace', **kwargs)[source]

Wrap stdout as a utf-8 unicode writer

Parameters
  • stdout (filelike) – sys.stdout

  • errors (str) – codec error handling mode (e.g. 'replace')

  • kwargs (dict) – **kwargs

Returns

output to .write() to

Return type

filelike

wrdrd.tools.crawl.iteritems(obj)[source]
wrdrd.tools.crawl.itervalues(obj)[source]
wrdrd.tools.crawl.main(*args)[source]

wrdrd.tools.crawl main method: parse arguments and run commands

Parameters

args (list) – list of commandline arguments

Returns

nonzero returncode on error

Return type

int

wrdrd.tools.crawl.print_frequency_table(frequencies, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Print a formatted ASCII frequency table

Parameters
  • frequencies (iterable) – iterable of (%, count, word) tuples

  • output (filelike) – output to print() to

wrdrd.tools.crawl.same_netloc(url1, url2)[source]

Check whether two URIs have the same netloc

Parameters
  • url1 (str) – first URI

  • url2 (str) – second URI

Returns

True if both URIs have the same netloc

Return type

bool
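The netloc comparison is presumably urllib.parse.urlparse on both URIs:

```python
from urllib.parse import urlparse

def same_netloc(url1, url2):
    # compare the host[:port] components of the two URIs
    return urlparse(url1).netloc == urlparse(url2).netloc
```

This is what keeps a crawl on the starting site: paths and fragments differ, but the netloc must match.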

wrdrd.tools.crawl.strip_fragment(url)[source]

Strip the #fragment portion from a URI

Parameters

url (str) – URI to strip #fragment from

Returns

stripped URI (/ if otherwise empty)

Return type

str
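A stdlib sketch matching the documented behavior, including the "/ if otherwise empty" case (urldefrag as the mechanism is an assumption):

```python
from urllib.parse import urldefrag

def strip_fragment(url):
    stripped, _fragment = urldefrag(url)
    return stripped or '/'  # '/' if nothing remains, per the docstring
```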

wrdrd.tools.crawl.strip_script_styles_from_bs(bs)[source]

Strip <script> and <style> tags from a BeautifulSoup object

Parameters

bs (bs4.BeautifulSoup) – BeautifulSoup object

Returns

BeautifulSoup object with tags removed

Return type

bs4.BeautifulSoup

wrdrd.tools.crawl.sum_counters(iterable)[source]

Sum the counts of an iterable

Parameters

iterable (iterable) – iterable of collections.Counter dicts

Returns

dict of (key, count) pairs

Return type

defaultdict
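A sketch of the summation, merging per-page Counters into one defaultdict of totals:

```python
from collections import Counter, defaultdict

def sum_counters(iterable):
    """Merge an iterable of Counter dicts into one (key, total) mapping."""
    totals = defaultdict(int)
    for counter in iterable:
        for key, count in counter.items():
            totals[key] += count
    return totals

totals = sum_counters([Counter('aab'), Counter('ab')])
```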

wrdrd.tools.crawl.to_a_search_engine(url)[source]

Get a list of words from a URL (i.e. tokenize the page text as a classic search engine would)

Parameters

url (str) – URL to HTTP GET with requests.get

Returns

iterable of tokens

Return type

iterable

wrdrd.tools.crawl.tokenize(text)[source]

Tokenize the given text with textblob.tokenizers.word_tokenize

Parameters

text (str) – text to tokenize

Returns

tokens

Return type

iterable

wrdrd.tools.crawl.word_frequencies(url, keywords, stopwords=None)[source]

Get frequencies (counts) for a set of (non-stopword) keywords

Parameters
  • url (str) – URL from which keywords were derived

  • keywords (iterable) – iterable of keywords

  • stopwords (dict) – stop words to exclude (default: None)

Returns

KeywordFrequency

Return type

KeywordFrequency
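A sketch of the computation; the stop word set here is illustrative (wrdrd presumably falls back to get_stop_words() when stopwords is None), and the result is shaped like the KeywordFrequency namedtuple:

```python
from collections import Counter

def word_frequencies(url, keywords, stopwords=None):
    # illustrative fallback stop words; NOT the real get_stop_words() set
    stopwords = stopwords if stopwords is not None else {'the', 'a', 'an'}
    frequencies = Counter(k.lower() for k in keywords
                          if k.lower() not in stopwords)
    return (url, frequencies)  # (url, frequencies), like KeywordFrequency
```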

wrdrd.tools.crawl.wrdcrawler(url, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Fetch and generate a report from the given URL

Parameters
  • url (str) – URL to fetch

  • output (filelike) – output to print() to

Returns

output

Return type

filelike

wrdrd.tools.crawl.write_nxgraph_to_dot(g, output)[source]

Write a networkx graph as DOT to the specified output

Parameters
  • g (networkx.Graph) – graph to write as DOT

  • output (filelike) – output to write to

wrdrd.tools.crawl.write_nxgraph_to_json(g, output)[source]

Write a networkx graph as JSON to the specified output

Parameters
  • g (networkx.Graph) – graph to write as JSON

  • output (filelike) – output to write to

wrdrd.tools.domain module

wrdrd.tools.domain.check_google_dkim(domain, prefix='google')[source]

Check a Google DKIM DNS TXT record

Parameters
  • domain (str) – DNS domain name

  • prefix (str) – DKIM s= selector (‘DKIM prefix’)

Returns

0 if OK, 1 on error

Return type

int

Note

This check function only finds “v=DKIM1” TXT records; it defaults to the standard 'google' selector prefix and does not validate DKIM signatures.

wrdrd.tools.domain.check_google_dmarc(domain)[source]

Check a Google DMARC DNS TXT record

Parameters

domain (str) – DNS domain name

Returns

0 if OK, 1 on error

Return type

int

wrdrd.tools.domain.check_google_domain(domain, dkim_prefix='google')[source]

Check DNS MX, SPF, DMARC, and DKIM records for a Google Apps domain

Parameters
  • domain (str) – DNS domain

  • dkim_prefix (str) – DKIM prefix (<prefix>._domainkey)

Returns

nonzero returncode on failure (sum of returncodes)

Return type

int

wrdrd.tools.domain.check_google_mx(domain)[source]

Check Google MX DNS records

Parameters

domain (str) – DNS domain name

Returns

0 if OK, 1 on error

Return type

int

wrdrd.tools.domain.check_google_spf(domain)[source]

Check a Google SPF DNS TXT record

Parameters

domain (str) – DNS domain name

Returns

0 if OK, 1 on error

Return type

int

wrdrd.tools.domain.dig_all(domain)[source]

Get all DNS records with dig

Parameters

domain (str) – DNS domain

Returns

dig output

Return type

str

wrdrd.tools.domain.dig_dnskey(zone)[source]

Get DNSSEC DNS records with dig

Parameters

zone (str) – DNS zone

Returns

dig output

Return type

str

wrdrd.tools.domain.dig_mx(domain)[source]

Get MX DNS records with dig

Parameters

domain (str) – DNS domain

Returns

dig output

Return type

str

wrdrd.tools.domain.dig_ns(domain)[source]

Get DNS NS records with dig

Parameters

domain (str) – DNS domain

Returns

dig output

Return type

str

wrdrd.tools.domain.dig_spf(domain)[source]

Get SPF DNS TXT records with dig

Parameters

domain (str) – DNS domain

Returns

dig output

Return type

str

wrdrd.tools.domain.dig_txt(domain)[source]

Get DNS TXT records with dig

Parameters

domain (str) – DNS domain

Returns

dig output

Return type

str
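The dig_* helpers presumably shell out to the dig binary. A sketch with the argv built separately so it can be inspected without network access (the exact dig flags wrdrd uses are an assumption):

```python
import subprocess

def dig_command(record_type, domain):
    """Build the argv for a dig query (flags are assumed, not from the source)."""
    return ['dig', '+noall', '+answer', domain, record_type]

def dig_txt(domain):
    # requires the dig binary and network access
    return subprocess.check_output(dig_command('TXT', domain), text=True)
```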

wrdrd.tools.domain.domain_tools(domain)[source]

Get whois and DNS information for a domain.

Parameters

domain (str) – DNS domain name

Returns

nonzero returncode on failure (sum of returncodes)

Return type

int

wrdrd.tools.domain.main(*args)[source]

wrdrd.tools.domain main method

Parameters

args (list) – commandline arguments

Returns

nonzero returncode on failure (sum of returncodes)

Return type

int

wrdrd.tools.domain.nslookup(domain, nameserver='')[source]

Get nslookup information with nslookup (resolve a domain name to an IP)

Parameters
  • domain (str) – DNS domain

  • nameserver (str) – DNS domain name server to query (default: '')

Returns

nslookup output

Return type

str

wrdrd.tools.domain.whois(domain)[source]

Get whois information with whois

Parameters

domain (str) – DNS domain

Returns

whois output

Return type

str

wrdrd.tools.stripsinglehtml module

class wrdrd.tools.stripsinglehtml.Test_stripsinglehtml(methodName='runTest')[source]

Bases: unittest.case.TestCase

test_stripsinglehtml()[source]
wrdrd.tools.stripsinglehtml.main(*args)[source]

wrdrd.tools.stripsinglehtml main method: print unicode stripsinglehtml output to stdout.

Parameters

args (list) – list of commandline arguments

Returns

zero

Return type

int

wrdrd.tools.stripsinglehtml.stripsinglehtml(path='index.html')[source]

Strip markup from Sphinx singlehtml files (rather than writing a Sphinx […]-er)

Parameters

path (str) – path to a Sphinx singlehtml file

Returns

stripped HTML file

Return type

bs4.BeautifulSoup
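A stdlib analogue of the markup stripping (bs4 replaced by html.parser; keeps text nodes only, whereas the real function returns a BeautifulSoup object):

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Accumulates only the text content of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(markup):
    parser = TextOnly()
    parser.feed(markup)
    return ''.join(parser.chunks)
```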