wrdrd.tools package

Submodules

wrdrd.tools.crawl module

class wrdrd.tools.crawl.CSS(loc, src)

Bases: tuple

loc

Alias for field number 0

src

Alias for field number 1
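
CSS and the similar classes below (CrawlRequest, Image, JS, KeywordFrequency, Link) are namedtuple-style records (Bases: tuple), so fields can be read by position or by name. A minimal sketch, with hypothetical values:

    from wrdrd.tools import crawl

    # Hypothetical values; CSS fields are (loc, src) per the aliases above.
    css = crawl.CSS(loc="https://example.org/", src="/static/style.css")
    assert css[0] == css.loc and css[1] == css.src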

class wrdrd.tools.crawl.CrawlRequest(src, url, datetime)

Bases: tuple

datetime

Alias for field number 2

src

Alias for field number 0

url

Alias for field number 1

class wrdrd.tools.crawl.Image(loc, src, alt, height, width, text)

Bases: tuple

alt

Alias for field number 2

height

Alias for field number 3

loc

Alias for field number 0

src

Alias for field number 1

text

Alias for field number 5

width

Alias for field number 4

class wrdrd.tools.crawl.JS(loc, src)

Bases: tuple

loc

Alias for field number 0

src

Alias for field number 1

class wrdrd.tools.crawl.KeywordFrequency(url, frequencies)

Bases: tuple

frequencies

Alias for field number 1

url

Alias for field number 0

class wrdrd.tools.crawl.Link(loc, href, name, target, text, parent_id)

Bases: tuple

href

Alias for field number 1

loc

Alias for field number 0

name

Alias for field number 2

parent_id

Alias for field number 5

target

Alias for field number 3

text

Alias for field number 4

class wrdrd.tools.crawl.ResultStore[source]

Bases: object

Result store interface

itervalues()[source]

Get an iterable over the values in self.db

Returns:

an iterable over the values in self.db

Return type:

iterable

values()

Get an iterable over the values in self.db

Returns:

an iterable over the values in self.db

Return type:

iterable

class wrdrd.tools.crawl.URLCrawlQueue[source]

Bases: object

Queue of CrawlRequest URLs to crawl and their states

DONE = 2
ERROR = 3
NEW = 0
OUT = 1

count()[source]

Get the count of URLCrawlQueue.NEW CrawlRequest objects

Returns:

count of URLCrawlQueue.NEW CrawlRequest objects

Return type:

int

done(item)[source]

Mark a CrawlRequest as URLCrawlQueue.DONE

error(item)[source]

Mark a CrawlRequest as URLCrawlQueue.ERROR

pop()[source]

Pop a CrawlRequest off the queue and mark it as URLCrawlQueue.OUT

Returns:

CrawlRequest

Return type:

CrawlRequest

push(item)[source]

Push a CrawlRequest onto the queue and mark it as URLCrawlQueue.NEW
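
A minimal usage sketch, assuming URLCrawlQueue() takes no constructor arguments and that CrawlRequest fields are (src, url, datetime) as documented above:

    from wrdrd.tools import crawl

    queue = crawl.URLCrawlQueue()
    req = crawl.CrawlRequest(src=None, url="https://example.org/",
                             datetime=crawl.current_datetime())
    queue.push(req)                    # state: NEW
    while queue.count():
        item = queue.pop()             # state: OUT
        try:
            pass                       # fetch and parse item.url here
            queue.done(item)           # state: DONE
        except Exception:
            queue.error(item)          # state: ERROR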

wrdrd.tools.crawl.build_networkx_graph(url, links, label=None)[source]

Build a networkx.DiGraph from an iterable of links from a given URL

Parameters:
  • url (str) – URL from which the given links are derived

  • links (iterable) – iterable of Link objects

  • label (str) – label/title for graph

Returns:

directed graph of links

Return type:

networkx.DiGraph
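
A sketch, assuming the Link namedtuple has the (loc, href, name, target, text, parent_id) fields documented above:

    from wrdrd.tools import crawl

    url = "https://example.org/"
    links = [crawl.Link(loc=url, href="https://example.org/about",
                        name=None, target=None, text="About", parent_id=None)]
    g = crawl.build_networkx_graph(url, links, label="example.org")
    print(g.number_of_nodes(), g.number_of_edges())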

wrdrd.tools.crawl.crawl_url(start_url, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Crawl pages starting at start_url

Parameters:
  • start_url (str) – URL to start crawling from

  • output (filelike) – file to .write() output to

Returns:

ResultStore of (URL, crawl_status_dict) pairs

Return type:

ResultStore
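
A minimal sketch; output defaults to stdout, and the return value supports the ResultStore iteration documented above:

    import sys
    from wrdrd.tools import crawl

    results = crawl.crawl_url("https://example.org/", output=sys.stdout)
    for status in results.values():    # crawl-status values from the store
        print(status)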

wrdrd.tools.crawl.current_datetime()[source]

Get the current datetime in ISO format

Returns:

current datetime in ISO format

Return type:

str

wrdrd.tools.crawl.expand_link(_src, _url)[source]

Expand a link given the containing document’s URI

Parameters:
  • _src (str) – containing document’s URI

  • _url (str) – link URI

Returns:

expanded URI

Return type:

str

wrdrd.tools.crawl.extract_css(url, bs)[source]

Find CSS <link> links in a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Yields:

CSS – a CSS

wrdrd.tools.crawl.extract_images(url, bs)[source]

Find <img> images in a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Yields:

Image – an Image

wrdrd.tools.crawl.extract_js(url, bs)[source]

Find JS <script> links in a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Yields:

JS – a JS
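
The extract_* functions all follow the same (url, bs) pattern; a sketch, assuming the page is fetched with requests and parsed with bs4:

    import bs4
    import requests
    from wrdrd.tools import crawl

    url = "https://example.org/"
    bs = bs4.BeautifulSoup(requests.get(url).text, "html.parser")
    for css in crawl.extract_css(url, bs):
        print(css.loc, css.src)
    for js in crawl.extract_js(url, bs):
        print(js.loc, js.src)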

wrdrd.tools.crawl.extract_keywords(url, bs)[source]

Extract keyword frequencies from a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Returns:

KeywordFrequency

Return type:

KeywordFrequency

wrdrd.tools.crawl.extract_links(url, bs)[source]

Find <a> links in a given BeautifulSoup object

Parameters:
  • url (str) – URL of the BeautifulSoup object

  • bs (bs4.BeautifulSoup) – BeautifulSoup object

Yields:

Link – a Link

wrdrd.tools.crawl.extract_words_from_bs(bs)[source]

Get just the text from an HTML page

Parameters:

bs (bs4.BeautifulSoup) – BeautifulSoup object

Returns:

newline-joined unicode string

Return type:

unicode

wrdrd.tools.crawl.frequency_table(counterdict, sort_by='count')[source]

Calculate and sort a frequency table from a collections.Counter dict

Parameters:
  • counterdict (dict) – a collections.Counter dict of (key, count) pairs

  • sort_by (str) – either 'count' or 'name'

Yields:

tuple – (%, count, key)
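
For example, with a collections.Counter of word counts:

    import collections
    from wrdrd.tools import crawl

    counts = collections.Counter(["web", "web", "web", "crawl"])
    for pct, count, key in crawl.frequency_table(counts, sort_by='count'):
        print(pct, count, key)    # e.g. 75.0, 3, 'web' (exact format may vary)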

wrdrd.tools.crawl.get_stop_words()[source]

Get English stop words from NLTK, with a few modifications

Returns:

dictionary of stop words

Return type:

dict

wrdrd.tools.crawl.get_text_from_bs(bs)[source]

Get text from a BeautifulSoup object

Parameters:

bs (bs4.BeautifulSoup) – BeautifulSoup object

Returns:

space-joined unicode string with newlines replaced by spaces

Return type:

unicode
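
A short sketch combining strip_script_styles_from_bs (documented below) with get_text_from_bs:

    import bs4
    from wrdrd.tools import crawl

    html = "<html><body><p>Hello, crawl.</p><script>var x;</script></body></html>"
    bs = bs4.BeautifulSoup(html, "html.parser")
    bs = crawl.strip_script_styles_from_bs(bs)   # drop <script>/<style> first
    print(crawl.get_text_from_bs(bs))            # space-joined text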

wrdrd.tools.crawl.get_unicode_stdout(stdout=None, errors='replace', **kwargs)[source]

Wrap stdout as a UTF-8 unicode writer

Parameters:
  • stdout (filelike) – sys.stdout

  • errors (str) – codec error handling scheme (e.g. 'replace')

  • kwargs (dict) – additional keyword arguments for the writer

Returns:

output to .write() to

Return type:

filelike
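
A minimal sketch; presumably sys.stdout is wrapped when stdout is None:

    from wrdrd.tools import crawl

    out = crawl.get_unicode_stdout(errors='replace')
    out.write(u"r\u00e9sum\u00e9\n")    # non-ASCII text written safely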

wrdrd.tools.crawl.iteritems(obj)[source]
wrdrd.tools.crawl.itervalues(obj)[source]

wrdrd.tools.crawl.main(*args)[source]

wrdrd.tools.crawl main method: parse arguments and run commands

Parameters:

args (list) – list of commandline arguments

Returns:

nonzero returncode on error

Return type:

int

wrdrd.tools.crawl.print_frequency_table(frequencies, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Print a formatted ASCII frequency table

Parameters:
  • frequencies (iterable) – iterable of (%, count, word) tuples

  • output (filelike) – output to print() to

wrdrd.tools.crawl.same_netloc(url1, url2)[source]

Check whether two URIs have the same netloc

Parameters:
  • url1 (str) – first URI

  • url2 (str) – second URI

Returns:

True if both URIs have the same netloc

Return type:

bool
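
The comparison can be reproduced with the standard library; a sketch of the underlying check (the _same_netloc helper is hypothetical, for illustration only):

    from urllib.parse import urlsplit

    def _same_netloc(url1, url2):
        # netloc is host[:port], e.g. 'example.org:8080'
        return urlsplit(url1).netloc == urlsplit(url2).netloc

    assert _same_netloc("https://example.org/a", "https://example.org/b")
    assert not _same_netloc("https://example.org/", "https://www.example.org/")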

wrdrd.tools.crawl.strip_fragment(url)[source]

Strip the #fragment portion from a URI

Parameters:

url (str) – URI to strip #fragment from

Returns:

stripped URI (/ if otherwise empty)

Return type:

str
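
A sketch of the documented behavior using urllib.parse.urldefrag (illustrative, not the actual implementation):

    from urllib.parse import urldefrag

    def _strip_fragment(url):
        stripped = urldefrag(url)[0]
        return stripped or "/"    # documented: '/' if otherwise empty

    assert _strip_fragment("https://example.org/page#section-2") == "https://example.org/page"
    assert _strip_fragment("#top") == "/"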

wrdrd.tools.crawl.strip_script_styles_from_bs(bs)[source]

Strip <script> and <style> tags from a BeautifulSoup object

Parameters:

bs (bs4.BeautifulSoup) – BeautifulSoup object

Returns:

BeautifulSoup object with tags removed

Return type:

bs4.BeautifulSoup

wrdrd.tools.crawl.sum_counters(iterable)[source]

Sum the counts of an iterable

Parameters:

iterable (iterable) – iterable of collections.Counter dicts

Returns:

dict of (key, count) pairs

Return type:

defaultdict
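
A sketch of the documented behavior (the _sum_counters helper is illustrative):

    import collections

    def _sum_counters(counters):
        totals = collections.defaultdict(int)
        for counter in counters:
            for key, count in counter.items():
                totals[key] += count
        return totals

    c1 = collections.Counter(["seo", "seo", "dns"])
    c2 = collections.Counter(["dns"])
    assert _sum_counters([c1, c2]) == {"seo": 2, "dns": 2}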

wrdrd.tools.crawl.to_a_search_engine(url)[source]

Get a list of words from a URL (i.e., tokenize a page roughly as a classic search engine would)

Parameters:

url (str) – URL to HTTP GET with requests.get

Returns:

iterable of tokens

Return type:

iterable

wrdrd.tools.crawl.tokenize(text)[source]

Tokenize the given text with textblob.tokenizers.word_tokenize

Parameters:

text (str) – text to tokenize

Returns:

tokens

Return type:

iterable
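
For example:

    from wrdrd.tools import crawl

    tokens = list(crawl.tokenize("Crawl the web, then count the words."))
    print(tokens)    # word tokens, per textblob.tokenizers.word_tokenize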

wrdrd.tools.crawl.word_frequencies(url, keywords, stopwords=None)[source]

Get frequencies (counts) for a set of (non-stopword) keywords

Parameters:
  • url (str) – URL from which keywords were derived

  • keywords (iterable) – iterable of keywords

  • stopwords (dict) – stop words to exclude (default: None)

Returns:

KeywordFrequency

Return type:

KeywordFrequency
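
A minimal sketch; stop words ('the' below) are presumably excluded by default:

    from wrdrd.tools import crawl

    kw = crawl.word_frequencies("https://example.org/",
                                ["crawl", "crawl", "the", "web"])
    print(kw.url, kw.frequencies)    # KeywordFrequency(url, frequencies)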

wrdrd.tools.crawl.wrdcrawler(url, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]

Fetch and generate a report from the given URL

Parameters:
  • url (str) – URL to fetch

  • output (filelike) – output to print() to

Returns:

output

Return type:

filelike

wrdrd.tools.crawl.write_nxgraph_to_dot(g, output)[source]

Write a networkx graph as DOT to the specified output

Parameters:
  • g (networkx.Graph) – graph to write as DOT

  • output (filelike) – output to write to

wrdrd.tools.crawl.write_nxgraph_to_json(g, output)[source]

Write a networkx graph as JSON to the specified output

Parameters:
  • g (networkx.Graph) – graph to write as JSON

  • output (filelike) – output to write to
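
A sketch writing a small graph in both formats (DOT output may require an optional pydot/pygraphviz dependency):

    import sys
    import networkx as nx
    from wrdrd.tools import crawl

    g = nx.DiGraph()
    g.add_edge("https://example.org/", "https://example.org/about")
    crawl.write_nxgraph_to_dot(g, sys.stdout)        # DOT text
    with open("graph.json", "w") as f:
        crawl.write_nxgraph_to_json(g, f)            # JSON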

wrdrd.tools.domain module

wrdrd.tools.domain.check_google_dkim(domain, prefix='google')[source]

Check a Google DKIM DNS TXT record

Parameters:
  • domain (str) – DNS domain name

  • prefix (str) – DKIM s= selector (‘DKIM prefix’)

Returns:

0 if OK, 1 on error

Return type:

int

Note

This check function only finds “v=DKIM1” TXT records; it defaults to the standard google selector prefix and does not validate DKIM signatures.
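
A sketch of the underlying DNS lookup (assumes the dig binary is installed; per the check_google_domain docs below, Google-style DKIM selectors live at <prefix>._domainkey.<domain>):

    import subprocess

    out = subprocess.check_output(
        ["dig", "+short", "TXT", "google._domainkey.example.org"], text=True)
    print("v=DKIM1" in out)    # the check looks for a "v=DKIM1" TXT record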

wrdrd.tools.domain.check_google_dmarc(domain)[source]

Check a Google DMARC DNS TXT record

Parameters:

domain (str) – DNS domain name

Returns:

0 if OK, 1 on error

Return type:

int

wrdrd.tools.domain.check_google_domain(domain, dkim_prefix='google')[source]

Check DNS MX, SPF, DMARC, and DKIM records for a Google Apps domain

Parameters:
  • domain (str) – DNS domain

  • dkim_prefix (str) – DKIM prefix (<prefix>._domainkey)

Returns:

nonzero returncode on failure (sum of returncodes)

Return type:

int
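
For example:

    from wrdrd.tools import domain

    # Nonzero means one or more of the MX/SPF/DMARC/DKIM checks failed.
    retcode = domain.check_google_domain("example.org", dkim_prefix="google")
    print(retcode)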

wrdrd.tools.domain.check_google_mx(domain)[source]

Check Google MX DNS records

Parameters:

domain (str) – DNS domain name

Returns:

0 if OK, 1 on error

Return type:

int

wrdrd.tools.domain.check_google_spf(domain)[source]

Check a Google SPF DNS TXT record

Parameters:

domain (str) – DNS domain name

Returns:

0 if OK, 1 on error

Return type:

int

wrdrd.tools.domain.dig_all(domain)[source]

Get all DNS records with dig

Parameters:

domain (str) – DNS domain

Returns:

dig output

Return type:

str

wrdrd.tools.domain.dig_dnskey(zone)[source]

Get DNSSEC DNS records with dig

Parameters:

zone (str) – DNS zone

Returns:

dig output

Return type:

str

wrdrd.tools.domain.dig_mx(domain)[source]

Get MX DNS records with dig

Parameters:

domain (str) – DNS domain

Returns:

dig output

Return type:

str

wrdrd.tools.domain.dig_ns(domain)[source]

Get DNS NS records with dig

Parameters:

domain (str) – DNS domain

Returns:

dig output

Return type:

str

wrdrd.tools.domain.dig_spf(domain)[source]

Get SPF DNS TXT records with dig

Parameters:

domain (str) – DNS domain

Returns:

dig output

Return type:

str

wrdrd.tools.domain.dig_txt(domain)[source]

Get DNS TXT records with dig

Parameters:

domain (str) – DNS domain

Returns:

dig output

Return type:

str

wrdrd.tools.domain.domain_tools(domain)[source]

Get whois and DNS information for a domain.

Parameters:

domain (str) – DNS domain name

Returns:

nonzero returncode on failure (sum of returncodes)

Return type:

int

wrdrd.tools.domain.main(*args)[source]

wrdrd.tools.domain main method

Parameters:

args (list) – commandline arguments

Returns:

nonzero returncode on failure (sum of returncodes)

Return type:

int

wrdrd.tools.domain.nslookup(domain, nameserver='')[source]

Resolve a domain name to an IP address with nslookup

Parameters:
  • domain (str) – DNS domain

  • nameserver (str) – DNS domain name server to query (default: '')

Returns:

nslookup output

Return type:

str

wrdrd.tools.domain.whois(domain)[source]

Get whois information with whois

Parameters:

domain (str) – DNS domain

Returns:

whois output

Return type:

str

wrdrd.tools.stripsinglehtml module

class wrdrd.tools.stripsinglehtml.Test_stripsinglehtml(methodName='runTest')[source]

Bases: TestCase

test_stripsinglehtml()[source]

wrdrd.tools.stripsinglehtml.main(*args)[source]

wrdrd.tools.stripsinglehtml main method: print unicode stripsinglehtml output to stdout.

Parameters:

args (list) – list of commandline arguments

Returns:

zero

Return type:

int

wrdrd.tools.stripsinglehtml.stripsinglehtml(path='index.html')[source]

Strip markup from Sphinx singlehtml files (rather than writing a Sphinx […]-er)

Parameters:

path (str) – path to a Sphinx singlehtml file

Returns:

the stripped HTML document, as a parsed BeautifulSoup object

Return type:

bs4.BeautifulSoup
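
A minimal usage sketch (the singlehtml path below is hypothetical):

    from wrdrd.tools import stripsinglehtml

    bs = stripsinglehtml.stripsinglehtml(path="_build/singlehtml/index.html")
    with open("stripped.html", "w", encoding="utf-8") as f:
        f.write(str(bs))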