wrdrd.tools package¶
Submodules¶
wrdrd.tools.crawl module¶
- class wrdrd.tools.crawl.CSS(loc, src)¶
Bases:
tuple
- loc¶
Alias for field number 0
- src¶
Alias for field number 1
- class wrdrd.tools.crawl.CrawlRequest(src, url, datetime)¶
Bases:
tuple
- datetime¶
Alias for field number 2
- src¶
Alias for field number 0
- url¶
Alias for field number 1
- class wrdrd.tools.crawl.Image(loc, src, alt, height, width, text)¶
Bases:
tuple
- alt¶
Alias for field number 2
- height¶
Alias for field number 3
- loc¶
Alias for field number 0
- src¶
Alias for field number 1
- text¶
Alias for field number 5
- width¶
Alias for field number 4
- class wrdrd.tools.crawl.JS(loc, src)¶
Bases:
tuple
- loc¶
Alias for field number 0
- src¶
Alias for field number 1
- class wrdrd.tools.crawl.KeywordFrequency(url, frequencies)¶
Bases:
tuple
- frequencies¶
Alias for field number 1
- url¶
Alias for field number 0
- class wrdrd.tools.crawl.Link(loc, href, name, target, text, parent_id)¶
Bases:
tuple
- href¶
Alias for field number 1
- loc¶
Alias for field number 0
- name¶
Alias for field number 2
- parent_id¶
Alias for field number 5
- target¶
Alias for field number 3
- text¶
Alias for field number 4
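Example (illustrative; not part of the package docs): the classes above are namedtuples, so their fields can be read by name or by position. The URL and link values below are hypothetical.
    from wrdrd.tools.crawl import Link
    link = Link(loc="https://example.org/", href="/about",
                name=None, target=None, text="About", parent_id=None)
    print(link.href)   # field access by name
    print(link[1])     # same field by position (field number 1)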
- class wrdrd.tools.crawl.ResultStore[source]¶
Bases:
object
Result store interface
- itervalues()[source]¶
Get an iterable over the values in
self.db
- Returns:
an iterable over the values in
self.db
- Return type:
iterable
- values()¶
Get an iterable over the values in
self.db
- Returns:
an iterable over the values in
self.db
- Return type:
iterable
- class wrdrd.tools.crawl.URLCrawlQueue[source]¶
Bases:
object
Queue of CrawlRequest URLs to crawl and their states
- DONE = 2¶
- ERROR = 3¶
- NEW = 0¶
- OUT = 1¶
- count()[source]¶
Get the count of URLCrawlQueue.NEW CrawlRequest objects
- Returns:
count of URLCrawlQueue.NEW CrawlRequest objects
- Return type:
int
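Example (illustrative): the class constants above mark where each CrawlRequest is in its lifecycle; a minimal sketch reading them:
    from wrdrd.tools.crawl import URLCrawlQueue
    # Lifecycle states of a CrawlRequest in the queue
    print(URLCrawlQueue.NEW, URLCrawlQueue.OUT, URLCrawlQueue.DONE, URLCrawlQueue.ERROR)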
- wrdrd.tools.crawl.build_networkx_graph(url, links, label=None)[source]¶
Build a networkx.DiGraph from an iterable of links from a given URL
- Parameters:
url (str) – URL from which the given links are derived
links (iterable) – iterable of Link objects
label (str) – label/title for graph
- Returns:
directed graph of links
- Return type:
networkx.DiGraph
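Example (illustrative sketch): building a graph from a hand-made Link; the URL and link values are hypothetical, and how nodes are labelled inside the resulting networkx.DiGraph is an assumption.
    import networkx
    from wrdrd.tools.crawl import Link, build_networkx_graph
    links = [Link(loc="https://example.org/", href="https://example.org/about",
                  name=None, target=None, text="About", parent_id=None)]
    graph = build_networkx_graph("https://example.org/", links, label="example.org")
    print(isinstance(graph, networkx.DiGraph))   # directed graph of the links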
- wrdrd.tools.crawl.crawl_url(start_url, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Crawl pages starting at
start_url
- Parameters:
start_url (str) – URL to start crawling from
output (filelike) – file to .write() output to
- Returns:
ResultStore of (URL, crawl_status_dict) pairs
- Return type:
ResultStore
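Example (illustrative sketch): crawl_url performs live HTTP requests; the start URL below is hypothetical, and capturing its output with io.StringIO is an assumption based on the filelike parameter described above.
    import io
    from wrdrd.tools.crawl import crawl_url
    log = io.StringIO()                  # capture crawl output instead of stdout
    results = crawl_url("https://example.org/", output=log)
    for value in results.values():       # ResultStore exposes values()/itervalues()
        print(value)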
- wrdrd.tools.crawl.current_datetime()[source]¶
Get the current datetime in ISO format
- Returns:
current datetime in ISO format
- Return type:
str
- wrdrd.tools.crawl.expand_link(_src, _url)[source]¶
Expand a link given the containing document’s URI
- Parameters:
_src (str) – containing document’s URI
_url (str) – link URI
- Returns:
expanded URI
- Return type:
str
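Example (illustrative): resolving a relative link against the document it appeared in; the exact output string is an assumption based on the description above.
    from wrdrd.tools.crawl import expand_link
    # Expand a relative href using the containing document's URI
    absolute = expand_link("https://example.org/docs/index.html", "../about.html")
    print(absolute)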
- wrdrd.tools.crawl.extract_css(url, bs)[source]¶
Find CSS <link> links in a given BeautifulSoup object
- Parameters:
url (str) – URL of the BeautifulSoup object
bs (bs4.BeautifulSoup) – BeautifulSoup object
- Yields:
CSS – a CSS namedtuple for each CSS <link> found
- wrdrd.tools.crawl.extract_images(url, bs)[source]¶
Find <img> images in a given BeautifulSoup object
- Parameters:
url (str) – URL of the BeautifulSoup object
bs (bs4.BeautifulSoup) – BeautifulSoup object
- Yields:
Image – an Image namedtuple for each <img> found
- wrdrd.tools.crawl.extract_js(url, bs)[source]¶
Find JS <script> links in a given BeautifulSoup object
- Parameters:
url (str) – URL of the BeautifulSoup object
bs (bs4.BeautifulSoup) – BeautifulSoup object
- Yields:
JS – a JS namedtuple for each <script> found
- wrdrd.tools.crawl.extract_keywords(url, bs)[source]¶
Extract keyword frequencies from a given BeautifulSoup object
- Parameters:
url (str) – URL of the BeautifulSoup object
bs (bs4.BeautifulSoup) – BeautifulSoup object
- Returns:
- Return type:
- wrdrd.tools.crawl.extract_links(url, bs)[source]¶
Find <a> links in a given BeautifulSoup object
- Parameters:
url (str) – URL of the BeautifulSoup object
bs (bs4.BeautifulSoup) – BeautifulSoup object
- Yields:
Link – a Link namedtuple for each <a> found
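Example (illustrative): the extract_* helpers above all take a URL plus a bs4.BeautifulSoup object and yield namedtuples; a minimal sketch with extract_links (the HTML and the parser choice are hypothetical):
    import bs4
    from wrdrd.tools.crawl import extract_links
    html = '<html><body><a href="/about" target="_blank">About</a></body></html>'
    bs = bs4.BeautifulSoup(html, "html.parser")
    for link in extract_links("https://example.org/", bs):
        print(link.href, link.text)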
- wrdrd.tools.crawl.extract_words_from_bs(bs)[source]¶
Get just the text from an HTML page
- Parameters:
bs (bs4.BeautifulSoup) – BeautifulSoup object
- Returns:
newline-joined unicode string
- Return type:
unicode
- wrdrd.tools.crawl.frequency_table(counterdict, sort_by='count')[source]¶
Calculate and sort a frequency table from a collections.Counter dict
- Parameters:
counterdict (dict) – a collections.Counter dict of (key, count) pairs
sort_by (str) – either count or name
- Yields:
tuple – (%, count, key)
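Example (illustrative): frequency_table consumes a collections.Counter dict and yields (%, count, key) tuples; the sample counts are hypothetical.
    import collections
    from wrdrd.tools.crawl import frequency_table
    counts = collections.Counter(["web", "web", "web", "crawl"])
    for percent, count, key in frequency_table(counts, sort_by='count'):
        print(percent, count, key)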
- wrdrd.tools.crawl.get_stop_words()[source]¶
Get english stop words from NLTK with a few modifications
- Returns:
dictionary of stop words
- Return type:
dict
- wrdrd.tools.crawl.get_text_from_bs(bs)[source]¶
Get text from a BeautifulSoup object
- Parameters:
bs (bs4.BeautifulSoup) – BeautifulSoup object
- Returns:
space-joined unicode string with newlines replaced by spaces
- Return type:
unicode
- wrdrd.tools.crawl.get_unicode_stdout(stdout=None, errors='replace', **kwargs)[source]¶
Wrap stdout as a utf-8 unicode writer
- Parameters:
stdout (filelike) –
sys.stdout
errors (str) – what to do with errors
kwargs (dict) –
**kwargs
- Returns:
file-like object to .write() output to
- Return type:
filelike
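Example (illustrative sketch): wrapping sys.stdout so non-ASCII output is written as UTF-8 with unencodable characters replaced; the exact wrapper behaviour beyond what is documented above is an assumption.
    import sys
    from wrdrd.tools.crawl import get_unicode_stdout
    out = get_unicode_stdout(sys.stdout, errors='replace')
    out.write(u"café\n")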
- wrdrd.tools.crawl.main(*args)[source]¶
wrdrd.tools.crawl main method: parse arguments and run commands
- Parameters:
args (list) – list of commandline arguments
- Returns:
nonzero returncode on error
- Return type:
int
- wrdrd.tools.crawl.print_frequency_table(frequencies, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Print a formatted ASCII frequency table
- Parameters:
frequencies (iterable) – iterable of (%, count, word) tuples
output (filelike) – output to
print()
to
- wrdrd.tools.crawl.same_netloc(url1, url2)[source]¶
Check whether two URIs have the same netloc
- Parameters:
url1 (str) – first URI
url2 (str) – second URI
- Returns:
True if both URIs have the same netloc
- Return type:
bool
- wrdrd.tools.crawl.strip_fragment(url)[source]¶
Strip the #fragment portion from a URI
- Parameters:
url (str) – URI to strip #fragment from
- Returns:
stripped URI (/ if otherwise empty)
- Return type:
str
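Example (illustrative): same_netloc and strip_fragment are typically used together to normalize a discovered URL and decide whether it stays within the crawl; the URLs are hypothetical.
    from wrdrd.tools.crawl import same_netloc, strip_fragment
    url = strip_fragment("https://example.org/docs/#install")
    print(url)                                             # fragment removed
    print(same_netloc("https://example.org/", url))        # same netloc
    print(same_netloc("https://example.org/", "https://other.example.net/"))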
- wrdrd.tools.crawl.strip_script_styles_from_bs(bs)[source]¶
Strip <script> and <style> tags from a BeautifulSoup object
- Parameters:
bs (bs4.BeautifulSoup) – BeautifulSoup object
- Returns:
BeautifulSoup object with tags removed
- Return type:
bs4.BeautifulSoup
- wrdrd.tools.crawl.sum_counters(iterable)[source]¶
Sum the counts of an iterable
- Parameters:
iterable (iterable) – iterable of collections.Counter dicts
- Returns:
dict of (key, count) pairs
- Return type:
defaultdict
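Example (illustrative): combining per-page collections.Counter dicts into one total; the sample counts are hypothetical.
    import collections
    from wrdrd.tools.crawl import sum_counters
    per_page = [collections.Counter({"web": 2, "crawl": 1}),
                collections.Counter({"web": 1, "index": 4})]
    totals = sum_counters(per_page)
    print(totals["web"], totals["index"])   # summed counts per key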
- wrdrd.tools.crawl.to_a_search_engine(url)[source]¶
Get a list of words (as a classic search engine would)
- Parameters:
url (str) – URL to HTTP GET with requests.get
- Returns:
iterable of tokens
- Return type:
iterable
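Example (illustrative sketch): this fetches the page over HTTP with requests.get, so it needs network access; the URL is hypothetical.
    from wrdrd.tools.crawl import to_a_search_engine
    tokens = list(to_a_search_engine("https://example.org/"))
    print(tokens[:10])   # first few tokens as a search engine would index them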
- wrdrd.tools.crawl.tokenize(text)[source]¶
Tokenize the given text with textblob.tokenizers.word_tokenize
- Parameters:
text (str) – text to tokenize
- Returns:
tokens
- Return type:
iterable
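Example (illustrative): tokenizing a short string; the exact token boundaries depend on textblob.tokenizers.word_tokenize.
    from wrdrd.tools.crawl import tokenize
    tokens = list(tokenize("Crawl the web, then count the words."))
    print(tokens)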
- wrdrd.tools.crawl.word_frequencies(url, keywords, stopwords=None)[source]¶
Get frequencies (counts) for a set of (non-stopword) keywords
- Parameters:
url (str) – URL from which keywords were derived
keywords (iterable) – iterable of keywords
- Returns:
- Return type:
- wrdrd.tools.crawl.wrdcrawler(url, output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)[source]¶
Fetch and generate a report from the given URL
- Parameters:
url (str) – URL to fetch
output (filelike) – output to
print()
to
- Returns:
output
- Return type:
filelike
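Example (illustrative sketch): wrdcrawler fetches the given URL over the network and prints a report to the supplied file-like object; the URL is hypothetical, and capturing the report with io.StringIO is an assumption based on the output parameter described above.
    import io
    from wrdrd.tools.crawl import wrdcrawler
    report = io.StringIO()
    wrdcrawler("https://example.org/", output=report)
    print(report.getvalue())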
wrdrd.tools.domain module¶
- wrdrd.tools.domain.check_google_dkim(domain, prefix='google')[source]¶
Check a Google DKIM DNS TXT record
- Parameters:
domain (str) – DNS domain name
prefix (str) – DKIM
s=
selector (‘DKIM prefix’)
- Returns:
0 if OK, 1 on error
- Return type:
int
Note
This check function only finds “v=DKIM1” TXT records; it defaults to the google prefix and does not validate DKIM signatures.
- wrdrd.tools.domain.check_google_dmarc(domain)[source]¶
Check a Google DMARC DNS TXT record
- Parameters:
domain (str) – DNS domain name
- Returns:
0 if OK, 1 on error
- Return type:
int
- wrdrd.tools.domain.check_google_domain(domain, dkim_prefix='google')[source]¶
Check DNS MX, SPF, DMARC, and DKIM records for a Google Apps domain
- Parameters:
domain (str) – DNS domain
dkim_prefix (str) – DKIM prefix (
<prefix>._domainkey
)
- Returns:
nonzero returncode on failure (sum of returncodes)
- Return type:
int
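Example (illustrative sketch): these checks issue live DNS queries; the domain below is hypothetical, and a nonzero sum of returncodes indicates at least one failed check.
    from wrdrd.tools import domain
    returncode = domain.check_google_domain("example.org", dkim_prefix="google")
    if returncode != 0:
        print("one or more Google Apps DNS checks failed")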
- wrdrd.tools.domain.check_google_mx(domain)[source]¶
Check Google MX DNS records
- Parameters:
domain (str) – DNS domain name
- Returns:
0 if OK, 1 on error
- Return type:
int
- wrdrd.tools.domain.check_google_spf(domain)[source]¶
Check a Google SPF DNS TXT record
- Parameters:
domain (str) – DNS domain name
- Returns:
0 if OK, 1 on error
- Return type:
int
- wrdrd.tools.domain.dig_all(domain)[source]¶
Get all DNS records with dig
- Parameters:
domain (str) – DNS domain
- Returns:
dig output
- Return type:
str
- wrdrd.tools.domain.dig_dnskey(zone)[source]¶
Get DNSSEC DNS records with dig
- Parameters:
zone (str) – DNS zone
- Returns:
dig output
- Return type:
str
- wrdrd.tools.domain.dig_mx(domain)[source]¶
Get MX DNS records with dig
- Parameters:
domain (str) – DNS domain
- Returns:
dig output
- Return type:
str
- wrdrd.tools.domain.dig_ns(domain)[source]¶
Get DNS NS records with dig
- Parameters:
domain (str) – DNS domain
- Returns:
dig output
- Return type:
str
- wrdrd.tools.domain.dig_spf(domain)[source]¶
Get SPF DNS TXT records with dig
- Parameters:
domain (str) – DNS domain
- Returns:
dig output
- Return type:
str
- wrdrd.tools.domain.dig_txt(domain)[source]¶
Get DNS TXT records with dig
- Parameters:
domain (str) – DNS domain
- Returns:
dig output
- Return type:
str
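Example (illustrative): each dig_* helper above shells out to dig and returns its text output, so the dig binary must be installed; the domain is hypothetical.
    from wrdrd.tools import domain
    print(domain.dig_mx("example.org"))       # MX records
    print(domain.dig_spf("example.org"))      # SPF TXT records
    print(domain.dig_dnskey("example.org"))   # DNSSEC records for the zone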
- wrdrd.tools.domain.domain_tools(domain)[source]¶
Get whois and DNS information for a domain.
- Parameters:
domain (str) – DNS domain name
- Returns:
nonzero returncode on failure (sum of returncodes)
- Return type:
int
- wrdrd.tools.domain.main(*args)[source]¶
wrdrd.tools.domain main method
- Parameters:
args (list) – commandline arguments
- Returns:
nonzero returncode on failure (sum of returncodes)
- Return type:
int
wrdrd.tools.stripsinglehtml module¶
- class wrdrd.tools.stripsinglehtml.Test_stripsinglehtml(methodName='runTest')[source]¶
Bases:
TestCase
- wrdrd.tools.stripsinglehtml.main(*args)[source]¶
wrdrd.tools.stripsinglehtml main method: print unicode stripsinglehtml output to stdout.
- Parameters:
args (list) – list of commandline arguments
- Returns:
zero
- Return type:
int