Folders and Labels
Folders are exclusive.
Labels are inclusive.
#Hashtags are labels.
Folders form a tree which may be flat.
Labels can form a tree but are otherwise flat.
Folder path: a/b/c
Nested label: a.b.c
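The exclusive/inclusive distinction above can be sketched as a data structure. This is a minimal illustration, not a reference implementation; the item name `report.txt`, the `urgent` label, and the helper functions are all hypothetical.

```python
from pathlib import PurePosixPath

# A folder is exclusive: each item has exactly one location in the tree.
folder_of = {"report.txt": PurePosixPath("a/b/c")}

# Labels are inclusive: each item can carry any number of them.
labels_of = {"report.txt": {"a.b.c", "urgent"}}


def label_ancestors(label: str) -> list[str]:
    """Expand a dotted label into every ancestor label plus itself."""
    parts = label.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def has_label(item: str, wanted: str) -> bool:
    """True if the item carries `wanted` or any label nested under it."""
    return any(wanted in label_ancestors(label) for label in labels_of[item])
```

Under these assumptions, an item labeled `a.b.c` is also found when filtering by `a` or `a.b`, while its folder path still names a single location.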
Bibliographic citations can take many forms.
Citations are most useful in a structured form (with a schema).
Citations in the bibliography or references or resources section of a textual document must be parsed in order to derive a citation graph.
Many impact statistics are derived from graph metrics based on citation frequency (and, by implication, measures such as centrality).
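Once citations have been parsed into (citing, cited) pairs, citation frequency is the in-degree of each node in the citation graph. The following is a minimal sketch with made-up document identifiers; real impact metrics involve considerably more than a raw count.

```python
from collections import Counter

# Hypothetical parsed citations: (citing document, cited document).
citations = [
    ("paper_b", "paper_a"),
    ("paper_c", "paper_a"),
    ("paper_c", "paper_b"),
    ("paper_d", "paper_a"),
]


def citation_counts(edges):
    """Count how often each document is cited (its in-degree)."""
    return Counter(cited for _citing, cited in edges)


counts = citation_counts(citations)
```

Here `paper_a` is cited three times and `paper_d` not at all; a `Counter` reports zero for documents that never appear as a citation target.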
Unicode symbols (Zero, Zerö, Zerø, Ƶero)
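One way to make these variants match the same query is to fold them to a common form before indexing. The sketch below uses Unicode NFKD decomposition from the standard library to strip combining marks. Letters with a stroke (ø, Ƶ) have no Unicode decomposition, so they need an explicit mapping; the `EXTRA_FOLDS` table here is a hypothetical, deliberately tiny example of one.

```python
import unicodedata

# Hypothetical extra mappings for letters that NFKD does not decompose.
EXTRA_FOLDS = str.maketrans({"ø": "o", "Ø": "O", "Ƶ": "Z", "ƶ": "z"})


def fold(text: str) -> str:
    """Strip combining marks, then apply the extra stroke-letter mappings."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(EXTRA_FOLDS)
```

With this folding, "Zero", "Zerö", "Zerø", and "Ƶero" all index under the same key. Whether that is desirable depends on the language: in some alphabets these are distinct letters, not decorated variants.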
Stemming & Spelling Correction
“walking” -> walk -> walk, walking, walkers, walked
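The stemming step above can be sketched with naive suffix stripping. This is an illustration only, assuming a hand-picked suffix list; production stemmers (for example the Porter stemmer available in NLTK) apply far more careful rules.

```python
from collections import defaultdict

# Assumed suffix list, checked longest first; not a real stemming algorithm.
SUFFIXES = ("ers", "ing", "ed", "er", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Group a vocabulary by stem so a query for one form finds the others.
vocabulary = ["walk", "walking", "walkers", "walked"]
by_stem = defaultdict(list)
for entry in vocabulary:
    by_stem[naive_stem(entry)].append(entry)
```

A query for "walking" is stemmed to "walk", and the lookup `by_stem["walk"]` then returns every indexed form.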
“Typoes and Mispelings” > “Fuzziness” https://www.elastic.co/guide/en/elasticsearch/guide/current/fuzziness.html
String distance (e.g. Hamming distance, Levenshtein distance)
Substitution, Insertion, Deletion (see also: Operational Transformation)
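Hamming distance counts substitutions only and is defined for strings of equal length; counting substitutions, insertions, and deletions together gives the Levenshtein (edit) distance, which is what fuzzy matching typically builds on. A minimal sketch of both, using the standard dynamic-programming formulation:

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b), # substitution
            ))
        previous = current
    return previous[-1]
```

For example, "color" and "colour" are one insertion apart, so a fuzziness threshold of 1 would treat them as a match.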
Regional language variants
“Colour”, “Couleur”, and “Color”
“寿司”, “壽司”, and “Sushi”
Does “Apple” also return e.g. “Grapple”, or just e.g. “apples”, “appleton”, “apple pie”?
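The answer depends on the matching mode. The sketch below contrasts plain substring matching with matching anchored at a word boundary; the function names are hypothetical and real search engines usually decide this at tokenization time instead.

```python
import re

titles = ["Grapple", "apples", "appleton", "apple pie"]


def substring_match(term: str, items: list[str]) -> list[str]:
    """Match the term anywhere inside an item, ignoring case."""
    return [item for item in items if term.lower() in item.lower()]


def word_prefix_match(term: str, items: list[str]) -> list[str]:
    """Match the term only where it starts a word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(term), re.IGNORECASE)
    return [item for item in items if pattern.search(item)]
```

Substring matching returns all four titles, including "Grapple"; word-prefix matching returns only "apples", "appleton", and "apple pie".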
a, and*, the, or*, not*
Logical Term grouping
“Quoting”, (Parentheses), Logical terms (Logic)
“This one” AND “That one”
“This one” AND (“that one”)
this one AND that one
-this one AND that one
-(“this one”) AND “that one”
(NOT “this one”) AND (“that one”)
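The quoted and negated forms above reduce to a set of phrases that must appear and a set that must not. A minimal sketch of evaluating such a query against one document follows; it assumes the query has already been parsed into `required` and `excluded` phrase lists, and the `matches` helper is hypothetical, not any engine's query syntax.

```python
def matches(text: str, required=(), excluded=()) -> bool:
    """True if every required phrase occurs and no excluded phrase does."""
    lowered = text.lower()
    has_all = all(phrase.lower() in lowered for phrase in required)
    has_none = not any(phrase.lower() in lowered for phrase in excluded)
    return has_all and has_none


doc_both = "This one and that one"
doc_that = "Only that one here"

# "This one" AND "That one"
both_required = [matches(d, required=["this one", "that one"])
                 for d in (doc_both, doc_that)]

# -("this one") AND "that one"
one_excluded = [matches(d, required=["that one"], excluded=["this one"])
                for d in (doc_both, doc_that)]
```

Without the quotes, many engines would treat `this one AND that one` as individual terms instead of phrases, which is why quoting and grouping change the result set.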
Full table scan (match every row every time) [very slow]
Document-Term graph / tree
“index” non-stop words and phrases as graph edges
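Instead of scanning every document for every query, terms can be indexed once so that a lookup touches only the documents that contain them. The following is a minimal inverted-index sketch; the stop-word set, sample documents, and function names are assumptions for illustration.

```python
import re
from collections import defaultdict

# Assumed stop-word list; real systems use longer, language-specific lists.
STOP_WORDS = {"a", "and", "the", "or", "not"}


def build_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Map each non-stop-word term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"\w+", text.lower()):
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    return index


def search_all(index: dict[str, set[int]], terms: list[str]) -> set[int]:
    """Ids of documents that contain every one of the given terms."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


documents = {1: "The apple and the pie", 2: "A walk, not a run", 3: "Apple walk"}
index = build_index(documents)
```

A query for both "apple" and "walk" is answered by intersecting two small sets, without reading any document text again.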
“entity recognition” / “entity extraction” / “phrase extraction”
OpenNLP (Java), NLTK (Python), Watson
“Mark Twain grew up not in Hannibal, Missouri but in St Louis, Missouri.”
Mark Twain (Mark, Twain, Mark Twain)
St Louis, Missouri
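To make the example concrete, the sketch below pulls candidate entities out of the sentence with a naive capitalization heuristic. This is not how OpenNLP, NLTK, or Watson work; it is an assumption-laden stand-in that only shows the shape of the output, including the name variants that might be indexed for each entity.

```python
import re

SENTENCE = ("Mark Twain grew up not in Hannibal, Missouri "
            "but in St Louis, Missouri.")

# Naive heuristic: runs of capitalized words, optionally joined by a comma.
CANDIDATE = re.compile(r"[A-Z][a-z]+(?:,? [A-Z][a-z]+)*")


def extract_candidates(text: str) -> list[str]:
    """Return capitalized word runs as candidate entity phrases."""
    return CANDIDATE.findall(text)


def variants(entity: str) -> list[str]:
    """Index an entity under each of its words and under the full phrase."""
    return entity.split() + [entity]


entities = extract_candidates(SENTENCE)
```

The heuristic fails on sentence-initial common words, lowercase entities, and abbreviations with periods, which is why trained entity recognizers are used in practice.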
There are a number of extensions for CKAN: http://extensions.ckan.org/
ckanext-extractor can automatically extract text and metadata from datasets (including PDF). http://extensions.ckan.org/extension/extractor/
ckanext-datajson can generate data.gov JSON for datasets: http://extensions.ckan.org/extension/datajson/
Fedora supports OAI-PMH.
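OAI-PMH requests are plain HTTP requests whose `verb` parameter selects one of six operations defined by the protocol. The sketch below only builds the request URL; the base URL is a placeholder, since the actual endpoint path depends on how a given repository is deployed.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def oai_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"Unknown OAI-PMH verb: {verb}")
    return base_url + "?" + urlencode({"verb": verb, **arguments})


# Placeholder endpoint; substitute the repository's real OAI-PMH base URL.
url = oai_request_url(
    "https://repository.example.org/oai",
    "ListRecords",
    metadataPrefix="oai_dc",
)
```

`oai_dc` (unqualified Dublin Core) is the metadata format that every OAI-PMH repository is required to support, so it is a reasonable default for a first harvest.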
There are additional frontend web applications for Fedora:
Fedora Commons is the database for a number of well-known institutional repositories (e.g. book and digital asset library catalogs).
Hydra is an open source web application frontend for Fedora Commons written in Ruby (Ruby on Rails).