Research¶
Folders and Labels¶
Folders are exclusive.
Labels are inclusive.
#Hashtags are labels.
Folders form a tree which may be flat.
Labels can form a tree but are otherwise flat.
Folder path: a/b/c
Nested label: a.b.c
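The distinction above can be sketched as a small data model. This is only an illustration: the `Item` class and the helper names are hypothetical, and it assumes `/` separates folder levels and `.` separates nested label levels, as in the two examples above.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    folder: str                                   # exclusive: exactly one path
    labels: set = field(default_factory=set)      # inclusive: any number


def in_folder(item: Item, folder: str) -> bool:
    """True if the item lives in `folder` or in one of its subfolders."""
    return item.folder == folder or item.folder.startswith(folder + "/")


def has_label(item: Item, label: str) -> bool:
    """True if the item carries `label` or a label nested below it."""
    return any(lab == label or lab.startswith(label + ".") for lab in item.labels)


note = Item("note.txt", folder="a/b/c", labels={"a.b.c", "todo"})
print(in_folder(note, "a/b"))    # the folder tree is matched by path prefix
print(has_label(note, "a.b"))    # nested labels are matched the same way
print(has_label(note, "todo"))   # flat labels match exactly
```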
Citation Metadata¶
Bibliographic citations can take many forms.
Citations are most useful in a structured form (with a schema).
Schema.org CreativeWork
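As a rough illustration of a structured citation, the sketch below serializes a record as JSON-LD using Schema.org terms. The title, author, and date are placeholder values, not a real publication; `ScholarlyArticle` is a Schema.org subtype of `CreativeWork`, and `citation` is a `CreativeWork` property.

```python
import json

# Placeholder record: every value here is invented for illustration.
citation = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "name": "An Example Article Title",
    "author": [{"@type": "Person", "name": "A. Author"}],
    "datePublished": "2015-01-01",
    # Works that this article cites, each itself a CreativeWork.
    "citation": [
        {"@type": "CreativeWork", "name": "An Earlier Work That Is Cited"}
    ],
}

print(json.dumps(citation, indent=2))
```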
Citations in the bibliography, references, or resources section of a textual document must be parsed in order to derive a citation graph.
Many impact statistics are derived from graph metrics based on citation frequency (and, by implication, things like centrality).
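A minimal sketch of the simplest such metric, citation frequency, is the in-degree of each node in the citation graph. The paper identifiers below are hypothetical, and the graph is assumed to have already been parsed into a mapping from each paper to the papers it cites.

```python
from collections import Counter

# Hypothetical parsed citations: paper -> list of papers it cites.
cites = {
    "paper_a": ["paper_c"],
    "paper_b": ["paper_a", "paper_c"],
    "paper_c": [],
    "paper_d": ["paper_c", "paper_a"],
}


def citation_counts(graph):
    """Count incoming citations (in-degree) for every paper in the graph."""
    counts = Counter({paper: 0 for paper in graph})
    for cited_list in graph.values():
        counts.update(cited_list)
    return counts


counts = citation_counts(cites)
for paper, n in counts.most_common():
    print(paper, n)
```

Centrality measures build on the same graph but also weigh who is doing the citing, which this sketch does not attempt.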
See:
Search engines¶
Query syntax
Case sensitivity
Unicode symbols (Zero, Zerö, Zerø, Ƶero)
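One common way to handle case and accented characters is to normalize both the query and the indexed text before comparing. The sketch below (the `fold` helper is a hypothetical name) uses Python's standard `unicodedata` module. It also shows a limitation: letters with a stroke, such as ø and Ƶ, have no Unicode decomposition, so folding them to plain letters would need an extra mapping table that is not included here.

```python
import unicodedata


def fold(text: str) -> str:
    """Decompose characters, drop combining marks, then case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


for word in ["Zero", "Zerö", "Zerø", "Ƶero"]:
    folded = fold(word)
    print(word, "->", folded, "| equals 'zero':", folded == "zero")
```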
Stemming & Spelling Correction
“walking” -> walk -> walk, walking, walkers, walked
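The expansion above can be sketched with a deliberately naive suffix stripper. This is a toy, not the Porter algorithm or any library's stemmer; the suffix list and the helper names are assumptions chosen only to reproduce the "walk" example.

```python
from collections import defaultdict

# Toy suffix list, longest first; a real stemmer has many more rules.
SUFFIXES = ("ers", "ing", "ed", "er", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(words):
    """Map each stem to the set of surface forms that share it."""
    index = defaultdict(set)
    for w in words:
        index[naive_stem(w)].add(w)
    return index


index = build_stem_index(["walk", "walking", "walkers", "walked", "talked"])
# A query for "walking" is reduced to its stem, then expanded again.
print(sorted(index[naive_stem("walking")]))
```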
Fuzzy matching
“Typoes and Mispelings” > “Fuzziness” https://www.elastic.co/guide/en/elasticsearch/guide/current/fuzziness.html
String distance (e.g. Levenshtein edit distance; Hamming distance counts substitutions only, between strings of equal length)
Substitution, Insertion, Deletion (see also: Operational Transformation)
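Counting substitutions, insertions, and deletions gives the Levenshtein edit distance, sketched below with the standard dynamic-programming recurrence. A Hamming distance function is included for contrast, since it only counts substitutions and is undefined for strings of different lengths. The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions that differ; only defined for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(levenshtein("color", "colour"))   # one insertion
print(levenshtein("kitten", "sitting"))
print(hamming("karolin", "kathrin"))
```

A fuzzy search would then accept index terms whose distance to the query term is below some small threshold.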
Regional language variants
https://en.wikipedia.org/wiki/American_and_British_English_spelling_differences#-our.2C_-or
“Colour”, “Couleur”, and “Color”
https://en.wikipedia.org/wiki/Romanization
“寿司”, “壽司”, and “Sushi”
String prefixes
Does “Apple” also return e.g. “Grapple” (substring match), or just e.g. “apples”, “appleton”, “apple pie” (prefix match)?
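The difference between the two behaviours is small in code. The sketch below compares a prefix match with a substring match over a hypothetical list of indexed terms, ignoring case.

```python
# Hypothetical indexed terms.
terms = ["apple pie", "apples", "appleton", "grapple", "pineapple"]


def prefix_matches(query, terms):
    """Terms that start with the query."""
    q = query.casefold()
    return [t for t in terms if t.casefold().startswith(q)]


def substring_matches(query, terms):
    """Terms that contain the query anywhere."""
    q = query.casefold()
    return [t for t in terms if q in t.casefold()]


print(prefix_matches("Apple", terms))
print(substring_matches("Apple", terms))
```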
Stop words
a, and*, the, or*, not*
Logical Term grouping
“Quoting”, (Parentheses), Logical terms (Logic)
“This one” AND “That one”
“This one” AND (“that one”)
this one AND that one
-this one AND that one
-(“this one”) AND “that one”
(NOT “this one”) AND (“that one”)
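To make the grouping concrete, the sketch below models quoted phrases and logical operators as composable predicates over a tiny, made-up document set. It does not parse query strings; the `phrase`, `AND`, `NOT`, and `search` helpers are hypothetical and stand in for whatever a real query parser would produce.

```python
# Made-up documents keyed by id.
docs = {
    1: "this one and that one",
    2: "only that one here",
    3: "just this one",
}


def phrase(text):
    """Predicate: the exact phrase occurs in the document (case-insensitive)."""
    return lambda doc: text.casefold() in doc.casefold()


def AND(*preds):
    return lambda doc: all(p(doc) for p in preds)


def NOT(pred):
    return lambda doc: not pred(doc)


def search(pred):
    return sorted(doc_id for doc_id, doc in docs.items() if pred(doc))


# "This one" AND "That one"
print(search(AND(phrase("This one"), phrase("That one"))))
# (NOT "this one") AND ("that one")
print(search(AND(NOT(phrase("this one")), phrase("that one"))))
```

Without quotes or parentheses, a query such as `-this one AND that one` is ambiguous about what the minus sign negates, which is why explicit grouping matters.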
Search algorithms:
natural language
Full table scan (match every row every time) [very slow]
Document-Term graph / tree
“index” non-stop words and phrases as graph edges
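A minimal sketch of the document-term idea is an inverted index: each non-stop word maps to the set of documents that contain it, so a query only touches the relevant entries instead of scanning every row. The documents are made up, and the stop word list is just the one given earlier in this section.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "and", "the", "or", "not"}


def build_index(docs):
    """Map each non-stop word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.casefold()):
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    return index


docs = {1: "The cat and the hat", 2: "A hat or a cap"}
index = build_index(docs)
print(sorted(index["hat"]))   # documents that mention "hat"
print("the" in index)         # stop words are never indexed
```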
“entity recognition” / “entity extraction” / “phrase extraction”
OpenNLP (Java), NLTK (Python), Watson
“Mark Twain grew up not in Hannibal, Missouri but in St Louis, Missouri.”
grew up
Mark Twain (Mark, Twain, Mark Twain)
Hannibal
Hannibal, Missouri
St Louis
St Louis, Missouri
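As a very rough stand-in for entity extraction, the sketch below pulls runs of capitalized words (optionally joined by a comma) out of the example sentence with a regular expression. This is a toy heuristic, not how OpenNLP, NLTK, or Watson work; it would misfire on sentence-initial common words and on abbreviations such as "St." written with a period.

```python
import re

# One or more capitalized words, optionally joined by ", " (e.g. "City, State").
ENTITY = re.compile(r"[A-Z][a-z]+(?:,? [A-Z][a-z]+)*")


def extract_entities(sentence):
    """Return the longest capitalized spans found in the sentence."""
    return ENTITY.findall(sentence)


def with_prefixes(entity):
    """Also yield the part before a comma, e.g. 'Hannibal' for 'Hannibal, Missouri'."""
    head = entity.split(", ")[0]
    return [entity] if head == entity else [head, entity]


sentence = "Mark Twain grew up not in Hannibal, Missouri but in St Louis, Missouri."
entities = extract_entities(sentence)
print(entities)
print([name for e in entities for name in with_prefixes(e)])
```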
Manual Index
Research Tools¶
Mendeley¶
Zotero is similar to Mendeley.
Zotero¶
See:
Mendeley is similar to Zotero.
CKAN¶
CKAN (Comprehensive Knowledge Archive Network) is an open source web application, written in Python, for cataloging data.
There are a number of extensions for CKAN: http://extensions.ckan.org/
ckanext-extractor can automatically extract text and metadata from datasets (including PDF). http://extensions.ckan.org/extension/extractor/
see also:
ckanext-datajson can generate data.gov JSON for datasets: http://extensions.ckan.org/extension/datajson/
DSpace¶
DSpace is an open source web application, written in Java, for creative works and their XML metadata.
DSpace supports OAI-PMH.
DSpace and Fedora Commons are now both part of DuraSpace.
Fedora Commons¶
Fedora Commons (Fedora Repository, Fedora) is an open source web application, written in Java, for creative works and their XML metadata.
Fedora supports OAI-PMH.
Fedora can index metadata with other search engines (e.g. Solr, Elasticsearch).
There are additional frontend web applications for Fedora, such as Hydra and Islandora.
Fedora Commons is the database for a number of well-known institutional repositories (e.g. book and digital asset library catalogs).
Hydra¶
Hydra is an open source web application frontend, written in Ruby, for Fedora Commons.
Blacklight¶
Blacklight is an open source web application written in Ruby for providing a search interface to Solr.
Hydra indexes Fedora Commons metadata with Solr, which can be displayed with Blacklight.
Islandora¶
Islandora is an open source web application frontend, written in PHP, for Fedora Commons.
Drupal (PHP)
Islandora indexes Fedora Commons metadata with Solr, which can be displayed with the Islandora Drupal application.
OAI-PMH¶
OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) is an XML-over-HTTP standard for sharing metadata about creative works using Dublin Core (DCMI dcterms) and other schemas.
Fedora Commons supports OAI-PMH.
DSpace supports OAI-PMH.
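A harvest request is an ordinary HTTP GET with a `verb` parameter, and the response is XML. The sketch below builds a `ListRecords` request URL and reads titles out of a hand-written sample response. The base URL and the sample record are hypothetical; `verb`, `metadataPrefix`, and the `oai_dc` format (unqualified Dublin Core elements) come from the OAI-PMH 2.0 specification. No network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"  # hypothetical endpoint

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample shaped like a ListRecords response (heavily trimmed).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Title</dc:title>
          <dc:creator>A. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL for a ListRecords request."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def titles(xml_text):
    """Extract every Dublin Core title from a response document."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iterfind(".//dc:title", NS)]


print(list_records_url(BASE_URL))
print(titles(SAMPLE))
```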