Folders and Labels
- Folders are exclusive.
- Labels are inclusive.
- #Hashtags are labels.
- Folders form a tree which may be flat.
- Labels can form a tree but are otherwise flat.
- Folder path: a/b/c
- Nested label: a.b.c
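The difference between exclusive folders and inclusive labels can be sketched in a few lines. This is a minimal illustration, not a reference design: the `Item` class, its method names, and the `label_ancestors` helper are hypothetical, and the `.` separator for nested labels simply follows the `a.b.c` notation above.

```python
from dataclasses import dataclass, field


def label_ancestors(label: str, sep: str = ".") -> list[str]:
    """Expand a nested label such as 'a.b.c' into ['a', 'a.b', 'a.b.c']."""
    parts = label.split(sep)
    return [sep.join(parts[: i + 1]) for i in range(len(parts))]


@dataclass
class Item:
    name: str
    folder: str  # exactly one folder path, e.g. "a/b/c"
    labels: set[str] = field(default_factory=set)  # any number of labels

    def move(self, folder: str) -> None:
        # Exclusive: moving replaces the previous folder.
        self.folder = folder

    def tag(self, label: str) -> None:
        # Inclusive: tagging adds to the labels already present.
        self.labels.add(label)

    def matches_label(self, label: str) -> bool:
        # A query for 'a' also matches items labelled 'a.b' or 'a.b.c'.
        return any(label in label_ancestors(own) for own in self.labels)


note = Item("note", folder="a/b/c")
note.tag("a.b.c")
note.tag("#todo")
note.move("x/y")  # the item is no longer in a/b/c
```

After these calls the item sits in one folder (`x/y`) but still carries both labels, and a lookup for the parent label `a.b` finds it through the nested label `a.b.c`.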
Bibliographic citations can take many forms.
Citations are most useful in a structured form (with a schema).
Citations in the bibliography, references, or resources section of a textual document must be parsed in order to derive a citation graph.
Many impact statistics are derived from graph metrics based on citation frequency (and, by implication, measures such as centrality).
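As a rough sketch of the simplest such metric, the snippet below counts incoming citations (in-degree) in a small citation graph. The document identifiers and the `citation_counts` function are made up for illustration; real impact statistics usually involve more than a raw count.

```python
from collections import Counter

# Hypothetical parsed citations: citing document -> documents it cites.
citations = {
    "paper_a": ["paper_b", "paper_c"],
    "paper_b": ["paper_c"],
    "paper_d": ["paper_c", "paper_b"],
}


def citation_counts(graph: dict[str, list[str]]) -> Counter:
    """Count incoming edges (in-degree) for every cited document."""
    counts: Counter = Counter()
    for cited_list in graph.values():
        for cited in set(cited_list):  # ignore duplicate references
            counts[cited] += 1
    return counts


counts = citation_counts(citations)
ranked = counts.most_common()  # most-cited documents first
```

Centrality measures build on the same graph, but weight a citation by the importance of the citing document instead of counting every edge equally.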
- Knowledge Engineering > Search Engine Indexing
- Query syntax
- Case sensitivity
- Unicode symbols (Zero, Zerö, Zerø, Ƶero)
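One common way to make matching insensitive to case and accents is to case-fold the text, decompose it with Unicode NFKD, and drop the combining marks. The sketch below uses only the standard `unicodedata` module; the function names and the `EXTRA` table are assumptions for illustration. It also shows a limit of the approach: letters with a stroke (ø, ƶ) have no Unicode decomposition and need an explicit mapping.

```python
import unicodedata


def fold(text: str) -> str:
    """Case-fold, decompose (NFKD), and strip combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Letters without a decomposition need a hand-written table (hypothetical).
EXTRA = str.maketrans({
    "\u00f8": "o",  # LATIN SMALL LETTER O WITH STROKE
    "\u01b6": "z",  # LATIN SMALL LETTER Z WITH STROKE
})


def fold_strict(text: str) -> str:
    """Apply fold() and then the extra stroke-letter mapping."""
    return fold(text).translate(EXTRA)


# Zero, Zero with diaeresis, Zero with o-stroke, Zero with Z-stroke.
variants = ["Zero", "Zer\u00f6", "Zer\u00f8", "\u01b5ero"]
folded = [fold(v) for v in variants]
strict = [fold_strict(v) for v in variants]
```

With `fold` alone only the first two variants collapse to `zero`; `fold_strict` maps all four to the same key.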
- Stemming & Spelling Correction
- “walking” -> walk -> walk, walking, walkers, walked
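The expansion from “walking” to its stem and back out to related forms can be sketched with a toy suffix stripper. This is deliberately naive and is not the Porter algorithm or any particular library's stemmer; the suffix list and function names are assumptions chosen to fit the example.

```python
SUFFIXES = ("ers", "ing", "ed", "er", "s")  # toy list, longer suffixes first


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand(query: str, vocabulary: list[str]) -> list[str]:
    """Return every vocabulary word that shares the query's stem."""
    stem = naive_stem(query)
    return [w for w in vocabulary if naive_stem(w) == stem]


vocabulary = ["walk", "walking", "walkers", "walked", "wall", "talking"]
matches = expand("walking", vocabulary)
```

Here `matches` contains the four forms of “walk” and leaves out “wall” and “talking”. A production stemmer or lemmatizer handles far more cases (irregular forms, doubled consonants, other languages).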
- Fuzzy matching
- String distance (Hamming distance, Levenshtein edit distance)
- Substitution, Insertion, Deletion (see also: Operational Transformation)
- “Typoes and Mispelings” > “Fuzziness” https://www.elastic.co/guide/en/elasticsearch/guide/current/fuzziness.html
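The two kinds of string distance differ in which edits they can count. Hamming distance counts substitutions only and is defined for equal-length strings, while an edit distance such as Levenshtein also counts insertions and deletions. The sketch below implements both with plain dynamic programming; it is an illustration, not the algorithm of any specific search engine.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,              # deletion
                current[j - 1] + 1,           # insertion
                previous[j - 1] + (x != y),   # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


one_substitution = hamming("walk", "talk")
one_deletion = levenshtein("typoes", "typos")
two_insertions = levenshtein("mispelings", "misspellings")
```

A fuzzy query then accepts any indexed term whose distance to the query term is at or below some threshold, for example 1 or 2 edits.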
- Regional language variants
- “Colour”, “Couleur”, and “Color”
- “寿司”, “壽司”, and “Sushi”
- String prefixes
- Does “Apple” also return e.g. “Grapple”, or only e.g. “apples”, “appleton”, “apple pie”?
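The question of whether “Apple” returns “Grapple” comes down to prefix matching versus substring matching. A minimal sketch, assuming a small lowercase term list and hypothetical function names, is shown below; the prefix search uses binary search over a sorted list, which is one simple way to avoid scanning every term.

```python
from bisect import bisect_left

terms = sorted(
    ["apple", "apple pie", "apples", "appleton", "grapple", "pineapple"]
)


def prefix_matches(prefix: str, sorted_terms: list[str]) -> list[str]:
    """Terms that start with the prefix, found via binary search."""
    prefix = prefix.lower()
    start = bisect_left(sorted_terms, prefix)
    results = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        results.append(term)
    return results


def substring_matches(needle: str, all_terms: list[str]) -> list[str]:
    """Terms that contain the needle anywhere (requires a full scan)."""
    needle = needle.lower()
    return [t for t in all_terms if needle in t]


by_prefix = prefix_matches("Apple", terms)
by_substring = substring_matches("Apple", terms)
```

The prefix search returns the four terms beginning with “apple”; the substring search additionally returns “grapple” and “pineapple”.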
- Stop words
- a, and*, the, or*, not*
- Logical Term grouping
- “Quoting”, (Parentheses), Logical terms (Logic)
- “This one” AND “That one”
- “This one” AND (“that one”)
- this one AND that one
- -this one AND that one
- -(“this one”) AND “that one”
- (NOT “this one”) AND (“that one”)
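Quoted phrases combined with AND and NOT map naturally onto set operations over matching document IDs. The sketch below evaluates two of the queries above by hand; the sample documents and the `phrase` helper are hypothetical, and no query parser is included.

```python
# Hypothetical document collection: id -> text.
docs = {
    1: "this one and that one",
    2: "this one only",
    3: "that one only",
    4: "neither phrase",
}

ALL = set(docs)


def phrase(text: str) -> set[int]:
    """IDs of documents containing the exact phrase (case-insensitive)."""
    needle = text.lower()
    return {doc_id for doc_id, body in docs.items() if needle in body.lower()}


# “This one” AND “That one”: intersection of the two phrase matches.
both = phrase("This one") & phrase("That one")

# (NOT “this one”) AND (“that one”): complement, then intersection.
not_this_and_that = (ALL - phrase("this one")) & phrase("that one")
```

How the unquoted form (this one AND that one) is interpreted depends on the engine's default operator between bare terms, so it is not modelled here.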
- Search algorithms:
- Search Engine Indexing
- Data Structures
- natural language
- Full table scan (match every row every time) [very slow]
- Document-Term graph / tree
- “index” non-stop words and phrases as graph edges
- “entity recognition” / “entity extraction” / “phrase extraction”
- OpenNLP (Java), NLTK (Python), Watson
- “Mark Twain grew up not in Hannibal, Missouri but in St Louis, Missouri.”
- grew up
- Mark Twain (Mark, Twain, Mark Twain)
- Hannibal, Missouri
- St Louis
- St Louis, Missouri
- Manual Index
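The contrast between a full scan and a document-term index can be sketched with a small inverted index. The sample documents, stop-word set, and function names below are assumptions for illustration; the tokenizer is intentionally crude (whitespace split, punctuation stripped, no stemming or entity extraction).

```python
from collections import defaultdict

STOP_WORDS = {"a", "and", "the", "or", "not", "in", "but", "is", "on"}

# Hypothetical documents: id -> text.
documents = {
    "d1": "Mark Twain grew up in Hannibal, Missouri",
    "d2": "St Louis, Missouri is on the Mississippi",
    "d3": "The walkers walked in St Louis",
}


def tokenize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and drop stop words."""
    words = (w.strip(".,").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def full_scan(term: str) -> list[str]:
    """Re-read every document on every query (slow for large collections)."""
    return sorted(d for d, text in documents.items() if term in tokenize(text))


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each non-stop term to the set of documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


index = build_index(documents)
missouri_docs = sorted(index["missouri"])
st_louis_docs = sorted(index["st"] & index["louis"])
```

Both approaches return the same documents for a term, but the index is built once and each lookup afterwards is a dictionary access plus set operations. Indexing extracted phrases or entities (for example “St Louis, Missouri” as one key) follows the same pattern with a different tokenizer.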
There are a number of extensions for CKAN: http://extensions.ckan.org/
ckanext-extractor can automatically extract text and metadata from datasets (including PDF). http://extensions.ckan.org/extension/extractor/
ckanext-datajson can generate data.gov JSON for datasets: http://extensions.ckan.org/extension/datajson/
- Fedora supports OAI-PMH.
- Fedora can index metadata with other search engines (e.g. Solr, Elasticsearch).
- There are additional frontend web applications for Fedora:
- Fedora Commons is the database for a number of well-known institutional repositories (e.g. book and digital asset library catalogs).
Hydra is an open source web application frontend for Fedora Commons, written in Ruby (Ruby on Rails).
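Since Fedora supports OAI-PMH, metadata can be harvested with plain HTTP requests. The sketch below only builds a request URL using the protocol's standard `verb` and `metadataPrefix` parameters; the base URL is a placeholder, and the actual endpoint path depends on how a given repository is deployed.

```python
from urllib.parse import urlencode


def oai_request_url(base_url: str, verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb, **params}
    return f"{base_url}?{urlencode(query)}"


# Hypothetical endpoint; replace with the repository's real OAI-PMH URL.
BASE = "https://repository.example.org/oai"

list_records = oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc")
identify = oai_request_url(BASE, "Identify")
```

The response to such a request is an XML document; large result sets are paged with a `resumptionToken` that is passed back in a follow-up request.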