Data Science¶

https://en.wikipedia.org/wiki/Data_science

Science
- Open Science
- Scientific Method
Reproducibility
- Ten Simple Rules
- Linked Reproducibility

Jupyter Notebook
- CoCalc (SageMath)
- Google Colab
- Jupyter Docker Stacks (Conda)
- Jupyter Extensions
- Jupyter and Reproducibility

datasciencemasters.org¶

“The Open Source Data Science Masters”
http://datasciencemasters.org/

Ten Simple Rules¶

Homepage: http://collections.plos.org/ten-simple-rules
Hashtag: #TenSimpleRules
Twitter: https://twitter.com/hashtag/TenSimpleRules?src=hash

#TenSimpleRules for Reproducible Computational Research¶

“Ten Simple Rules for Reproducible Computational Research”
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285
DOI: 10.1371/journal.pcbi.1003285 Featured in PLOS Collections

For Every Result, Keep Track of How It Was Produced

Avoid Manual Data Manipulation Steps

Archive the Exact Versions of All External Programs Used

Version Control All Custom Scripts

Record All Intermediate Results, When Possible in Standardized Formats

For Analyses That Include Randomness, Note Underlying Random Seeds

Always Store Raw Data behind Plots

Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

Connect Textual Statements to Underlying Results

Provide Public Access to Scripts, Runs, and Results

For Every Result, Keep Track of How It Was Produced
- Git
- RDF, JSON-LD (e.g. W3C PROV)
- Workflow
- Knowledge Engineering > Linked Data
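One way to keep track of how a result was produced is to write a small provenance record next to it. The following is a minimal sketch, not a complete PROV implementation: it builds a JSON-LD document with a few W3C PROV-O terms (prov:Entity, prov:Activity, prov:wasGeneratedBy, prov:used, prov:startedAtTime). The example.org URLs, the ex: namespace, and the provenance_record helper are hypothetical and only serve the illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(result_url, script_url, input_url, input_bytes):
    """Describe one result as a PROV entity generated by one activity."""
    return {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "ex": "https://example.org/ns#",  # hypothetical namespace
        },
        "@id": result_url,
        "@type": "prov:Entity",
        "prov:wasGeneratedBy": {
            "@id": result_url + "#run",
            "@type": "prov:Activity",
            "prov:startedAtTime": datetime.now(timezone.utc).isoformat(),
            "prov:used": [
                {"@id": script_url},
                {
                    "@id": input_url,
                    # checksum of the exact input bytes (hypothetical property)
                    "ex:sha256": hashlib.sha256(input_bytes).hexdigest(),
                },
            ],
        },
    }


record = provenance_record(
    "https://example.org/results/table1.csv",  # hypothetical result URL
    "https://example.org/src/analysis.py",     # hypothetical script URL
    "https://example.org/data/raw.csv",        # hypothetical input URL
    b"raw input data",
)
print(json.dumps(record, indent=2))
```

Storing the record in Git alongside the script and the result keeps the three linked by URL.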
Avoid Manual Data Manipulation Steps
- Workflow
- Continuous Delivery
  - Test Automation (e.g. Test Driven Development)
- Data pipelines composed of containers
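As a rough illustration of replacing a manual cleanup step with tested code, the sketch below puts every transformation in a function and checks it with an automated test. The row layout and the normalize_row and pipeline names are assumptions made for this example.

```python
def normalize_row(row):
    """Clean one (label, amount) row: trim, lowercase, parse the amount."""
    label, amount = row
    return label.strip().lower(), float(amount.strip().lstrip("$"))


def pipeline(rows):
    """Apply every cleaning step in code, so that no step is done by hand."""
    return [normalize_row(row) for row in rows]


def test_pipeline():
    raw = [(" Wages ", "$256"), ("utilities", " $100 ")]
    assert pipeline(raw) == [("wages", 256.0), ("utilities", 100.0)]


test_pipeline()
```

Because the test runs automatically, a change to the cleaning rules that alters the output is noticed before the results are regenerated.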
Archive the Exact Versions of All External Programs Used
- Jupyter and Reproducibility (%version_information, %watermark) (should be “Reproducibility and Jupyter Notebook”)
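Outside of a notebook, a similar record of versions can be collected with the standard library. This is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the environment_record helper and the package list are hypothetical.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(packages):
    """Return interpreter, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": versions,
    }


record = environment_record(["numpy", "pandas"])
print(json.dumps(record, indent=2))
```

Writing this record next to each result makes it possible to rebuild the same environment later.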
Version Control All Custom Scripts
- Revision Control (e.g. Distributed Version Control)
Record All Intermediate Results, When Possible in Standardized Formats
- Linked Data (e.g. 5 ★ Linked Open Data)
For Analyses That Include Randomness, Note Underlying Random Seeds

Python random functions:
```
import os
import random

# None unless PYTHONHASHSEED is set in the environment
print(os.environ.get('PYTHONHASHSEED'))
RANDOMSEED = 1  # /dev/[x]random

random.seed(RANDOMSEED)

import numpy as np
np.random.seed(RANDOMSEED)    # Seed
print(np.random.get_state())  # State
np.random.rand(4, 2) # (rows, cols, [...])
np.random.randn(4, 2) # "standard normal" distribution
```
- http://docs.scipy.org/doc/numpy/reference/routines.random.html#distributions
Python hash randomization and algorithmic determinism:

python -R

https://docs.python.org/3/using/cmdline.html#cmdoption-R

PYTHONHASHSEED

https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED
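Noting the seed is only useful if it is stored with the result. A minimal sketch, assuming a hypothetical run_analysis step, that records the seed alongside the output and replays the run from the record:

```python
import json
import random


def run_analysis(seed):
    """Hypothetical analysis step: draw five pseudo-random samples."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.random() for _ in range(5)]


SEED = 1
result = {"seed": SEED, "samples": run_analysis(SEED)}
record = json.dumps(result)  # what would be written next to the result

# Re-running with the recorded seed reproduces the same samples.
replayed = run_analysis(json.loads(record)["seed"])
assert replayed == result["samples"]
```

Using a dedicated random.Random instance avoids depending on whatever else has called the module-level functions.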
Always Store Raw Data behind Plots
- Or, “Generate all plots from [source-controlled] [transforms-of] raw data”
- ./data
- ./tests/data
- ./nb/data (./notebooks)
- Data Visualization, Data Visualization Tools
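A minimal sketch of storing the data behind a plot: the plotted points are written to a CSV file under a data directory and read back unchanged. The file name, the column names, and the two helper functions are hypothetical; a temporary directory stands in for ./data.

```python
import csv
import tempfile
from pathlib import Path


def save_plot_data(path, header, rows):
    """Write the exact values behind a plot to a CSV file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def load_plot_data(path):
    """Read the header and the numeric rows back from the CSV file."""
    with path.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, [tuple(float(v) for v in row) for row in reader]


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "data" / "figure1.csv"
    points = [(0.0, 10.0), (1.0, 20.0), (2.0, 100.0)]
    save_plot_data(data_file, ["x", "y"], points)
    header, reloaded = load_plot_data(data_file)

assert reloaded == points
```

The plot can then be regenerated from the stored file rather than from values held in memory.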
Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
- pandas:
  - http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-stacking-and-unstacking
  - http://pandas.pydata.org/pandas-docs/stable/reshaping.html#combining-with-stats-and-groupby
- Schema.org: https://schema.org/docs/full.html
- SKOS:
  
  http://www.w3.org/TR/skos-reference/
  
  http://www.w3.org/TR/skos-reference/skos.html
  
  skos:narrower, skos:narrowerTransitive, skos:broader, skos:broaderTransitive, […]
- XKOS: “An SKOS extension for representing statistical classifications”
  
  http://rdf-vocabulary.ddialliance.org/xkos.html
- RDF Data Cubes: “The RDF Data Cube Vocabulary”
  
  qb:DataSet, qb:Dimension, qb:ObservationGroup, qb:Slice, […]
  
  http://www.w3.org/TR/vocab-data-cube/
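The idea of layered output can be sketched with the standard library alone; pandas groupby or stacking could play the same role. The rows and the hierarchical_summary helper below are hypothetical.

```python
from collections import defaultdict

# Hypothetical observations: (region, store, revenue)
rows = [
    ("north", "a", 80),
    ("north", "b", 100),
    ("south", "c", 20),
]


def hierarchical_summary(rows):
    """Summarize rows as grand total -> per region -> per store."""
    by_region = defaultdict(lambda: defaultdict(int))
    for region, store, revenue in rows:
        by_region[region][store] += revenue
    return {
        "total": sum(revenue for _, _, revenue in rows),
        "regions": {
            region: {"total": sum(stores.values()), "stores": dict(stores)}
            for region, stores in by_region.items()
        },
    }


summary = hierarchical_summary(rows)
print(summary["total"])                       # top layer
print(summary["regions"]["north"]["total"])   # one layer down
print(summary["regions"]["north"]["stores"])  # most detailed layer
```

Each layer can be inspected on its own, and every total can be checked against the layer beneath it.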
Connect Textual Statements to Underlying Results
- Linked Data: URIs, URLs, #uri-fragments
- Turtle / TriG: <> (this document, this named graph)
- ReStructuredText
  - http://sphinx-doc.org/rest.html#footnotes #citations #substitutions
  - https://github.com/yoloseem/awesome-sphinxdoc
- Linked Reproducibility: URIs, URLs, #uri-fragments
Provide Public Access to Scripts, Runs, and Results

#TenSimpleRules for Creating a Good Data Management Plan¶

“Ten Simple Rules for Creating a Good Data Management Plan”
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004525
DOI: 10.1371/journal.pcbi.1004525

Determine the Research Sponsor Requirements

Identify the Data to Be Collected

Define How the Data Will Be Organized

Explain How the Data Will Be Documented

Describe How Data Quality Will Be Assured

Present a Sound Data Storage and Preservation Strategy

Define the Project’s Data Policies

Describe How the Data Will Be Disseminated

Assign Roles and Responsibilities

Prepare a Realistic Budget

http://journals.plos.org/plosone/s/data-availability

> PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.

Data, Information, Knowledge, & Wisdom¶

https://en.wikipedia.org/wiki/Data

https://en.wikipedia.org/wiki/Information

https://en.wikipedia.org/wiki/Knowledge (see: Knowledge Engineering)

https://en.wikipedia.org/wiki/Wisdom

# Lead -> Gold

Data is information
Information is data
Raw data is not knowledge
Wisdom compares knowledges

Optimization¶

https://en.wikipedia.org/wiki/Mathematical_optimization

Find local and global optima (maxima and minima) within an n-dimensional field which may be limited by resource constraints.

# Global optima of a 1-dimensional list
points = [10, 20, 100, 20, 10]
global_max, global_min = max(points), min(points)
assert global_max == 100
assert global_min == 10

# Local optima of a 1-dimensional list
sample = points[:2]  # the first two points: [10, 20]
local_max, local_min = max(sample), min(sample)
assert local_max == 20
assert local_min == 10

# A 2-dimensional list ...
points = [(-0.5, 0),
          (0,  0.5),
          (0.5,  0),
          (0, -0.5)]
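Local optima can also be found by comparing each point with its neighbors instead of slicing a sample by hand. This is a minimal sketch for 1-dimensional lists; the local_maxima and local_minima helpers are hypothetical, and endpoints are compared only with the one neighbor they have.

```python
def local_maxima(values):
    """Indices whose value is strictly greater than every existing neighbor."""
    found = []
    for i, value in enumerate(values):
        neighbors = values[max(i - 1, 0):i] + values[i + 1:i + 2]
        if all(value > n for n in neighbors):
            found.append(i)
    return found


def local_minima(values):
    """Indices whose value is strictly less than every existing neighbor."""
    return local_maxima([-v for v in values])


points_1d = [10, 20, 100, 20, 10]
assert local_maxima(points_1d) == [2]
assert local_minima(points_1d) == [0, 4]

series = [1, 3, 2, 5, 4]
assert local_maxima(series) == [1, 3]
assert local_minima(series) == [0, 2, 4]

# For (x, y) pairs, one possible objective is the y coordinate.
points_2d = [(-0.5, 0), (0, 0.5), (0.5, 0), (0, -0.5)]
highest = max(points_2d, key=lambda p: p[1])
assert highest == (0, 0.5)
```

The global maximum is always among the local maxima, so max() over those indices gives the same answer as max() over the whole list.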

Smoothies¶

Data

Inputs, Outputs

Revenue:

2014-01-01 1200 CDT  $80
2014-01-01 1210 CDT  $100
2014-01-01 1500 CDT  $20

Expenses:

2014-01-01 wages     $256 ($8/hr * 8hrs * 4 people)
2014-01-01 utilities $100

Information

Aggregations, Tendencies

Revenue (gross):

2014-01-01  total: $200

Expenses:

2014-01-01  total: $356

Net:

2013-01-01  net:  -$200
2014-01-01  net:  -$156

On Mondays, we usually (on (simple) average) make about $500.
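The 2014-01-01 aggregations above can be derived from the raw data with a few lines of code. A minimal sketch, using the revenue and expense figures listed under Data; the variable names are assumptions.

```python
# Raw data (from the Data section above)
revenue = [
    ("2014-01-01 1200 CDT", 80),
    ("2014-01-01 1210 CDT", 100),
    ("2014-01-01 1500 CDT", 20),
]
expenses = [
    ("2014-01-01", "wages", 8 * 8 * 4),  # $8/hr * 8hrs * 4 people
    ("2014-01-01", "utilities", 100),
]

# Information: aggregations of the raw data
total_revenue = sum(amount for _, amount in revenue)
total_expenses = sum(amount for _, _, amount in expenses)
net = total_revenue - total_expenses

assert total_revenue == 200
assert total_expenses == 356
assert net == -156
```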

Knowledge

Positive net revenue is good.
One customer is worth the world to us.

Wisdom

We could save money by not being open on New Year's Day, but our loyal customers would not be happy about that.

Body Temperature¶

Data

time, body temp, outdoor temp, indoors/outdoors
time, exercise type, intensity, duration

Information

Daily temperature variance is about n degrees

Knowledge

Walking outside when it is warm increases body temperature
Walking outside when it is cold decreases body temperature
Exercise increases body temperature

Wisdom

If it’s 1745, and body temperature is n degrees above baseline, I’m probably walking outside and it is hot out.

Theory¶

See:

Systematic Review¶

Wikipedia: https://en.wikipedia.org/wiki/Systematic_review

Meta-analysis¶

Wikipedia: https://en.wikipedia.org/wiki/Meta-analysis

Linked Reproducibility¶

Hashtag: #LinkedReproducibility
Twitter: https://twitter.com/hashtag/LinkedReproducibility
Wrdrddocs: LinkedReproducibility

Note

This heading is now merged into a separate page: LinkedReproducibility

Math¶

https://en.wikipedia.org/wiki/Mathematics

https://en.wikipedia.org/wiki/Outline_of_mathematics

https://en.wikipedia.org/wiki/Mathematics_education#Methods

Math Courses¶

Project Euler¶

Wikipedia: https://en.wikipedia.org/wiki/Project_Euler

Homepage: https://projecteuler.net/

Project Euler is a very well-known set of math algorithm problems with free online grading.

Rosalind¶

Web: http://rosalind.info/

Rosalind hosts a number of Python-based Bioinformatics and Data Science Problems and Exercises with free online grading.

Mathematical Notation¶

See:

Knowledge Engineering > Symbols
Units > Units and RDF

LaTeX¶

Wikipedia: https://en.wikipedia.org/wiki/LaTeX
LearnXinYMinutes: https://learnxinyminutes.com/docs/latex/
Docs: http://en.wikibooks.org/wiki/LaTeX

https://en.wikipedia.org/wiki/LaTeX#Example
“A Primer on Using LaTeX in Jupyter Notebooks” http://data-blog.udacity.com/posts/2016/10/latex-primer/

https://en.wikipedia.org/wiki/Comparison_of_TeX_editors

https://en.wikipedia.org/wiki/LyX
https://twitter.com/wstein389/status/1002446637908811776

Completely new LaTeX editor in https://cocalc.com . Open source, is written in React, has unlimited multipanel views, realtime collab, records all edits (TimeTravel), forward an inverse search, clickable links in the PDF,supports SageTex out of the box, and autoformat…

https://github.com/jupyterlab/jupyterlab-latex

https://www.google.com/search?q=collaborative+latex

latex2sympy¶

Pypi: https://pypi.org/project/latex2sympy3/
Src: https://github.com/augustt198/latex2sympy
Doc: https://docs.sympy.org/latest/modules/parsing.html

latex2sympy converts from LaTeX to Python code that works with the SymPy CAS (Computer Algebra System).

latex2sympy is now integrated with SymPy as sympy.parsing.latex.parse_latex: https://docs.sympy.org/latest/modules/parsing.html

#! pip install antlr4-python3-runtime
#! conda install -y antlr-python-runtime
from sympy.parsing.latex import parse_latex

parse_latex(r'\frac{n(n+1)(2n+1)}{6}')
# ((2*n + 1)*n(n + 1))/6
parse_latex(r'\prod\limits_{i=1}^n x = x^n')
# LaTeXParsingError: I don't understand this
# \prod\limits_{i=1}^n x = x^n
# ~~~~~^

MathJax¶

Wikipedia: https://en.wikipedia.org/wiki/MathJax
Homepage: https://www.mathjax.org/
Docs: https://docs.mathjax.org/en/latest/input/tex/

MathJax is a JavaScript library for displaying MathML, LaTeX, and ASCIIMathML markup in a browser.

http://meta.math.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference

Jupyter and LaTeX¶

Jupyter Notebook supports a number of different ways to include LaTeX/MathTeX in a notebook with MathJax:

In a Markdown cell, wrap the LaTeX in double dollar signs: $$:

$$c = \sqrt{a^2 + b^2}$$

Note that these render differently:
$$x = share price_today^2 $$
$$x = {share price}_{today}^2 $$
$$x = \text{share price}_{today}^2 $$
$$x = \textit{share price}_{today}^2 $$

To display a LaTeX expression inline (without surrounding newline), wrap it in single dollar signs: $:
```
The Pythagorean theorem, $c = \sqrt{a^2 + b^2}$, looks curiously
like the quantum probability amplitude equation.
```
To display multiple regular dollar signs, escape them with double-backslash \\:
```
One dollar sign: \\$ and another \\$
```
Start a code cell with the %%latex cell magic:
```
%%latex
c = \sqrt{a^2 + b^2}
```

Wrap a latex block with $ and \begin{align}:

$
\begin{align}
\textit{Earnings Per Share} & = \frac{\textit{Earnings}}{\textit{Market Value Per Share}} \\
\textit{EPS} & = \frac{\textit{Earnings}}{\textit{Share Price}}
\end{align}
$

Call the display() function with one or more Math/Latex objects, or just return a Math/Latex object:

from IPython.display import Math
Math(r'c = \sqrt{a^2 + b^2}')

from IPython.display import Math, Latex, display
display(
   Math(r'c = \sqrt{a^2 + b^2}'),
   Latex(r'\begin{align}' + '\n' + r'y = mx+b' + '\n' + r'\end{align}'))

Resources for learning Jupyter and LaTeX:

MathML¶

Wikipedia: https://en.wikipedia.org/wiki/MathML

ASCIIMathML¶

Wikipedia: https://en.wikipedia.org/wiki/ASCIIMathML

ASCII
MathML

Information Theory¶

https://en.wikipedia.org/wiki/Information_theory

https://en.wikipedia.org/wiki/Entropy_(information_theory)

https://en.wikipedia.org/wiki/Signal_(electrical_engineering)

https://en.wikipedia.org/wiki/Noise_(signal_processing)

https://en.wikipedia.org/wiki/Signal-to-noise_ratio

https://en.wikipedia.org/wiki/Probability_theory

https://en.wikipedia.org/wiki/Quantum_information_science

https://en.wikipedia.org/wiki/Quantum_information

Linear Algebra¶

https://en.wikipedia.org/wiki/Linear_algebra

Linear Algebra Software¶

Calculus¶

https://en.wikipedia.org/wiki/Calculus

Calculus Software¶

Statistics¶

https://en.wikipedia.org/wiki/Statistics

https://en.wikipedia.org/wiki/Outline_of_statistics

https://en.wikipedia.org/wiki/Category:Statistics

Parametric Statistics¶

https://en.wikipedia.org/wiki/Parametric_statistics

Regression Analysis¶

https://en.wikipedia.org/wiki/Regression_analysis

https://en.wikipedia.org/wiki/Template:Regression_bar

Nonparametric Statistics¶

https://en.wikipedia.org/wiki/Nonparametric_statistics

Descriptive Statistics¶

https://en.wikipedia.org/wiki/Descriptive_statistics

Statistical Inference¶

https://en.wikipedia.org/wiki/Statistical_inference

Causality¶

https://en.wikipedia.org/wiki/Causality

https://en.wikipedia.org/wiki/Correlation_and_dependence

https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation

https://en.wikipedia.org/wiki/Sensitivity_analysis

https://en.wikipedia.org/wiki/Receiver_operating_characteristic

https://en.wikipedia.org/wiki/Post_hoc_ergo_propter_hoc

Probability and Statistics Software¶

Analysis¶

https://en.wikipedia.org/wiki/Data_analysis

https://en.wikipedia.org/wiki/Big_data

https://en.wikipedia.org/wiki/Data_processing#Data_processing_functions

Learning¶

https://en.wikipedia.org/wiki/Learning

https://en.wikipedia.org/wiki/Autodidacticism

https://en.wikipedia.org/wiki/Perceptual_learning

https://en.wikipedia.org/wiki/Pattern_recognition_(psychology)#False_pattern_recognition

https://en.wikipedia.org/wiki/Rhetoric

https://en.wikipedia.org/wiki/Socratic_method

https://en.wikipedia.org/wiki/Socratic_questioning

https://en.wikipedia.org/wiki/Platonic_dialogue#The_dialogues

https://en.wikipedia.org/wiki/Dialectic

https://en.wikipedia.org/wiki/Dialogue

https://en.wikipedia.org/wiki/Perturbation_theory_(quantum_mechanics)

https://en.wikipedia.org/wiki/Validated_learning

https://en.wikipedia.org/wiki/Organizational_learning

See: Knowledge Engineering

Data Mining¶

https://en.wikipedia.org/wiki/Data_mining

https://en.wikipedia.org/wiki/Knowledge_extraction

https://en.wikipedia.org/wiki/Extract,_transform,_load

Data Dredging¶

Wikipedia: https://en.wikipedia.org/wiki/Data_dredging

!
Causality
spurious correlations
- http://tylervigen.com/spurious-correlations

Machine Learning¶

Wikipedia: https://en.wikipedia.org/wiki/Machine_learning
Awesome: https://github.com/onurakpolat/awesome-bigdata
Awesome: https://github.com/josephmisiti/awesome-machine-learning

https://en.wikipedia.org/wiki/Online_machine_learning

Deep Learning¶

Wikipedia: https://en.wikipedia.org/wiki/Deep_learning

Datasets¶

awesome-public-datasets¶

https://github.com/caesar0301/awesome-public-datasets

https://github.com/caesar0301/awesome-public-datasets#search-engines

Awesome¶

https://github.com/bayandin/awesome-awesomeness

Tools¶

ETL¶

Wikipedia: https://en.wikipedia.org/wiki/Extract,_transform,_load

https://en.wikipedia.org/wiki/Extract,_transform,_load#Real-life_ETL_cycle

Workflow¶

Scientific Method
Project Management
https://en.wikipedia.org/wiki/Checklist
https://en.wikipedia.org/wiki/Scientific_workflow_system
Units of measure
I/O Transforms of information(/energy)

“Data Provenance”, “Data Lineage”

See:

Techniques¶

Automated Workflows¶

Standard, Automated Workflows

Q: Is there confirmation bias in starting with e.g. simple regression analysis?

Q: Which factors did we know we were capturing?

5 ★ Linked Open Data¶

http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆

Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).

☆☆

Publish structured data on the Web in a machine-readable format (e.g. XML).

☆☆☆

Publish structured data on the Web in a documented, non-proprietary data format (e.g. CSV, KML).

☆☆☆☆

Publish structured data on the Web as RDF (e.g. Turtle, RDFa, JSON-LD, SPARQL.)

☆☆☆☆☆

In your RDF, have the identifiers be links (URLs) to useful data sources.

—http://5stardata.info/
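As a rough sketch of what the fourth and fifth stars look like in practice, the following describes a dataset as JSON-LD with schema.org terms (Dataset, DataDownload, license, distribution, encodingFormat, contentUrl) and uses links as identifiers. The example.org URLs, the dataset itself, and the linked_values helper are hypothetical.

```python
import json

dataset = {
    "@context": "https://schema.org/",
    "@id": "https://example.org/datasets/smoothie-sales",  # hypothetical URL
    "@type": "Dataset",
    "name": "Smoothie sales",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/datasets/smoothie-sales.csv",
    },
}


def linked_values(node):
    """Collect every string value in the document that is an http(s) link."""
    links = []
    if isinstance(node, dict):
        for value in node.values():
            links.extend(linked_values(value))
    elif isinstance(node, list):
        for value in node:
            links.extend(linked_values(value))
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        links.append(node)
    return links


print(json.dumps(dataset, indent=2))
print(linked_values(dataset))
```

The explicit license covers the first star, the CSV distribution the second and third, and the linked identifiers the fourth and fifth.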

See: Knowledge Engineering, Semantic Web Standards

Data Visualization¶

Wikipedia: https://en.wikipedia.org/wiki/Data_visualization

Visualizing Data Science¶

The Data Science Venn Diagram

Field representations

Data Visualization Tools¶

Matplotlib¶

Wikipedia: https://en.wikipedia.org/wiki/Matplotlib
Homepage: https://matplotlib.org/
Src: https://github.com/matplotlib/matplotlib
Docs: https://matplotlib.org/contents.html

Scipy lectures: http://scipy-lectures.github.io/intro/matplotlib/matplotlib.html
Scientific-python-lectures: http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb
http://stanford.edu/~mwaskom/software/seaborn/index.html
http://tonysyu.github.com/mpltools/auto_examples/index.html#style-package
http://mpld3.github.io/ (Matplotlib + D3.js)
conda install matplotlib (Conda (Anaconda))

pandas plot functions generate matplotlib charts.

Seaborn¶

Src: https://github.com/mwaskom/seaborn
Docs: http://seaborn.pydata.org/
Docs: http://seaborn.pydata.org/examples/

“Seaborn is a Python visualization library based on Matplotlib. It provides a high-level interface for drawing attractive statistical graphics.”

Mayavi¶

Wikipedia: https://en.wikipedia.org/wiki/MayaVi
Src: https://github.com/enthought/mayavi
Docs: http://docs.enthought.com/mayavi/mayavi/

“Mayavi: 3D scientific data visualization and plotting in Python”
Scipy lectures: https://scipy-lectures.github.io/packages/3d_plotting/

Bokeh¶

Src: https://github.com/bokeh/bokeh

Docs: https://bokeh.pydata.org/

VisPy¶

Homepage: http://vispy.org/ (OpenGL)

Src: https://github.com/vispy/vispy

Vega¶

Homepage: https://trifacta.github.io/vega/

Vincent¶

Src: https://github.com/wrobstory/vincent

Plotly¶

Wikipedia: https://en.wikipedia.org/wiki/Plotly

Homepage: https://plot.ly/

PyQtGraph¶

http://www.pyqtgraph.org/ (OpenGL)

qgrid¶

Src: https://github.com/quantopian/qgrid

SlickGrid with IPython Notebook / Jupyter Notebook and pandas support

D3.js¶

Wikipedia: https://en.wikipedia.org/wiki/D3.js

Homepage: http://d3js.org/

Three.js¶

Wikipedia: https://en.wikipedia.org/wiki/Three.js

Homepage: http://threejs.org/

(WebGL)

Google ARCore Web is built on Three.js
React VR is built on Three.js

Sigmajs¶

Homepage: http://sigmajs.org/

Graphs in JavaScript
