Data Science

https://en.wikipedia.org/wiki/Data_science

Open Source Tools:

datasciencemasters.org

“The Open Source Data Science Masters”

Ten Simple Rules

#TenSimpleRules for Reproducible Computational Research

“Ten Simple Rules for Reproducible Computational Research”
DOI: 10.1371/journal.pcbi.1003285 Featured in PLOS Collections
  1. For Every Result, Keep Track of How It Was Produced

  2. Avoid Manual Data Manipulation Steps

  3. Archive the Exact Versions of All External Programs Used

  4. Version Control All Custom Scripts

  5. Record All Intermediate Results, When Possible in Standardized Formats

  6. For Analyses That Include Randomness, Note Underlying Random Seeds

    Python random functions:

    import os
    print(os.environ.get('PYTHONHASHSEED'))  # None unless set before startup
    RANDOMSEED = 1  # /dev/[x]random
    
    import random
    random.seed(RANDOMSEED)
    
    import numpy as np
    np.random.seed(RANDOMSEED)    # Seed
    print(np.random.get_state())  # State
    np.random.rand(4, 2) # (rows, cols, [...])
    np.random.randn(4, 2) # "standard normal" distribution
    

    Python hash randomization and algorithmic determinism:

  7. Always Store Raw Data behind Plots

  8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

  9. Connect Textual Statements to Underlying Results

  10. Provide Public Access to Scripts, Runs, and Results
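
For rule 6, string hashes are a less obvious source of run-to-run variation than the random number generators: they differ between interpreter processes unless PYTHONHASHSEED is fixed before startup. The sketch below is only an illustration, and hash_in_subprocess is a hypothetical helper name. It starts fresh interpreters with the variable set and compares the results.

```python
import os
import subprocess
import sys


def hash_in_subprocess(text, hashseed):
    """Return hash(text) as computed by a fresh interpreter that was
    started with PYTHONHASHSEED=hashseed (hypothetical helper)."""
    # Copy the current environment and pin the hash seed for the child.
    env = dict(os.environ, PYTHONHASHSEED=str(hashseed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two separate processes with the same seed agree on the hash value.
first = hash_in_subprocess("smoothie", 1)
second = hash_in_subprocess("smoothie", 1)
print(first == second)
```

Setting the variable inside an already running program has no effect on that process, which is why the sketch launches child interpreters.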

#TenSimpleRules for Creating a Good Data Management Plan

“Ten Simple Rules for Creating a Good Data Management Plan”
DOI: 10.1371/journal.pcbi.1004525
  1. Determine the Research Sponsor Requirements

  2. Identify the Data to Be Collected

  3. Define How the Data Will Be Organized

  4. Explain How the Data Will Be Documented

  5. Describe How Data Quality Will Be Assured

  6. Present a Sound Data Storage and Preservation Strategy

  7. Define the Project’s Data Policies

  8. Describe How the Data Will Be Disseminated

  9. Assign Roles and Responsibilities

  10. Prepare a Realistic Budget

http://journals.plos.org/plosone/s/data-availability

> PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.

Data, Information, Knowledge, & Wisdom

https://en.wikipedia.org/wiki/Data

https://en.wikipedia.org/wiki/Information

https://en.wikipedia.org/wiki/Knowledge (see: Knowledge Engineering)

https://en.wikipedia.org/wiki/Wisdom

# Lead -> Gold
  • Data is information

  • Information is data

  • Raw data is not knowledge

  • Wisdom compares knowledges

Optimization

https://en.wikipedia.org/wiki/Mathematical_optimization

Find local and global optima (maxima and minima) within an n-dimensional field which may be limited by resource constraints.

# Global optima of a 1-dimensional list
points = [10, 20, 100, 20, 10]
global_max, global_min = max(points), min(points)
assert global_max == 100
assert global_min == 10

# Local optima of a 1-dimensional list
sample = points[:2]
local_max, local_min = max(sample), min(sample)
assert local_max == 20
assert local_min == 10

# A 2-dimensional list ...
points = [(-0.5, 0),
          (0,  0.5),
          (0.5,  0),
          (0, -0.5)]
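
The windowed min/max above is one reading of "local". Another common reading is a point that beats both of its immediate neighbors. A minimal sketch under that assumption follows; local_optima is a hypothetical helper, it considers interior points only, and it ignores plateaus.

```python
def local_optima(values):
    """Return (maxima, minima): indices of interior points that are
    strictly greater than, or strictly less than, both neighbors."""
    maxima, minima = [], []
    for i in range(1, len(values) - 1):
        left, mid, right = values[i - 1], values[i], values[i + 1]
        if mid > left and mid > right:
            maxima.append(i)
        elif mid < left and mid < right:
            minima.append(i)
    return maxima, minima


points = [10, 20, 100, 20, 10]
maxima, minima = local_optima(points)
assert maxima == [2]   # 100 is the only interior peak
assert minima == []    # the low values sit on the boundary

# A list with several peaks has several local maxima.
wavy = [1, 3, 2, 5, 4]
assert local_optima(wavy) == ([1, 3], [2])
```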

Smoothies

Data

Inputs, Outputs

Revenue:

2014-01-01 1200 CDT  $80
2014-01-01 1210 CDT  $100
2014-01-01 1500 CDT  $20

Expenses:

2014-01-01 wages     $256 ($8/hr * 8hrs * 4 people)
2014-01-01 utilities $100

Information

Aggregations, Tendencies

Revenue (gross):

2014-01-01  total: $200

Expenses:

2014-01-01  total: $356

Net:

2013-01-01  net:  -$200
2014-01-01  net:  -$156

On Mondays, we make about $500 on (simple) average.
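
The step from Data to Information for 2014-01-01 is plain aggregation. A small sketch using the figures listed above; the variable names are illustrative only.

```python
# Raw data for 2014-01-01, copied from the Revenue and Expenses listings (USD).
revenue = [80, 100, 20]
expenses = {
    "wages": 8 * 8 * 4,   # $8/hr * 8 hrs * 4 people
    "utilities": 100,
}

# Information: aggregate the raw records into daily totals.
gross_revenue = sum(revenue)
total_expenses = sum(expenses.values())
net = gross_revenue - total_expenses

print(f"gross: {gross_revenue}  expenses: {total_expenses}  net: {net}")
```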

Knowledge

  • Positive net revenue is good.

  • One customer is worth the world to us.

Wisdom

We could save money by not being open on New Year's Day, but our loyal customers would not be happy about that.

Body Temperature

Data

time, body temp, outdoor temp, indoors/outdoors
time, exercise type, intensity, duration

Information

Daily temperature variation is about n degrees
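
The "about n degrees" figure could be derived from logged readings with the standard library. The readings below are hypothetical values in degrees Celsius, made up purely for illustration. The sketch reports a range and a standard deviation, both of which are in degrees (a variance would be in degrees squared).

```python
import statistics

# Hypothetical body-temperature readings over one day, in degrees Celsius.
readings_c = [36.5, 36.8, 37.1, 37.0, 36.6]

mean_c = statistics.mean(readings_c)          # central tendency
spread_c = statistics.pstdev(readings_c)      # population standard deviation
range_c = max(readings_c) - min(readings_c)   # highest minus lowest reading

print(round(mean_c, 1), round(spread_c, 2), round(range_c, 1))
```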

Knowledge

  • Walking outside when it is warm increases body temperature

  • Walking outside when it is cold decreases body temperature

  • Exercise increases body temperature

Wisdom

If it’s 1745, and body temperature is n degrees above baseline, I’m probably walking outside and it is hot out.

Theory

Science

https://en.wikipedia.org/wiki/Outline_of_science

https://en.wikipedia.org/wiki/Category:Science

Cognitive Biases

https://en.wikipedia.org/wiki/Heuristics_in_judgment_and_decision-making

https://en.wikipedia.org/wiki/List_of_cognitive_biases

https://en.wikipedia.org/wiki/Critical_thinking

Open Science

https://en.wikipedia.org/wiki/Peer_review

Scientific Method

https://en.wikipedia.org/wiki/Argument

https://en.wikipedia.org/wiki/Empirical_evidence

https://en.wikipedia.org/wiki/Hypothesis

Reproducibility

See:

Systematic Review
Meta-analysis
Linked Reproducibility
Hashtag: #LinkedReproducibility

Note

This heading is now merged into a separate page: LinkedReproducibility

Math

https://en.wikipedia.org/wiki/Mathematics

https://en.wikipedia.org/wiki/Outline_of_mathematics

https://en.wikipedia.org/wiki/Mathematics_education#Methods

Math Courses

Project Euler

Project Euler is a very well-known set of math algorithm problems with free online grading.

Rosalind

Rosalind hosts a number of Python-based Bioinformatics and Data Science Problems and Exercises with free online grading.

Mathematical Notation

See:

LaTeX

https://en.wikipedia.org/wiki/Comparison_of_TeX_editors

latex2sympy

latex2sympy converts from LaTeX to Python code that works with the SymPy CAS (Computer Algebra System).

  • latex2sympy is now integrated with SymPy as sympy.parsing.latex.parse_latex: https://docs.sympy.org/latest/modules/parsing.html

    #! pip install antlr4-python3-runtime
    #! conda install -y antlr-python-runtime
    from sympy.parsing.latex import parse_latex
    
    parse_latex(r'\frac{n(n+1)(2n+1)}{6}')
    # ((2*n + 1)*n(n + 1))/6
    parse_latex(r'\prod\limits_{i=1}^n x = x^n')
    # LaTeXParsingError: I don't understand this
    # \prod\limits_{i=1}^n x = x^n
    # ~~~~~^
    
MathJax

MathJax is a JavaScript library for displaying MathML, LaTeX, and ASCIIMathML markup in a browser.

Jupyter and LaTeX

Jupyter Notebook supports a number of different ways to include LaTeX/MathTeX in a notebook with MathJax:

  1. In a Markdown cell, wrap the LaTeX in double dollar signs: $$:

    $$c = \sqrt{a^2 + b^2}$$
    
    Note that these render differently:
    $$x = share price_today^2 $$
    $$x = {share price}_{today}^2 $$
    $$x = \text{share price}_{today}^2 $$
    $$x = \textit{share price}_{today}^2 $$
    
  2. To display a LaTeX expression inline (without surrounding newline), wrap it in single dollar signs: $:

    The Pythagorean theorem, $c = \sqrt{a^2 + b^2}$, looks curiously
    like the quantum probability amplitude equation.
    

    To display multiple regular dollar signs, escape them with double-backslash \\:

    One dollar sign: \\$ and another \\$
    
  3. Start a code cell with the %%latex cell magic:

    %%latex
    c = \sqrt{a^2 + b^2}
    
  4. Wrap a latex block with $ and \begin{align}:

    $
    \begin{align}
    \textit{Earnings Per Share} & = \frac{\textit{Earnings}}{\textit{Market Value Per Share}} \\
    \textit{EPS} & = \frac{\textit{Earnings}}{\textit{Share Price}}
    \end{align}
    $
    
  5. Call the display() function with one or more Math/Latex objects, or just return a Math/Latex object:

    from IPython.display import Math
    Math(r'c = \sqrt{a^2 + b^2}')
    
    from IPython.display import Math, Latex, display
    display(
       Math(r'c = \sqrt{a^2 + b^2}'),
       Latex(r'\begin{align}' + '\n' + 'y = mx+b' + '\n' + r'\end{align}'))
    

Resources for learning Jupyter and LaTeX:

MathML
ASCIIMathML

Information Theory

https://en.wikipedia.org/wiki/Information_theory

https://en.wikipedia.org/wiki/Entropy_(information_theory)

https://en.wikipedia.org/wiki/Signal_(electrical_engineering)

https://en.wikipedia.org/wiki/Noise_(signal_processing)

https://en.wikipedia.org/wiki/Signal-to-noise_ratio

https://en.wikipedia.org/wiki/Probability_theory

https://en.wikipedia.org/wiki/Quantum_information_science

Linear Algebra

https://en.wikipedia.org/wiki/Linear_algebra

Linear Algebra Software

Calculus

https://en.wikipedia.org/wiki/Calculus

Calculus Software

Statistics

https://en.wikipedia.org/wiki/Statistics

https://en.wikipedia.org/wiki/Outline_of_statistics

https://en.wikipedia.org/wiki/Category:Statistics

Parametric Statistics

https://en.wikipedia.org/wiki/Parametric_statistics

Regression Analysis

https://en.wikipedia.org/wiki/Regression_analysis

https://en.wikipedia.org/wiki/Template:Regression_bar

Nonparametric Statistics

https://en.wikipedia.org/wiki/Nonparametric_statistics

Descriptive Statistics

https://en.wikipedia.org/wiki/Descriptive_statistics

Statistical Inference

https://en.wikipedia.org/wiki/Statistical_inference

Causality

https://en.wikipedia.org/wiki/Causality

https://en.wikipedia.org/wiki/Correlation_and_dependence

https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation

https://en.wikipedia.org/wiki/Sensitivity_analysis

https://en.wikipedia.org/wiki/Receiver_operating_characteristic

https://en.wikipedia.org/wiki/Post_hoc_ergo_propter_hoc

Probability and Statistics Software

Analysis

https://en.wikipedia.org/wiki/Data_analysis

https://en.wikipedia.org/wiki/Big_data

https://en.wikipedia.org/wiki/Data_processing#Data_processing_functions

Learning

https://en.wikipedia.org/wiki/Learning

https://en.wikipedia.org/wiki/Autodidacticism

https://en.wikipedia.org/wiki/Perceptual_learning

https://en.wikipedia.org/wiki/Pattern_recognition_(psychology)#False_pattern_recognition

https://en.wikipedia.org/wiki/Rhetoric

https://en.wikipedia.org/wiki/Socratic_method

https://en.wikipedia.org/wiki/Socratic_questioning

https://en.wikipedia.org/wiki/Platonic_dialogue#The_dialogues

https://en.wikipedia.org/wiki/Dialectic

https://en.wikipedia.org/wiki/Dialogue

https://en.wikipedia.org/wiki/Perturbation_theory_(quantum_mechanics)

https://en.wikipedia.org/wiki/Validated_learning

https://en.wikipedia.org/wiki/Organizational_learning

See: Knowledge Engineering

Data Mining

https://en.wikipedia.org/wiki/Data_mining

https://en.wikipedia.org/wiki/Knowledge_extraction

https://en.wikipedia.org/wiki/Extract,_transform,_load

Data Dredging

Machine Learning

https://en.wikipedia.org/wiki/Online_machine_learning

Deep Learning

Datasets

awesome-public-datasets

https://github.com/caesar0301/awesome-public-datasets

Awesome

https://github.com/bayandin/awesome-awesomeness

Tools

ETL

Workflow

“Data Provenance”, “Data Lineage”

See:

Techniques

Automated Workflows

Standard, Automated Workflows

Q: Is there confirmation bias in starting with e.g. simple regression analysis?

Q: Which factors did we know we were capturing?

5 ★ Linked Open Data

http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆

Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).

☆☆

Publish structured data on the Web in a machine-readable format (e.g. XML).

☆☆☆

Publish structured data on the Web in a documented, non-proprietary data format (e.g. CSV, KML).

☆☆☆☆

Publish structured data on the Web as RDF (e.g. Turtle, RDFa, JSON-LD, SPARQL.)

☆☆☆☆☆

In your RDF, have the identifiers be links (URLs) to useful data sources.
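
As a rough illustration of the four- and five-star levels, the sketch below builds a small JSON-LD description with the standard json module and lists the values that are links. The example.org URLs and the dataset name are hypothetical placeholders. The vocabulary terms (Dataset, DataDownload, distribution, contentUrl, encodingFormat, license) come from schema.org, and iter_links is an illustrative helper.

```python
import json

# A JSON-LD description of a dataset; all example.org URLs are placeholders.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "@id": "https://example.org/datasets/smoothie-sales",
    "name": "Smoothie sales",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/datasets/smoothie-sales.csv",
    },
}


def iter_links(node):
    """Yield every string value in the document that is an http(s) URL."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_links(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_links(item)
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        yield node


doc = json.dumps(dataset, indent=2)   # serialized form to publish
links = sorted(set(iter_links(dataset)))
print(doc)
print(links)
```

The identifier ("@id"), the license, and the download location are all URLs that a consumer can follow, which is the point of the fifth star.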

http://5stardata.info/

See: Knowledge Engineering, Semantic Web Standards

Data Visualization

Visualizing Data Science

The Data Science Venn Diagram

Field representations

Data Visualization Tools

Matplotlib

  • pandas plot functions generate matplotlib charts.

Seaborn

  • “Seaborn is a Python visualization library based on Matplotlib. It provides a high-level interface for drawing attractive statistical graphics.”

Mayavi

Bokeh

VisPy

Vega

Vincent

Plotly

PyQtGraph

http://www.pyqtgraph.org/ (OpenGL)

qgrid

D3.js

Three.js

(WebGL)

  • Google ARCore Web is built on Three.js

  • React VR is built on Three.js

Sigmajs

See Also