9 Descriptive statistics & lifecycle of data science

Common data science project phases include defining a problem, collecting and cleaning data, initial data exploration and depending on the nature of the question driving the study, analyses based on the sample of data to learn more about the larger population and/or building predictive models. Descriptive statistics can come handy in most if not all phases of the project. For instance when collecting data we can keep track of number of experiments to be run to ensure big enough sample size and when monitoring predictive models we can summarize model performance.

flowchart LR
  A(Define problem) --> B(Collect data)
  B --> C(Clean data)
  C --> D(Explore data)
  D --> E(Inferential statistics)
  D --> F(Predictive modelling)
  E --> G(Communicate results)
  F --> G(Communicate results)

Figure 9.1: A schematic representation of the phases of the data science project.

Typically though, we rely the most on the descriptive statistics during the exploring data phase. This phase, is often called Exploratory Data Analysis and abbreviated as EDA. EDA was introduced in 1970s by John Tukey, american mathematician and statistician, to encourage statisticians to explore the data, and formulate hypotheses that could lead to new data collection and experiments. Prior the introduction of EDA, initial data analysis, IDA was used with a narrow focus on checking data quality and model assumptions required for statistical modeling and hypothesis testing.

Data science projects rarely require only one pass through the project phases and often one returns to previous steps many times given the results from the down-stream steps. For instance, one may perform EDA, discover and handle missing data, and redo the EDA. Or one may try some hypothesis tests that would lead to new questions for which one would repeat both EDA and inferential data analysis parts.

flowchart LR
  A(Define problem) --> B(Collect data)
  B --> C(Clean data)
  C --> D(Explore data)
  D --> E(Inferential statistics)
  D --> F(Predictive modelling)
  E --> G(Communicate results)
  F --> G(Communicate results)
  G --> A
  G --> B
  G --> C
  G --> D
  E --> A
  E --> B
  E --> C
  E --> D
  F --> B
  F --> C
  F --> D

Figure 9.2: A schematic representation of the phases of the data science project showing that one often returns to the earlier phases of the projects depending the outcome of the down-stream steps.