Basic Stats
I keep forgetting basic stats every other month - would be nice to have a cute summary that gets you up to speed in 15 minutes or so.
I’m reading OpenIntro Statistics - Fourth Edition (free to download). This is a summary of chapters 1 and 2, and covers topics like: types of variables, sampling, causality, experimental design, mean, standard deviation, variance, Bessel’s correction, visualization, robust statistics, box plots, quartiles, median, data transformation, and hypothesis testing.
How do we know if a cure for a disease actually works? How do we find out? Yeah, you guessed it - STATS.
Clinical Experiments
Treatment group: Patients in the treatment group receive the medication.
Control group: Patients in the control group receive a placebo.
Group | Improvement | No Improvement | Total |
---|---|---|---|
Treatment (n=100) | 80 | 20 | 100 |
Control (n=100) | 10 | 90 | 100 |
Total | 90 | 110 | 200 |
(a) What percent of patients in the treatment group experienced improvement in symptoms? 80/100 * 100 = 80%.

(b) What percent experienced improvement in symptoms in the control group? 10/100 * 100 = 10%.

(c) In which group did a higher percentage of patients experience improvement in symptoms? The treatment group had a higher percentage (80%) than the control group (10%).

(d) One other possible explanation for the observed difference: while the data suggest that the treatment (possibly an antibiotic) is more effective than the placebo, another possible explanation is:
The placebo group may have had more severe or prolonged cases of sinusitis at baseline, or there may have been random variation in patient characteristics (like age, immune response, or coexisting conditions) despite random assignment.
Other possibilities:
- Self-reporting bias: Patients in the treatment group might expect to feel better and report improvement even if the effect is psychological (placebo effect).
- Sampling error: The sample size (n=100 per group) might be too small to rule out chance variation.
- Natural recovery: Sinusitis often resolves without treatment. Those in the treatment group may have coincidentally improved due to natural healing, not the medication.
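The arithmetic in parts (a)-(c) is simple enough to sanity-check in a few lines of Python, using the counts from the table above:

```python
# Counts taken from the treatment/control table above.
groups = {
    "treatment": {"improved": 80, "not_improved": 20},
    "control": {"improved": 10, "not_improved": 90},
}

def improvement_rate(counts):
    """Percent of patients in a group who experienced improvement."""
    total = counts["improved"] + counts["not_improved"]
    return 100 * counts["improved"] / total

rates = {name: improvement_rate(c) for name, c in groups.items()}
print(rates)  # {'treatment': 80.0, 'control': 10.0}
```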
Types of Variables
- Numerical: values are numbers where arithmetic makes sense.
  - Continuous: can take any value in a range (e.g., height).
  - Discrete: countable values (e.g., number of siblings).
- Categorical: values are categories (levels).
  - Ordinal: levels have a natural ordering (e.g., low / medium / high).
  - Nominal: no natural ordering (e.g., blood type).
Sampling
Sampling is selecting a part of a population to study.
- Simple random sample: Everyone has an equal chance of being chosen.
- Stratified sample: Population is split into groups (strata), and a random sample is taken from each group.
- Cluster sample: Population is divided into clusters, some clusters are randomly chosen, and all members in them are surveyed.
- Multistage sample: Combines several sampling methods (e.g., choose clusters, then randomly select people within them).
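The first two methods can be sketched with Python’s random module; the population and the urban/rural strata below are made up purely for illustration:

```python
import random

random.seed(0)

# Hypothetical population of 100 people, each tagged with a stratum label.
population = [(f"person{i:02d}", "urban" if i < 70 else "rural") for i in range(100)]

# Simple random sample: every member has an equal chance of being chosen.
srs = random.sample(population, 10)

# Stratified sample: split into strata, then sample randomly within each.
strata = {}
for person in population:
    strata.setdefault(person[1], []).append(person)

stratified = []
for members in strata.values():
    stratified += random.sample(members, len(members) // 10)  # ~10% of each stratum

print(len(srs), len(stratified))  # 10 10
```

Stratifying guarantees both urban and rural residents appear in the sample in proportion to their share of the population, which a simple random sample does not.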
Sampling Biases
- Non-response bias: Some people don’t respond, possibly affecting results.
- Convenience sample: Picking people who are easy to reach.
- Voluntary response bias: Only those with strong opinions respond.
- Undercoverage bias: Some groups are left out of the sample.
Observational Studies and Causality
In observational studies, no treatment is applied or withheld. They can show associations, but not causation.
An observational study finds that people who use more sunscreen are more likely to get skin cancer. Does sunscreen cause cancer? Likely not.
A confounding variable—like sun exposure—affects both sunscreen use (explanatory) and skin cancer risk (response). People in the sun more often both use more sunscreen and are more at risk.
Principles of Experimental Design
Randomized experiments are based on four key principles:
- Controlling: Keep variables consistent across groups to isolate the treatment effect. Example: All patients take a pill with 12 oz of water to control water intake differences.
- Randomization: Randomly assign participants to groups to balance out unknown or uncontrollable variables and avoid bias.
- Replication: Use a large sample to improve accuracy. Repeating studies also helps verify findings.
- Blocking: Group participants by variables known to affect the outcome (e.g., risk level) before randomizing within each group. Ensures treatment groups are balanced on those variables.
The first three principles are essential in any experiment; blocking is optional but useful when applicable.
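Blocking plus randomization can be sketched as follows; the participants and the risk-level blocking variable are made up for illustration:

```python
import random

random.seed(1)

# Hypothetical participants, each tagged with a blocking variable (risk level).
participants = [(f"p{i:02d}", "high-risk" if i < 8 else "low-risk") for i in range(16)]

def blocked_assignment(participants):
    """Randomize participants to treatment/control *within* each block."""
    treatment, control = [], []
    blocks = {}
    for pid, block in participants:
        blocks.setdefault(block, []).append(pid)
    for members in blocks.values():
        random.shuffle(members)      # randomization happens inside the block
        half = len(members) // 2
        treatment += members[:half]
        control += members[half:]
    return treatment, control

treatment, control = blocked_assignment(participants)
print(len(treatment), len(control))  # 8 8
```

Because the shuffle happens within each block, both groups are guaranteed to contain the same number of high-risk and low-risk participants.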
Reducing Bias in Human Experiments
Randomized experiments are ideal for studying cause-and-effect but can still suffer from bias, especially in human studies.
- Emotional Effects & Bias: Participants may react emotionally—those receiving treatment might feel hopeful, while control group members may feel neglected. This emotional effect can unintentionally bias results.
- Blinding: To reduce bias, researchers blind patients so they don’t know whether they’re receiving the treatment or not.
- Placebo: A fake treatment (e.g., sugar pill) given to control group participants to maintain blinding. Placebo Effect: When patients show slight improvement simply from believing they’re receiving treatment.
- Double-Blind Studies: Doctors and researchers can also introduce bias. A double-blind study ensures that neither the patients nor the medical staff know who is receiving the actual treatment. This helps maintain objectivity and improves the reliability of results.
Random Question - Chia seeds and weight loss
Chia Pets – those terra-cotta figurines that sprout fuzzy green hair – made the chia plant a household name. But chia has gained an entirely new reputation as a diet supplement. In one 2009 study, a team of researchers recruited 38 men and divided them randomly into two groups: treatment or control. They also recruited 38 women, and they randomly placed half of these participants into the treatment group and the other half into the control group. One group was given 25 grams of chia seeds twice a day, and the other was given a placebo. The subjects volunteered to be a part of the study. After 12 weeks, the scientists found no significant difference between the groups in appetite or weight loss.
(a) What type of study is this?
Experimental – participants were randomly assigned and conditions were controlled.
(b) What are the experimental and control treatments?
Treatment: 25g chia seeds twice daily
Control: Placebo
(c) Has blocking been used? If so, what is the blocking variable?
Yes. Blocking by gender (men and women split evenly).
(d) Has blinding been used?
Not explicitly stated, but likely single-blind due to use of a placebo.
(e) Can we make a causal statement? Can we generalize the results?
Causal: Yes, due to random assignment.
Generalization: Limited. Volunteers and small sample size reduce broader applicability.
Flawed Reasoning - Survey and Observational Study Biases
(a) Students at an elementary school are given a questionnaire that they are asked to return after their parents have completed it. One of the questions asked is, "Do you find that your work schedule makes it difficult for you to spend time with your kids after school?" Of the parents who replied, 85% said "no". Based on these results, the school officials conclude that a great majority of the parents have no difficulty spending time with their kids after school.
Flaw: Non-response bias – only those who responded are included, possibly skewing the results. The question is also leading, possibly inducing socially desirable answers.
Improvement: Increase response rate with follow-ups or incentives. Rephrase question neutrally (e.g., use options like "Often", "Sometimes", etc.). Acknowledge and account for non-response bias.
(b) A survey is conducted on a simple random sample of 1,000 women who recently gave birth, asking them about whether or not they smoked during pregnancy. A follow-up survey asking if the children have respiratory problems is conducted 3 years later. However, only 567 of these women are reached at the same address. The researcher reports that these 567 women are representative of all mothers.
Flaw: Attrition bias – only 567 of the 1,000 women could be reached for the follow-up. The remaining group may not be representative.
Improvement: Better tracking methods, compare demographics of those reached vs. not reached, adjust using statistical techniques, and report limitations due to attrition.
(c) An orthopedist administers a questionnaire to 30 of his patients who do not have any joint problems and finds that 20 of them regularly go running. He concludes that running decreases the risk of joint problems.
Flaw: Selection bias – only surveys his own healthy patients. No control group. Possibility of reverse causality (healthy joints lead to running).
Improvement: Use a prospective design tracking runners and non-runners over time. Include a control group and sample a more diverse population.
Mean, Standard Deviation, and Variance
- Mean (Average): The mean is the average value of a dataset.
\(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\)
Where:
- $x_i$ are the individual data values
- $\bar{x}$ is the sample mean
- $n$ is the number of values
- Variance measures how spread out the data are around the mean.

  Sample variance: \(s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2\)

  Population variance: \(\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2\)
- Standard Deviation is the square root of variance \(s = \sqrt{s^2}\) It is in the same unit as the data and describes how far data values typically are from the mean.
The variance is the average squared distance from the mean. The standard deviation is the square root of the variance. The standard deviation is useful when considering how far the data are distributed from the mean.
The standard deviation represents the typical deviation of observations from the mean. Usually about 70% of the data will be within one standard deviation of the mean and about 95% will be within two standard deviations. However, these percentages are not strict rules.
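Python’s standard library computes all three; note that statistics.variance and statistics.stdev are the sample versions (dividing by n - 1). The dataset here is made up:

```python
import statistics

data = [3, 7, 7, 19]

mean = statistics.mean(data)      # (3 + 7 + 7 + 19) / 4 = 9
var = statistics.variance(data)   # sample variance: 144 / (4 - 1) = 48
sd = statistics.stdev(data)       # square root of the sample variance

print(mean, var, sd)
```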
Bessel’s Correction
In sample statistics, we divide by $n - 1$ instead of $n$ when calculating sample variance.
Why?
Because using $n$ tends to underestimate the population variance.
Dividing by $n - 1$ corrects this bias and gives an unbiased estimator.
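The bias is easy to see by simulation. The “population” below is synthetic (standard normal draws, so its variance is about 1) purely for illustration:

```python
import random
import statistics

random.seed(42)

# Synthetic population with a known variance (standard normal, var ~= 1).
population = [random.gauss(0, 1) for _ in range(100_000)]
true_var = statistics.pvariance(population)

biased, unbiased = [], []
for _ in range(2000):
    sample = random.sample(population, 5)
    m = statistics.mean(sample)
    ss = sum((x - m) ** 2 for x in sample)
    biased.append(ss / 5)    # divide by n: systematically too small
    unbiased.append(ss / 4)  # divide by n - 1: Bessel's correction

print(statistics.mean(biased))    # noticeably below true_var
print(statistics.mean(unbiased))  # close to true_var
```

With samples of size 5, the n-divisor estimate averages about 4/5 of the true variance, which is exactly the bias Bessel’s correction removes.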
Sample Data
Value ($x_i$): 2, 4, 4, 4, 5, 5, 7, 9
- $n = 8$
- $\bar{x} = \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = 5$
Deviations and Squared Deviations:
$x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
---|---|---|
2 | -3 | 9 |
4 | -1 | 1 |
4 | -1 | 1 |
4 | -1 | 1 |
5 | 0 | 0 |
5 | 0 | 0 |
7 | 2 | 4 |
9 | 4 | 16 |
Total | 0 | 32 |
Sample Variance:
\[s^2 = \frac{32}{8 - 1} = \frac{32}{7} \approx 4.57\]

Sample Standard Deviation:
\[s = \sqrt{4.57} \approx 2.14\]

Box Plots, Quartiles, and the Median
- Box Plot: A box plot is a graphical summary of data that shows the distribution’s center and spread. It displays:
- Minimum
- First quartile (Q1)
- Median (Q2)
- Third quartile (Q3)
- Maximum

The “box” represents the interquartile range (IQR), from Q1 to Q3. “Whiskers” extend from the box to the minimum and maximum values, often excluding outliers.
- Median (Q2): The median is the middle value of a dataset when ordered from smallest to largest. If there are an even number of observations, it is the average of the two middle values.
- Quartiles: Quartiles split the data into four equal parts:
- Q1 (first quartile): the median of the lower half (25th percentile)
- Q2 (second quartile): the overall median (50th percentile)
- Q3 (third quartile): the median of the upper half (75th percentile)
- Interquartile Range (IQR): measures the middle 50% of the data:
\(\text{IQR} = Q3 - Q1\)
It is a measure of spread that is not affected by extreme values or outliers.
Example: Suppose we have the following sorted dataset:
3, 5, 7, 8, 9, 10, 12, 13, 15
- Median (Q2) = 9
- Q1 = median of [3, 5, 7, 8] = (5 + 7)/2 = 6
- Q3 = median of [10, 12, 13, 15] = (12 + 13)/2 = 12.5
- IQR = 12.5 - 6 = 6.5
- Outliers
An observation is typically considered an outlier if it is:
- Below Q1 - 1.5 * IQR
- Above Q3 + 1.5 * IQR
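The “median of halves” rule used above, along with the 1.5 × IQR fences, can be coded directly. (Library routines like numpy.percentile interpolate differently and may give slightly different quartiles.)

```python
def median(xs):
    """Median of an already-sorted list."""
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def quartiles(data):
    """Q1, Q2, Q3 via the median-of-halves rule (median excluded when n is odd)."""
    s = sorted(data)
    n = len(s)
    lower = s[:n // 2]
    upper = s[(n + 1) // 2:]
    return median(lower), median(s), median(upper)

data = [3, 5, 7, 8, 9, 10, 12, 13, 15]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(q1, q2, q3, iqr)  # 6.0 9 12.5 6.5
```

Any observation below -3.75 or above 22.25 would be flagged as an outlier for this dataset, so none of these values qualify.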
Robust Statistics
The median and IQR are called robust statistics because extreme observations have little effect on their values: moving the most extreme value generally has little influence on these statistics. On the other hand, the mean and standard deviation are more heavily influenced by changes in extreme observations, which can be important in some situations.
Transforming Data
When data are very strongly skewed, we sometimes transform them so they are easier to model, for example by applying a log transform.
https://en.wikipedia.org/wiki/Data_transformation_(statistics)
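A minimal sketch with a made-up, strongly right-skewed dataset:

```python
import math

# Made-up right-skewed data: most values small, one huge.
skewed = [1, 2, 2, 3, 5, 8, 40, 1000]
logged = [math.log10(x) for x in skewed]

# The raw values span three orders of magnitude; on the log scale the
# extreme value is pulled in and the spread becomes comparable.
print(max(skewed) - min(skewed))
print(max(logged) - min(logged))
```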
Data Visualization Tips
- Don’t use pie charts; bar plots convey the same information and are easier to read accurately.
- Use hollow histograms when comparing two or more distributions.
- Mosaic plots help visualize differences between groups of people and the decisions they make.
Hypothesis Testing
Data scientists are sometimes called upon to evaluate the strength of evidence. The running example here is a small vaccine trial with 20 patients split into a treatment group and a control group. When looking at the rates of infection for patients in the two groups, what comes to mind as we try to determine whether the data show convincing evidence of a real difference?
This is a reminder that the observed outcomes in the data sample may not perfectly reflect the true relationships between variables since there is random noise. While the observed difference in rates of infection is large, the sample size for the study is small, making it unclear if this observed difference represents efficacy of the vaccine or whether it is simply due to chance. We label these two competing claims, H0 and HA, which are spoken as “H-nought” and “H-A”:
H0 (Independence model): The variables treatment and outcome are independent. They have no relationship, and the observed difference between the proportions of patients who developed an infection in the two groups, 64.3%, was due to chance.

HA (Alternative model): The variables are not independent. The difference in infection rates of 64.3% was not due to chance, and the vaccine affected the rate of infection.
What would it mean if the independence model, which says the vaccine had no influence on the rate of infection, is true? It would mean 11 patients were going to develop an infection no matter which group they were randomized into, and 9 patients would not develop an infection no matter which group they were randomized into. That is, if the vaccine did not affect the rate of infection, the difference in the infection rates was due to chance alone in how the patients were randomized.
Now consider the alternative model: infection rates were influenced by whether a patient received the vaccine or not. If this was true, and especially if this influence was substantial, we would expect to see some difference in the infection rates of patients in the groups.
We choose between these two competing claims by assessing if the data conflict so much with H0 that the independence model cannot be deemed reasonable. If this is the case, and the data support HA, then we will reject the notion of independence and conclude the vaccine was effective.
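This reasoning can be simulated directly: under H0, shuffle the 11 infections and 9 non-infections across the two groups and count how often chance alone produces a difference as large as the observed 64.3%. The group sizes of 14 (treatment) and 6 (control) are an assumption taken from the OpenIntro malaria vaccine example this passage summarizes:

```python
import random

random.seed(3)

# Under H0, 11 patients get infected and 9 do not, regardless of group.
outcomes = [1] * 11 + [0] * 9

# Observed difference in infection rates: control 6/6 minus treatment 5/14
# (group sizes and counts assumed from the OpenIntro vaccine example).
observed = 6 / 6 - 5 / 14  # ~= 0.643

diffs = []
for _ in range(10_000):
    random.shuffle(outcomes)  # re-randomize patients into groups of 14 and 6
    treatment, control = outcomes[:14], outcomes[14:]
    diffs.append(sum(control) / 6 - sum(treatment) / 14)

p_value = sum(d >= observed for d in diffs) / len(diffs)
print(round(p_value, 3))  # around 0.01: chance alone rarely produces such a gap
```

Because a gap this large almost never arises from shuffling alone, the data conflict with H0 and we would lean toward HA: the vaccine affected the rate of infection.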