Confidence Intervals
https://gallery.shinyapps.io/CLT_mean/
We shall cover key concepts such as sampling variability, confidence intervals, and hypothesis testing, with a real-world example from a 2011 Pew Research Center survey.
Survey Context
- Title: Young, Underemployed, and Optimistic
- Method: Telephone interviews (Dec 6–19, 2011)
- Sample: 2,048 adults (18+ years) in the continental U.S.
- Purpose: Measure public opinion on which age group is struggling most in the economy
Key Findings
- 41% ± 2.9% (95% CI): Estimated that 38.1% to 43.9% of the public believe young adults are having the toughest time.
- 49% ± 4.4% (95% CI): Among adults aged 18–34, estimated that 44.6% to 53.4% took unwanted jobs just to pay bills.
Margin of Error
- General sample: ±2.9%
- Subsample (18–34): ±4.4%
- Confidence Level: 95%
Statistical Concepts Introduced
- Point Estimate: Sample statistic used to estimate population parameters (e.g. 41%, 49%)
- Confidence Interval (CI): A range that likely contains the true population parameter with a given level of confidence (e.g., 95%)
- Sampling Variability: Different samples yield different estimates
- Central Limit Theorem (CLT): Explains how the distribution of sample means approximates a normal distribution under certain conditions
- Inference Techniques:
- Confidence Intervals
- Hypothesis Testing
- Statistical vs Practical Significance: Emphasis on interpreting results beyond just p-values
- Statistical Power: Discussed in the context of effect size, sample size, and significance level
Sampling and Sample Distributions
Suppose we have a population of interest.
We take a random sample from this population and calculate a sample statistic, such as the sample mean ($\bar{x}$).
Now suppose we repeat this process many times:
- Take a random sample
- Calculate the sample mean
- Record the result
- Repeat steps 1–3 many times
Each sample will have its own sample distribution — this is the distribution of individual values within that specific sample (e.g., heights of 100 randomly selected women).
Sampling Distribution
The sampling distribution is the distribution of a sample statistic (like $\bar{x}$) across many samples.
- Each data point in this distribution is not an individual from the population, but a summary statistic from a sample.
- In our example: each point is a sample mean $\bar{x}_1, \bar{x}_2, \bar{x}_3, \dots$
Example
Suppose we’re interested in the average height of U.S. women.
Let’s define:
- Population mean: $\mu$
- Population standard deviation: $\sigma$
- Sample size: $n$
If we take many random samples of size $n$, the sampling distribution of the sample mean $\bar{x}$ will have:
- Mean: $\mu_{\bar{x}} = \mu$
- Standard deviation (Standard Error):
\(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)
This is where the Central Limit Theorem becomes essential.
Central Limit Theorem (CLT)
Statement:
If random samples of size $n$ are taken from any population with mean $\mu$ and standard deviation $\sigma$, then the sampling distribution of the sample mean $\bar{x}$:
- Will be approximately normal
- Has mean $\mu_{\bar{x}} = \mu$
- Has standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
As long as:
- $n$ is large (typically $n \geq 30$)
- The population is not extremely skewed or has extreme outliers
Population and Parameter
Our population of interest is U.S. women, denoted with size $N$.
The parameter of interest is the average height of all women in the U.S., denoted as $\mu$.
From complete data, we assume $\mu = 65$ inches.
The population standard deviation is denoted by $\sigma$.
Sampling Across States
We take random samples of 1000 women from each U.S. state.
For each state, we denote individual observations as $x_{s,i}$, where $s$ is the state and $i = 1, \dots, 1000$.
Each state’s sample yields a sample mean $\bar{x}_s$.
This results in 50 sample means, one per state, forming the sampling distribution.
Properties of the Sampling Distribution
- The mean of the sampling distribution $\mu_{\bar{x}}$ will be approximately equal to $\mu$, the population mean.
- The spread (standard deviation) of the sampling distribution is the standard error, denoted as:
\(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\)
- Because $n = 1000$, the standard error is much smaller than $\sigma$.
- For example, if $\sigma = 20$, then:
\(\sigma_{\bar{x}} = \frac{20}{\sqrt{1000}} \approx 0.63\)
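As a quick sanity check, this standard error can be computed in a few lines of Python (a sketch; the values $\sigma = 20$ and $n = 1000$ come from the example above):

```python
import math

sigma = 20   # population standard deviation (from the example)
n = 1000     # sample size per state

# Standard error of the sample mean: SE = sigma / sqrt(n)
se = sigma / math.sqrt(n)
print(round(se, 3))  # ≈ 0.632 — much smaller than sigma = 20
```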
Effect of Increasing Sample Size
As sample size $n$ increases:
- The variability in sample means decreases.
- The sampling distribution becomes narrower (i.e., less spread).
- This is evident visually and numerically.
For example, with $\sigma = 20$:
- With $n = 45$, $SE = 20/\sqrt{45} \approx 2.98$
- With $n = 500$, $SE = 20/\sqrt{500} \approx 0.89$, producing a much narrower sampling distribution
Central Limit Theorem (CLT)
The Central Limit Theorem states:
The distribution of sample statistics (like the sample mean) is nearly normal, centered at the population mean, and has standard deviation equal to the population standard deviation divided by the square root of the sample size.
Mathematical Formulation
\[\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\]
- Shape: Nearly normal
- Center: $\mu$ (population mean)
- Spread: Standard error = $\displaystyle \frac{\sigma}{\sqrt{n}}$
If $\sigma$ (population standard deviation) is unknown, use the sample standard deviation $s$ to estimate standard error.
Conditions for the CLT to Apply
- Independence
- Observations must be independent.
- Achieved through:
- Random sampling (observational studies)
- Random assignment (experiments)
- If sampling without replacement: ensure $n < 10\%$ of the population.
- Sample Size / Skew
- If the population is normal, the sample size does not matter.
- If the population is skewed or unknown:
- Sample size should be large for CLT to apply.
- Rule of thumb: $n > 30$ for moderately skewed distributions.
10% Condition for Independence
If sampling without replacement, $n$ should be less than 10% of the population. This ensures approximate independence among observations.
Example: In a population of 1000, a sample size of 10 makes it unlikely to include multiple members of the same family. But if 500 are sampled, the chance of including genetically related individuals increases, violating independence.
When sampling with replacement, each draw is independent by design, so the 10% condition does not apply. However, in realistic survey settings, we sample without replacement, so the 10% condition becomes important.
Sample Size and Skew
The shape of the sampling distribution depends on the population distribution and sample size.
If the population distribution is normal, the sampling distribution of the sample mean is nearly normal regardless of sample size.
If the population distribution is skewed:
- Small $n$ leads to a skewed sampling distribution
- Larger $n$ yields a more symmetric, unimodal, nearly normal sampling distribution
Rule of thumb: For moderately skewed populations, CLT applies when $n > 30$
Simulation Examples
- Extremely right-skewed population: $n = 10$ leads to skewed sampling distribution
- Increase to $n = 100$ or $n = 200$: sampling distribution becomes more symmetric and closer to normal
- Uniform population distribution with $n = 15$: sampling distribution already looks symmetric and unimodal
- Right-skewed population with $n = 15$: sampling distribution is skewed
- Increase to $n = 500$: distribution appears nearly normal
- Left-skewed population: $n = 500$ yields nearly normal sampling distribution; $n = 24$ or $n = 12$ yields skewed distribution
If the population is not very skewed, even a small $n = 12$ may result in a nearly normal sampling distribution
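The simulation results described above can be reproduced with a small stdlib-Python sketch (an assumption of this sketch: an exponential population stands in for a generic right-skewed distribution):

```python
import random
import statistics

random.seed(42)

def sample_means(n, reps=1000):
    """Means of `reps` random samples of size n from a right-skewed
    (exponential, mean 1) population."""
    return [statistics.fmean(random.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

means_small = sample_means(10)    # skew still visible at n = 10
means_large = sample_means(200)   # nearly normal at n = 200

# The spread of the sample means shrinks roughly by sqrt(200/10)
print(statistics.stdev(means_small), statistics.stdev(means_large))
```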
Guided Practice: Sampling Variability
Question: Which of the following would you expect to be more variable?
- Overall population distribution of all women in the US.
- Distribution of sample means of random samples of 1000 women from each state.
Answer: Overall population distribution of all women in the US.
Explanation: The sample mean is an average of many observations, and due to the Central Limit Theorem, the distribution of sample means will have lower variability and be more tightly clustered around the population mean. In contrast, individual observations in the full population vary more widely, making the overall population distribution more variable.
Guided Practice: Visualizing Distribution Shape
Question: Which of the below visualizations is not appropriate for checking the shape of the distribution of the sample, and hence the population?
- histogram
- boxplot
- normal probability plot
- barplot
Answer: barplot
Explanation: A barplot is used for displaying categorical data, not numerical distributions. To check the shape of a numerical sample distribution, visualizations like histograms, boxplots, and normal probability plots are appropriate, as they display the spread, skew, and modality of the data.
Suppose my iPod has 3,000 songs. The histogram below shows the distribution of the lengths of these songs. We also know that, for this iPod, the mean length is 3.45 minutes and the standard deviation is 1.63 minutes. Calculate the probability that a randomly selected song lasts more than 5 minutes.
We are asked to calculate $P(X > 5)$, where $X$ is the length of a randomly selected song.
Although we are given the mean and standard deviation, the distribution is clearly right-skewed and not approximately normal, so using Z-scores or normal distribution tables is not appropriate.
Instead, we can directly estimate the probability from the histogram. From the histogram bars, we observe:
- 5–6 minutes: 350 songs
- 6–7 minutes: 100 songs
- 7–8 minutes: 25 songs
- 8–9 minutes: 20 songs
- 9–10 minutes: 5 songs
So the number of songs longer than 5 minutes is:
\[350 + 100 + 25 + 20 + 5 = 500\]
Since there are 3,000 songs in total, the probability that a randomly selected song lasts more than 5 minutes is:
\[P(X > 5) = \frac{500}{3000} \approx 0.17\]
So, approximately 17% of the songs on the iPod last more than 5 minutes.
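The same arithmetic as a tiny Python sketch (the bin counts are the ones read off the histogram above):

```python
# Bin counts read off the histogram for songs longer than 5 minutes
counts_over_5 = [350, 100, 25, 20, 5]   # 5-6, 6-7, 7-8, 8-9, 9-10 minutes
total_songs = 3000

p_over_5 = sum(counts_over_5) / total_songs
print(round(p_over_5, 3))  # 500/3000 ≈ 0.167
```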
Playlist Duration Probability
I’m about to take a trip to visit my parents and the drive is 6 hours. I make a random playlist of 100 songs. What is the probability that my playlist lasts the entire drive?
We need the playlist to last at least 6 hours, or 360 minutes. The total length of 100 songs should exceed 360 minutes:
\[P(X_1 + X_2 + \cdots + X_{100} > 360)\]
This is equivalent to:
\[P(\bar{X} > 3.6)\]
where $\bar{X}$ is the average length of the 100 randomly chosen songs.
By the Central Limit Theorem:
\[\bar{X} \sim N\left(\mu = 3.45,\ SE = \frac{\sigma}{\sqrt{n}} = \frac{1.63}{\sqrt{100}} = 0.163\right)\]
Now, we compute the Z-score:
\[Z = \frac{3.6 - 3.45}{0.163} = 0.92\]
Using the standard normal distribution:
\[P(\bar{X} > 3.6) = P(Z > 0.92) \approx 0.179\]
So, there is approximately a 17.9% chance that the playlist will last the entire 6-hour drive.
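This calculation can be checked with Python's stdlib `statistics.NormalDist` (a sketch using the CLT parameters derived above):

```python
import math
from statistics import NormalDist

mu, sigma, n = 3.45, 1.63, 100
se = sigma / math.sqrt(n)      # standard error: 0.163

# Sampling distribution of the mean under the CLT
xbar_dist = NormalDist(mu, se)

p = 1 - xbar_dist.cdf(3.6)     # P(x-bar > 3.6)
print(round(p, 3))             # ≈ 0.179
```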
Explanation of Central Limit Theorem Application
By the Central Limit Theorem (CLT), when we take a large enough sample (usually $n \geq 30$) from any population with a finite mean and standard deviation, the sampling distribution of the sample mean $\bar{X}$ will be approximately normal.
In this case, we are sampling $n = 100$ songs from a population where:
- the population mean $\mu = 3.45$ minutes
- the population standard deviation $\sigma = 1.63$ minutes
According to the CLT:
\[\bar{X} \sim N(\mu, SE)\]
where $SE$ is the standard error of the mean, given by:
\[SE = \frac{\sigma}{\sqrt{n}}\]
Plugging in our values:
\[SE = \frac{1.63}{\sqrt{100}} = \frac{1.63}{10} = 0.163\]
So, the sampling distribution of the average song length from a playlist of 100 songs is approximately normal, with:
\[\bar{X} \sim N(3.45, 0.163)\]
This allows us to compute probabilities using the standard normal (Z) distribution.
Matching Distributions with Sampling Contexts
We are given four histograms and asked to match each with one of the following descriptions:
- The distribution for a population ($\mu = 10, \sigma = 7$)
- A single random sample of 100 observations from this population
- A distribution of 100 sample means from random samples with size 7
- A distribution of 100 sample means from random samples with size 49
To do this, we use the Central Limit Theorem, which says that the sampling distribution of the sample mean becomes approximately normal as the sample size $n$ increases, regardless of the shape of the original population.
Reasoning:
- The population distribution is given directly: it is right-skewed with large variability. This is the bottom-left plot.
- Plot C is the most symmetric and bell-shaped. Since the CLT says that larger samples make the sampling distribution of the mean more normal, this must be the distribution of sample means with $n = 49$.
- Plot B is also right-skewed and has a wide spread, similar to the population distribution. This must be the single random sample of 100 observations.
- Plot A is a bit more symmetric than B but still rough, with moderate spread. It is the distribution of sample means with $n = 7$.
Final Matching:
- Population distribution: Bottom-left large plot
- Plot A: Distribution of 100 sample means ($n = 7$)
- Plot B: Single random sample of 100 observations
- Plot C: Distribution of 100 sample means ($n = 49$)
Guided Practice
Given that the population is right skewed, which of the following distributions will resemble a normal distribution most closely?
- a single random sample of 100 observations from this population
- a distribution of 100 sample means from random samples with size 7
- a distribution of 100 sample means from random samples with size 49
Answer: A distribution of 100 sample means from random samples with size 49.
Explanation: The Central Limit Theorem tells us that the sampling distribution of the sample mean becomes approximately normal as the sample size increases, regardless of the shape of the original population. Since the population is right skewed, individual samples or small sample sizes will still reflect this skewness. However, when we take larger samples (e.g., of size 49), the distribution of the sample means will be much closer to normal. Thus, the third option will resemble a normal distribution most closely.
Behavioral Asymmetry Question
One of the earliest examples of behavioral asymmetry is a preference in humans for turning the head to the right, rather than to the left, during the final weeks of gestation and for the first 6 months after birth. This is thought to influence subsequent development of perceptual and motor preferences. A study of 124 couples found that 64.5% turned their heads to the right when kissing. The standard error associated with this estimate is roughly 4%. Which of the below is false?
(a) A higher sample size would yield a lower standard error.
(b) The margin of error for a 95% CI for the percentage of kissers who turn their heads to the right is roughly 8%.
(c) The 95% CI for the percentage of kissers who turn their heads to the right is roughly 64.5% ± 4%.
(d) The 99.7% CI for the percentage of kissers who turn their heads to the right is roughly 64.5% ± 12%.
Answer: (c) The 95% CI for the percentage of kissers who turn their heads to the right is roughly 64.5% ± 4%.
The standard error (SE) is given as 4%, and confidence intervals (CIs) are computed using:
\[\text{Margin of Error} = z^* \cdot \text{SE}\]
For a 95% CI, the critical value is approximately $z^* = 2$, so:
\[\text{Margin of Error}_{95\%} \approx 2 \cdot 4\% = 8\%\]
Thus, the 95% CI should be approximately:
\[64.5\% \pm 8\%\]
Option (b) is true.
Option (d) is also true: for a 99.7% CI, $z^* \approx 3$, so:
\[\text{Margin of Error}_{99.7\%} \approx 3 \cdot 4\% = 12\%\]
Option (a) is true because increasing sample size reduces the standard error. Conceptually, larger samples yield less variable point estimates; mathematically, $SE = \sigma/\sqrt{n}$, so $n$ and the standard error are inversely related — as $n$ goes up, $SE$ goes down.
Option (c) is false because it uses the standard error directly instead of the full margin of error for a 95% confidence level.
Confidence interval for a population mean
A confidence interval for a population mean is computed as:
\[\bar{x} \pm z^* \cdot \frac{s}{\sqrt{n}}\]
where $\bar{x}$ is the sample mean, $z^*$ is the critical value corresponding to the desired confidence level, $s$ is the sample standard deviation, and $n$ is the sample size.
Conditions for using the confidence interval
- Independence
- Observations must be independent
- Achieved via random sampling or random assignment
- If sampling without replacement: $n < 10\%$ of the population
- Sample size and skew
- If population distribution is normal, any sample size is acceptable
- Otherwise, $n \geq 30$, or larger if population distribution is very skewed
Determining the critical value
The critical value $z^*$ corresponds to the middle percentage of the normal distribution associated with the confidence level. For example:
- For a 95% confidence interval:
- Middle area = 0.95
- Tails = $(1 - 0.95)/2 = 0.025$ on each side
- Look up 0.025 in the standard normal table
- Lower bound $z = -1.96$, upper bound $z = 1.96$
- So, $z^* = 1.96$
- For a 98% confidence interval:
- Tails = $(1 - 0.98)/2 = 0.01$ on each side
- Cumulative from left = $1 - 0.01 = 0.99$
- $z^* = \text{qnorm}(0.99) = 2.33$
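The `qnorm` lookups above have a stdlib-Python equivalent via `NormalDist.inv_cdf` (a sketch; `z_star` is a hypothetical helper name, not a standard function):

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, sd 1

def z_star(conf_level):
    """Critical value for a two-sided CI: upper-tail area is (1 - CL) / 2."""
    return std_normal.inv_cdf(1 - (1 - conf_level) / 2)

print(round(z_star(0.95), 2))  # 1.96
print(round(z_star(0.98), 2))  # 2.33
```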
Accuracy and precision of confidence intervals
Accuracy refers to whether or not the confidence interval contains the true population parameter. Precision refers to the width of the confidence interval.
Confidence level
The confidence level is the proportion of intervals (constructed from repeated random samples) that are expected to contain the true population parameter. For example, using the formula:
\[\text{CI} = \text{point estimate} \pm z^* \cdot \text{SE}\]
about 95% of intervals constructed this way from repeated samples will contain the true population mean $\mu$.
Confidence level is a choice
In practice, we typically select the confidence level rather than calculate it. Common levels are 90%, 95%, 98%, and 99%. The confidence level affects the value of the critical value $z^*$.
Confidence level and width
As the confidence level increases, the critical value $z^*$ increases, which increases the margin of error and hence the width of the confidence interval:
\[\text{Margin of error} = z^* \cdot \text{SE}\]
A higher confidence level leads to greater accuracy (more likely to capture $\mu$), but lower precision (wider interval).
Trade-off example
A weather forecast stating tomorrow’s temperature will be between $-20^\circ$F and $110^\circ$F is accurate but not precise. Wide intervals may be accurate but not informative.
Increasing both accuracy and precision
To achieve both higher accuracy and higher precision, increase the sample size $n$. This reduces the standard error:
\[\text{SE} = \frac{s}{\sqrt{n}}\]
Reducing SE decreases the margin of error while keeping the same confidence level, leading to narrower (more precise) intervals that still capture the true parameter reliably.
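A quick sketch of this effect: quadrupling the sample size halves the margin of error at a fixed confidence level (illustrative values; $s = 1.63$ is borrowed from the song-length example):

```python
import math

z_star = 1.96   # 95% confidence
s = 1.63        # sample standard deviation (borrowed from the song example)

# Quadrupling n halves the margin of error at the same confidence level
for n in (100, 400):
    me = z_star * s / math.sqrt(n)
    print(n, round(me, 3))
```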
Evaluating statements about a confidence interval
Based on a 2010 GSS survey of 1,154 U.S. residents, a 95% confidence interval for the average number of hours Americans relax after a workday was found to be 3.53 to 3.83 hours. Let’s evaluate the truth of each statement using the explanations from the video.
(a) 95% of Americans spend 3.53 to 3.83 hours relaxing after a work day.
False
This statement incorrectly interprets the confidence interval as describing individuals in the population. A confidence interval gives a range for the average (the population mean), not individual data points. The interval says nothing about what percentage of individuals fall within it. So this is a misuse of the confidence interval.
(b) 95% of random samples of 1,154 Americans will yield confidence intervals that contain the true average number of hours Americans spend relaxing after a work day.
True
This is the correct interpretation of a 95% confidence level. If we repeatedly took random samples of size 1,154 and constructed a 95% confidence interval for each, about 95% of those intervals would contain the true population mean. This aligns directly with the definition of confidence level.
(c) 95% of the time the true average number of hours Americans spend relaxing after a work day is between 3.53 and 3.83 hours.
False
This suggests that the population mean is a moving target or that it sometimes falls in the interval and sometimes doesn’t. But the population mean is fixed—it either is or isn’t in the interval we computed. The randomness is in the sampling process, not in the population parameter. Therefore, this statement misrepresents what a confidence interval means.
(d) We are 95% confident that Americans in this sample spend on average 3.53 to 3.83 hours relaxing after a work day.
False
This describes the sample mean, but we already know the sample mean—it lies at the center of the interval. We are 100% sure what the sample mean is. Confidence intervals are constructed to estimate the population mean, not the mean of the sample. Therefore, this statement confuses the known sample statistic with the unknown population parameter.
Evaluating statements about the GSS confidence interval
The General Social Survey (GSS) is a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. In 2010, the survey collected responses from 1,154 US residents. Based on the survey results, a 95% confidence interval for the average number of hours Americans have to relax or pursue activities that you enjoy after an average work day is 3.53 to 3.83 hours. Which of the following is false?
We are given:
- Sample size $n = 1154$
- 95% confidence interval: 3.53 to 3.83 hours
Let’s go through each statement to determine which is false, and explain why.
- Increasing the confidence level would result in a more accurate but less precise confidence interval.
True
A higher confidence level (e.g., 99%) would widen the interval, making it more likely to contain the true population mean (higher accuracy). However, this comes at the cost of a wider (less precise) interval. So, this statement is correct.
- The standard error is approximately 0.075 hours.
We are given a 95% confidence interval:
- Lower bound = 3.53
- Upper bound = 3.83
From this, we can compute:
- Sample mean $\bar{x} = \frac{3.53 + 3.83}{2} = 3.68$
- Margin of error (half the width of the interval):
\(\text{ME} = \frac{3.83 - 3.53}{2} = 0.15\)
At 95% confidence, we use critical value $z^* \approx 1.96$.
Recall:
\(\text{Margin of Error} = z^* \times SE\)
So:
\(0.15 = 1.96 \times SE \Rightarrow SE = \frac{0.15}{1.96} \approx 0.0765\)
This is close to 0.075, so this statement is true.
- The sample mean is 3.68 hours.
True
We already calculated the sample mean:
\(\bar{x} = \frac{3.53 + 3.83}{2} = 3.68\)
So this statement is correct.
- The margin of error is 0.3 hours.
False
We already calculated the margin of error:
\(\text{ME} = \frac{3.83 - 3.53}{2} = 0.15\)
This statement claims it’s 0.3 hours, which is double the actual value.
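The quantities checked above (sample mean, margin of error, standard error) can all be backed out from the interval endpoints; a sketch in Python:

```python
lower, upper = 3.53, 3.83   # the reported 95% CI
z_star = 1.96               # critical value at 95% confidence

x_bar = (lower + upper) / 2   # sample mean: 3.68
me = (upper - lower) / 2      # margin of error: 0.15
se = me / z_star              # standard error: ~0.0765

print(round(x_bar, 2), round(me, 2), round(se, 4))
```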
Backtracking to n for a Given ME
Given a target margin of error, confidence level, and information on the variability of the sample (or the population), we can determine the required sample size to achieve the desired margin of error.
The formula for the margin of error is:
\(ME = z^* \cdot \frac{s}{\sqrt{n}}\)
Rearranging to solve for sample size:
\(n = \left( \frac{z^* \cdot s}{ME} \right)^2\)
As the sample size $n$ increases, the standard error $\frac{s}{\sqrt{n}}$ decreases, which makes the margin of error smaller. This results in narrower confidence intervals that maintain the same confidence level, thus increasing both accuracy and precision.
Sample Size for Desired Margin of Error
A group of researchers wants to test the possible effect of an epilepsy medication taken by pregnant mothers on the cognitive development of their children. As evidence, they want to estimate the IQ scores of three-year-old children born to mothers who were on this medication during pregnancy.
Previous studies suggest that the SD of IQ scores of three-year-old children is 18 points.
How many such children should the researchers sample in order to obtain a 90% confidence interval with a margin of error less than or equal to 4 points?
Given:
- $\sigma = 18$
- $z^* = 1.65$ (for 90% confidence)
- $ME \leq 4$
Use the formula:
\(n = \left( \frac{z^* \cdot \sigma}{ME} \right)^2\)
\(n = \left( \frac{1.65 \cdot 18}{4} \right)^2\)
\(n = (7.425)^2 = 55.13\)
Since we can't have 0.13 of a person, we need to round. Mathematically, 55.13 rounds down to 55, but 55.13 is the minimum required sample size, so rounding down would fall short of the target margin of error. In minimum sample size calculations we therefore always round up, regardless of the decimal. We need at least $56$ children in the sample to achieve a margin of error of 4 points or less at 90% confidence.
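The sample-size calculation, including the round-up step, as a short Python sketch:

```python
import math

sigma = 18     # SD of IQ scores (from prior studies)
z_star = 1.65  # 90% confidence, as used in the notes
me = 4         # target margin of error

n_exact = (z_star * sigma / me) ** 2
n_required = math.ceil(n_exact)   # always round UP for a minimum sample size

print(round(n_exact, 2), n_required)  # 55.13 56
```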
Impact of Decreasing Margin of Error
We found that we needed at least 56 children in the sample to achieve a maximum margin of error of 4 points. How would the required sample size change if we want to further decrease the margin of error to 2 points?
From the margin of error formula:
\(ME = \frac{z^* \cdot \sigma}{\sqrt{n}}\)
If we reduce the margin of error from $4$ to $2$ (cutting it in half), the sample size must increase by a factor of $4$ (since $n$ is under a square root).
So, new required sample size:
\(n_{\text{new}} = 4 \cdot 56 = 224\)
To decrease the margin of error to $2$ points, we need at least $224$ children in the sample.
Sample Size to Narrow Confidence Interval
A given confidence interval is calculated based on a random sample of n observations. If we want to make this interval narrower (1/3 of what it is now), how many observations should we sample?
From the margin of error formula:
\(ME = \frac{z^* \cdot \sigma}{\sqrt{n}}\)
Let $ME_{\text{new}} = \frac{1}{3} ME_{\text{old}}$
Then:
\(\frac{1}{3} ME = \frac{z^* \cdot \sigma}{\sqrt{n_{\text{new}}}}\)
Divide both sides:
\(\frac{1}{3} \cdot \frac{z^* \cdot \sigma}{\sqrt{n}} = \frac{z^* \cdot \sigma}{\sqrt{n_{\text{new}}}}\)
Simplify:
\(\frac{1}{3\sqrt{n}} = \frac{1}{\sqrt{n_{\text{new}}}}\)
Take reciprocal:
\(3\sqrt{n} = \sqrt{n_{\text{new}}}\)
Square both sides:
\(9n = n_{\text{new}}\)
We need to sample $9n$ observations to narrow the interval to one-third of its original width. Since $n$ is under the square root, making the margin of error $1/3$ of its current value requires increasing the sample size by a factor of $3^2 = 9$.
The General Social Survey asks: “For how many days during the past 30 days was your mental health, which includes stress, depression, and problems with emotions, not good?” Based on responses from 1,151 US residents, the survey reported a 95% confidence interval of 3.40 to 4.24 days in 2010.
Interpret this interval in context of the data.
We are 95% confident that Americans on average have 3.40 to 4.24 bad mental health days per month.
In this context, what does a 95% confidence level mean?
95% of random samples of 1,151 Americans will yield CIs that capture the true population mean of number of bad mental health days per month.
Suppose the researchers think a 99% confidence level would be more appropriate for this interval. Will this new interval be narrower or wider than the 95% confidence interval?
Wider: as the confidence level increases, so does the width of the confidence interval.
If a new survey asking the same questions was to be done with 500 Americans, would the standard error of the estimate be larger, smaller, or about the same. Assume the standard deviation has remained constant since 2010.
Answer: larger.
If sample size decreases, all else held constant, the standard error will increase: $\text{SE}=\frac{s}{\sqrt{n}}$.
The Central Limit Theorem (CLT) is about the distribution of point estimates: given certain conditions, this distribution will be nearly normal.
In the case of the mean, the CLT tells us that if:
- (1a) The sample size is sufficiently large, i.e., $n \geq 30$, and the data are not extremely skewed, or
- (1b) The population distribution is known to be normal, and
- (2) The observations in the sample are independent.
Then the distribution of the sample mean will be nearly normal, centered at the true population mean, and with a standard error of:
\[SE = \frac{\sigma}{\sqrt{n}}\]
That is,
\[\bar{x} \sim N\left(\text{mean} = \mu,\ \text{SE} = \frac{\sigma}{\sqrt{n}}\right)\]
When the population distribution is unknown, condition (1a) can be checked using a histogram or other visualization of the observed data.
As the sample size increases, the shape of the population distribution matters less. When $n$ is very large, the sampling distribution of the sample mean will be approximately normal, regardless of the shape of the population distribution.
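A small simulation illustrating this claim (an assumption of this sketch: an exponential(1) population, which is clearly non-normal, with mean and SD both equal to 1):

```python
import math
import random
import statistics

random.seed(0)

n, reps = 50, 2000   # sample size, number of repeated samples

# Population: exponential(1) — right-skewed, mu = 1, sigma = 1
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

# Sampling distribution of x-bar: centered near mu, spread near sigma/sqrt(n)
print(round(statistics.fmean(means), 2))   # close to mu = 1.0
print(round(statistics.stdev(means), 2))   # close to sigma/sqrt(50) ≈ 0.14
```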