Data Visualization and Descriptive Statistics
Learn how to summarize and visualize data effectively. Covers measures of center, spread, shape, graphical displays, and best practices for communicating data clearly.
What You'll Learn
- Calculate and interpret measures of center (mean, median, mode) and spread (range, IQR, standard deviation).
- Choose and create appropriate graphical displays for different data types.
- Describe distributions in terms of shape, center, spread, and unusual features.
1. Measures of Center and Spread
Descriptive statistics reduce a dataset to a few key numbers. Measures of center (mean, median) locate the typical value, while measures of spread (standard deviation, IQR, range) describe how much variability exists.
Key Points
- The mean is sensitive to outliers; the median is resistant.
- Standard deviation measures the typical distance of observations from the mean; IQR measures the spread of the middle 50% of the data.
- Use mean and SD for symmetric data; use median and IQR for skewed data or data with outliers.
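To make the contrast between sensitive and resistant measures concrete, here is a minimal sketch using Python's standard `statistics` module. The salary figures are hypothetical, and the quartiles use the module's "inclusive" method; other quartile conventions can give slightly different IQR values.

```python
import statistics

def summarize(values):
    """Return center and spread measures for a list of numbers."""
    # Quartile conventions differ between textbooks and tools;
    # "inclusive" is one common choice, assumed here for illustration.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "iqr": q3 - q1,
    }

# Hypothetical salaries in thousands of dollars.
typical = [40, 45, 50, 55, 60]
with_outlier = [40, 45, 50, 55, 300]  # same data, one extreme value

print(summarize(typical))
print(summarize(with_outlier))
```

In this example the single extreme value moves the mean from 50 to 98 and inflates the standard deviation, while the median (50) and the IQR (10) do not change.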
2. Graphical Displays
Graphs reveal patterns that numbers alone may miss. Histograms show the shape of a distribution, boxplots highlight quartiles and outliers, and scatterplots display relationships between two quantitative variables.
Key Points
- Histograms are best for displaying the shape and distribution of a single quantitative variable.
- Boxplots make it easy to compare distributions across groups and identify outliers using the 1.5*IQR rule.
- Scatterplots show the direction, form, and strength of the association between two variables.
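A histogram is built by grouping a quantitative variable into equal-width bins and counting the observations in each. The sketch below shows that binning step as a plain-text display; the function name `bin_counts`, the exam scores, and the bin settings are all hypothetical choices for illustration, and a plotting library would normally draw the bars.

```python
def bin_counts(values, start, bin_width, n_bins):
    """Count how many values fall in each bin [low, low + bin_width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // bin_width)  # index of the bin containing v
        if 0 <= k < n_bins:                # values outside the range are ignored
            counts[k] += 1
    return counts

# Hypothetical exam scores.
scores = [52, 55, 61, 63, 64, 68, 70, 71, 72, 75, 77, 78, 83, 85, 91]
counts = bin_counts(scores, start=50, bin_width=10, n_bins=5)

for k, c in enumerate(counts):
    low = 50 + 10 * k
    print(f"[{low}, {low + 10}): {'#' * c}")
```

The printed rows trace the shape of the distribution: a single peak in the 70s with fewer scores on either side. Changing `bin_width` can change the apparent shape, so it is worth trying more than one value.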
3. Describing Distributions
When describing a distribution, always address shape, center, spread, and any unusual features such as outliers or gaps. Using context-specific language makes your analysis meaningful and interpretable.
Key Points
- Shape categories include symmetric, left-skewed, right-skewed, uniform, and bimodal.
- Outliers should be investigated, not automatically removed; they may contain important information.
- Always describe statistics in the context of the data (e.g., "the median home price" not just "the median").
Key Takeaways
- The five-number summary (min, Q1, median, Q3, max) gives a compact picture of a distribution's center and spread and is the basis for boxplots.
- For a bell-shaped distribution, approximately 68% of data fall within one standard deviation of the mean.
- A z-score tells you how many standard deviations an observation is from the mean, enabling comparison across different scales.
- Bar charts are for categorical data; histograms are for quantitative data. Do not confuse the two.
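Two of these takeaways are easy to check by hand or in code. The sketch below computes a five-number summary and compares two z-scores; the data, the exam means and standard deviations, and the helper names are hypothetical, and the quartiles again assume the "inclusive" method of Python's `statistics.quantiles`.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

data = [2, 4, 4, 5, 7, 9, 10, 12, 15]
print(five_number_summary(data))

# Hypothetical comparison across two differently scaled exams.
z_a = z_score(85, mean=70, sd=10)     # exam A: 1.5 SDs above the mean
z_b = z_score(620, mean=500, sd=100)  # exam B: 1.2 SDs above the mean
print(z_a, z_b)
```

Although 620 is the larger raw score, the score of 85 sits further above its own exam's mean in standard-deviation units, which is the kind of cross-scale comparison z-scores make possible.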
Practice Questions
1. A dataset has mean 50, median 42, and a long right tail. Describe this distribution.
2. When is a boxplot more informative than a histogram?
FAQs
Common questions about this topic
Reporting both the mean and the median is good practice because the comparison reveals skewness. If they are close, the distribution is approximately symmetric. If they differ substantially, the distribution is skewed and the median is more representative of the typical value.
A common rule uses the IQR: any observation below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is flagged as a potential outlier. Z-scores beyond 2 or 3 in absolute value are another indicator. Always investigate outliers in context before deciding how to handle them.
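The IQR rule can be sketched in a few lines. The function name `iqr_outliers` and the data are hypothetical, and the quartiles assume the "inclusive" method of Python's `statistics.quantiles`; a different quartile convention may shift the fences slightly.

```python
import statistics

def iqr_outliers(values):
    """Flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR.

    Returns the two fences and the list of flagged values.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged

data = [2, 4, 4, 5, 7, 9, 10, 12, 40]
lower, upper, flagged = iqr_outliers(data)
print(lower, upper, flagged)
```

Here Q1 = 4 and Q3 = 10, so the IQR is 6 and the fences are -5 and 19; only the value 40 is flagged. The function only flags candidates; deciding what to do with them still requires looking at the context.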