Distributions

2024-10-09

Visualizing data distributions

  • Summarizing complex datasets is crucial in data analysis, allowing us to share insights drawn from the data more effectively.

  • One common method is to use the average value to summarize a list of numbers.

  • For instance, a high school’s quality might be represented by the average score in a standardized test.

  • Sometimes, an additional value, the standard deviation, is added.

Visualizing data distributions

  • So, a report might say the scores were 680 \(\pm\) 50, boiling down a full set of scores to just two numbers.

  • But is this enough? Are we overlooking crucial information by relying solely on these summaries instead of the complete data?

  • Our first data visualization building block is learning to summarize lists of numbers or categories.

  • More often than not, the best way to share or explore these summaries is through data visualization.

Visualizing data distributions

  • The most basic statistical summary of a list of objects or numbers is its distribution.

  • Once a dataset has been summarized as a distribution, there are several data visualization techniques to effectively relay this information.

  • Understanding distributions is therefore essential for creating useful data visualizations.

  • Note: understanding distributions is also essential for understanding inference and statistical models.

Case study: describing student heights

  • Pretend that we have to describe the heights of our classmates to ET, an extraterrestrial who has never seen humans.

  • We ask students to report their height in inches.

  • We also ask them to report sex because males and females have different height distributions.

     sex height
1   Male     75
2   Male     70
3   Male     68
4   Male     74
5   Male     61
6 Female     65

Case study

  • One way to convey the heights to ET is to simply send him this list of 1,050 heights.

  • But there are much more effective ways to convey this information, and understanding the concept of a distribution will be key.

  • To simplify the explanation, we first focus on male heights.

  • We examine the female height data later.

Distributions

  • The most basic statistical summary of a list of objects or numbers is its distribution.

  • For example, with categorical data, the distribution simply describes the proportion of each unique category:


Female   Male 
 0.227  0.773 
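
As a minimal sketch of how such proportions can be computed in R, the example below uses a small toy vector invented for illustration (it is not the class survey, so the proportions differ from those shown above). The names `sex`, `counts`, and `props` are hypothetical.

```r
# Toy vector, invented for illustration (not the class survey)
sex <- c("Male", "Male", "Female", "Male", "Female", "Male", "Male", "Male")

# Count each unique category, then convert the counts to proportions
counts <- table(sex)
props <- prop.table(counts)
props
```

Passing `props` to `barplot()` draws the proportions as bars.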

Distributions

  • To visualize this we simply use a barplot.

  • Here is an example with US state regions:

Histograms

  • When the data is numerical, the task of displaying distributions is more challenging.

  • When data is not categorical, reporting the frequency of each entry, as we did for categorical data, is not an effective summary since most entries are unique.

  • For example, in our case study, while several students reported a height of 68 inches, only one student reported a height of 68.503937007874 inches and only one student reported a height of 68.8976377952756 inches.

Histograms

  • A more useful way to define a distribution for numeric data is to define a function that reports the proportion of the data at or below \(a\) for all possible values of \(a\).

  • This function is called the empirical cumulative distribution function (eCDF). It can be plotted, and it provides a full description of the distribution of our data.
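
The definition can be sketched in a few lines of R. The data here are simulated as a stand-in for the male heights (they are not the class data), and `F_hand` and `F_builtin` are hypothetical names. Note that base R's `ecdf()` counts entries less than or equal to \(a\).

```r
# Simulated stand-in for the male heights (NOT the class data);
# the mean and SD are the values quoted later in these slides.
set.seed(1)
x <- rnorm(500, mean = 69.3, sd = 3.6)

# eCDF written out by hand: proportion of entries at or below a
F_hand <- function(a) mean(x <= a)

# Base R builds the same function with ecdf()
F_builtin <- ecdf(x)

a <- 70
F_hand(a)
F_builtin(a)
```

Calling `plot(F_builtin)` draws the eCDF as a step function.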

Histograms

  • Here is the eCDF for male student heights:

Histograms

  • However, summarizing data by plotting the eCDF is actually not very popular in practice.

  • The main reason is that it does not easily convey characteristics of interest such as: at what value is the distribution centered? Is the distribution symmetric? What ranges contain 95% of the values?

  • Histograms sacrifice just a bit of information to produce plots that are much easier to interpret.

Histograms

Here is the histogram for the height data, splitting the range of values into one-inch intervals: \((49.5, 50.5]\), \((50.5, 51.5]\), \((51.5,52.5]\), \((52.5,53.5]\), \(...\), \((82.5,83.5]\).

  • From this plot we immediately learn some important properties about our data.
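
A histogram with one-inch bins of this kind can be sketched as follows. The data are simulated as a stand-in for the male heights, so the bin range is computed from the data rather than fixed at 49.5 to 83.5; `breaks` and `h` are hypothetical names.

```r
# Simulated stand-in for the male heights (NOT the class data)
set.seed(1)
x <- rnorm(500, mean = 69.3, sd = 3.6)

# One-inch bins centered on whole inches, wide enough to cover all the data
breaks <- seq(floor(min(x)) - 0.5, ceiling(max(x)) + 0.5, by = 1)

# plot = FALSE returns the bin counts without drawing; drop it to see the plot
h <- hist(x, breaks = breaks, plot = FALSE)

h$counts                     # number of values in each interval
sum(h$counts) == length(x)   # every value falls in exactly one bin
```

By default `hist()` uses intervals that are open on the left and closed on the right, matching the \((a, b]\) notation above.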

Smoothed density

  • Smooth density plots relay the same information as a histogram but are aesthetically more appealing:

Smoothed density

  • In this plot, we no longer have sharp edges at the interval boundaries and many of the local peaks have been removed.

  • The scale of the y-axis changed from counts to density. The values shown on the y-axis are chosen so that the area under the curve adds up to 1.

  • To fully understand smooth densities, we have to understand estimates, a concept we cover later in the course.
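
The claim about the area under the curve can be checked numerically. This is a rough sketch on simulated data (not the class data), using base R's `density()` with its default settings; the Riemann sum is only an approximation, so the result is close to 1 rather than exactly 1.

```r
# Simulated stand-in for the male heights (NOT the class data)
set.seed(1)
x <- rnorm(500, mean = 69.3, sd = 3.6)

# Smooth density estimate on an equally spaced grid (default settings)
d <- density(x)

# Approximate the area under the curve with a Riemann sum over the grid
area <- sum(d$y) * diff(d$x[1:2])
area
```

Calling `plot(d)` draws the smooth density.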

Smoothed density

  • Here is an example comparing male and female heights:

The normal distribution

  • Histograms and density plots provide excellent summaries of a distribution.

  • But can we summarize even further?

  • We often see the average and standard deviation used as summary statistics.

  • To understand what these summaries are and why they are so widely used, we need to understand the normal distribution.

The normal distribution

  • The normal distribution is also known as the bell curve and as the Gaussian distribution.
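
For reference, the standard textbook definition gives the proportion of values in an interval \((a, b]\) for a normal distribution with average \(m\) and standard deviation \(s\). The symbols \(m\) and \(s\) are chosen here to match the R code used later in these slides:

\[
\Pr(a < x \le b) = \int_a^b \frac{1}{\sqrt{2\pi}\, s} \, e^{-\frac{1}{2}\left(\frac{x - m}{s}\right)^2} \, dx
\]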

The normal distribution

  • Many datasets can be approximated with normal distributions.

  • These include gambling winnings, heights, weights, blood pressure, standardized test scores, and experimental measurement errors.

  • But how can the same distribution approximate datasets with completely different ranges for values?

The normal distribution

  • The normal distribution can be adapted to different datasets by just adjusting two numbers, referred to as the average or mean and the standard deviation (SD).

  • Because we only need two numbers to adapt the normal distribution to a dataset, if our data distribution is approximated by a normal distribution, all the information needed to describe the distribution can be encoded by just two numbers.

  • A normal distribution with average 0 and SD 1 is referred to as a standard normal.
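
A short sketch of what "adapting with two numbers" means in practice: any question about a normal distribution with average `m` and SD `s` can be answered from the standard normal after converting to standard units. The values of `m` and `s` are the ones quoted in these slides for male heights; `a` and `z` are hypothetical names.

```r
m <- 69.3   # average quoted on the slides for male heights
s <- 3.6    # SD quoted on the slides

# Proportion of a normal distribution within 2 SDs of its average
p_scaled <- pnorm(m + 2 * s, mean = m, sd = s) - pnorm(m - 2 * s, mean = m, sd = s)

# The same proportion computed from the standard normal
p_standard <- pnorm(2) - pnorm(-2)

# Converting a value to standard units gives the same answer either way
a <- 72
z <- (a - m) / s
c(pnorm(a, mean = m, sd = s), pnorm(z))
```

Both `p_scaled` and `p_standard` are about 0.954, which is where the rule of thumb "about 95% of values are within 2 SDs of the average" comes from.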

The normal distribution

  • For a list of numbers contained in a vector x:
index <- heights$sex == "Male" 
x <- heights$height[index] 
  • the average is defined as:
m <- sum(x) / length(x) # or mean(x)
  • and the SD is defined as:
s <- sqrt(sum((x - m)^2) / length(x)) # or sd(x)

Warning

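sd(x) computes the sample standard deviation, which is not exactly the same as the standard deviation defined above.

For reasons explained later, sd divides by length(x)-1 rather than length(x). Multiplying sd(x)^2 by (n - 1)/n recovers the version that divides by length(x); the difference below is zero up to rounding error: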

n <- length(x)
sd(x)^2*(n - 1)/n - sum((x - mean(x))^2)/n
[1] -3.55e-15

The normal distribution

  • Here is a plot of our male student height smooth density (blue) and the normal distribution (black) with mean = 69.3 and SD = 3.6:

Boxplots

  • Suppose we want to summarize the murder rate distribution.

Boxplots

  • In this case, the histogram above or a smooth density plot would serve as a relatively succinct summary.

  • But what if we want a more compact numerical summary?

  • The average and SD alone will not suffice here because the data are not approximately normal.

Boxplots

  • The boxplot provides a five-number summary composed of the range (the minimum and maximum) along with the quartiles (the 25th, 50th, and 75th percentiles).

  • The R implementation of boxplots ignores outliers when computing the range and instead plots these as independent points.

  • The help file provides a specific definition of outliers.
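
The five-number summary and the outlier rule can be sketched with base R. The numbers below are toy values invented for illustration (they are not the murder rates). By default, R flags points that lie more than 1.5 times the length of the box beyond the box, and `boxplot.stats()` returns both the summary and the flagged points.

```r
# Toy values with one obvious outlier, invented for illustration
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 15)

# Minimum, quartiles, and maximum of the full data
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# What the boxplot draws: whisker ends, box edges, and median,
# computed after setting aside points flagged as outliers
b <- boxplot.stats(x)
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points plotted individually
```

Here 15 is set aside as an outlier, so the upper whisker ends at 5 rather than at the maximum. The box edges are hinges, which can differ slightly from the quartiles that `quantile()` reports.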

Boxplots

The boxplot summarizes the data with a box with whiskers:

From just this simple plot, we know that:

  • the median is about 2.5,
  • the distribution is not symmetric, and
  • the range is 0 to 5 for the great majority of states, with two exceptions.

Boxplots

  • In data analysis we often divide observations into groups based on the values of one or more variables associated with those observations.

  • We call this procedure stratification and refer to the resulting groups as strata.

  • Stratification is common in data visualization because we are often interested in how the distributions of variables differ across different subgroups.

  • Stratifying and then making boxplots is a common approach to visualizing these differences.
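
Stratification can be sketched in base R with `split()`. The data frame below is a toy example invented for illustration (it is not the class survey), and `dat`, `strata`, and `medians` are hypothetical names.

```r
# Toy data, invented for illustration (not the class survey)
dat <- data.frame(
  sex    = c("Female", "Female", "Female", "Male", "Male", "Male", "Male"),
  height = c(62, 65, 66, 68, 70, 71, 75)
)

# split() forms the strata: one vector of heights per value of sex
strata <- split(dat$height, dat$sex)

# Summarize each stratum separately
medians <- sapply(strata, median)
medians

# Five-number summary per stratum, one column per group
sapply(strata, fivenum)
```

Calling `boxplot(height ~ sex, data = dat)` draws one boxplot per stratum side by side.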

Case study continued

Here are the heights for men and women:

Case study continued

  • The plot immediately reveals that males are, on average, taller than females.

  • However, exploratory plots reveal that the normal approximation is not as useful for the female heights:

Case study continued

  • A likely explanation for the second bump is that female was the default in the reporting tool.

  • The unexpected five smallest values are likely cases of 5'x'' reported as 5x:

[1] 51 53 55 52 52
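
If that guess is right, each value can be reinterpreted as feet and inches. The sketch below only illustrates the arithmetic under that assumption; it is not a correction applied in the original analysis, and `reported`, `feet`, `inches`, and `corrected` are hypothetical names.

```r
# The five smallest reported values shown above
reported <- c(51, 53, 55, 52, 52)

# Assumption: the first digit is feet and the second digit is inches
feet   <- reported %/% 10
inches <- reported %% 10

# Convert to total inches
corrected <- feet * 12 + inches
corrected
# [1] 61 63 65 62 62
```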