Analyzing data is a unique form of art. It combines discovering patterns with explaining them clearly to the uninitiated. Among other things, a brilliant storyteller of data can trace the origin of a pandemic, locate a wanted fugitive and determine which items in a massive warehouse need to be replenished without taking a physical inventory. Despite these accomplishments, some data analysts provide misleading or useless information. Statisticians call this noise, but in layman’s terms, it’s gibberish. The purpose of this article is to provide you with a simple example (summary statistics), highlight some common mistakes that arise from this approach, then provide some solutions. My goal is to improve your abilities as a storyteller of data.
Summary Statistics
Summary statistics provide the audience with simple yet important metrics regarding a sample or population. These metrics include the mean (average), median (midpoint or 50th percentile), minimum and maximum values, standard deviation (typical dispersion of values around the average), and frequency of values. The most important of these is the mean, so I’ll clarify the concept of average. For a particular variable, you add up the values of all participants, then distribute that total uniformly among those participants. Let’s consider the Income variable, reflected in Table 1:
Table 1 (Income ($))
Income ($)
30,000
40,000
65,000
51,000
5,000,000,000
Applying the earlier definition produces Table 2, which shows the total of $5,000,186,000 distributed uniformly among the five participants, giving a mean of $1,000,037,200 each.
Table 2 (Calculating the Mean of Income ($))
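The two-step definition of the mean (add everything up, then share it out evenly) can be sketched in a few lines of Python. This is only an illustration using the five incomes from Table 1; the variable names are my own.

```python
# Income values copied from Table 1.
incomes = [30_000, 40_000, 65_000, 51_000, 5_000_000_000]

# Step 1: add up every participant's value.
total = sum(incomes)

# Step 2: distribute that total uniformly among the participants.
mean_income = total / len(incomes)

print(f"Total income: ${total:,}")
print(f"Mean income:  ${mean_income:,.0f}")
```

Four of the five households earn far less than the resulting mean, which is the first hint that the average alone does not describe this neighborhood well.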
In addition to Income ($), let’s include two other variables, ID and Color in Table 3:
Table 3 (Random Neighborhood)
Table 3 is meant to reflect a neighborhood consisting of 5 homes, with corresponding incomes and exterior colors. ID is an identifier variable; it’s used to distinguish one row (record) from another because records can have the same values yet be distinct (e.g. two men can both be 25 years old and weigh 150 pounds). For the Color variable, 1 = “Red”, 2 = “Blue”, 3 = “Green” and 4 = “Yellow”. Categorical variables are coded numerically because rigorous statistical analysis (e.g. correlation, regression) cannot be performed on non-numeric variables. To clarify, categorical means that the values are mutually exclusive labels (i.e. none of the colors overlap with each other), so the numeric codes are only names and carry no order or magnitude.
In addition to the mean, four other summary statistics are shown in Table 4 for the Income variable:
Table 4 (Summary Statistics for Income of Random Neighborhood)
Since the listed Income values are unique (i.e. don’t repeat), the five values each have a frequency of 1.
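As a rough sketch, the Income statistics discussed here can be reproduced with Python’s standard library. One assumption to flag: I use the population standard deviation (`statistics.pstdev`, which divides by n) because it is the form that comes out at approximately the 1,999,981,400 quoted in this article; the sample form (`statistics.stdev`) would give a larger number.

```python
import statistics
from collections import Counter

# Income values copied from Table 1.
incomes = [30_000, 40_000, 65_000, 51_000, 5_000_000_000]

summary = {
    "mean": statistics.mean(incomes),
    "median": statistics.median(incomes),
    "minimum": min(incomes),
    "maximum": max(incomes),
    # Assumption: population standard deviation (divides by n, not n - 1).
    "std_dev": statistics.pstdev(incomes),
}

# Frequency of values: how many times each income appears.
frequency = Counter(incomes)

for name, value in summary.items():
    print(f"{name:>8}: {value:,.0f}")
print("Every income appears once:", all(n == 1 for n in frequency.values()))
```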
Analyzing Table 4, we see a huge disparity between the mean (1,000,037,200) and the median (51,000). Conceptually, if extreme values (outliers) were not present, the mean and median would be much closer, if not the same. Two further signs point to the presence of outliers:
1) the disparity between the minimum (30,000) and maximum (5,000,000,000), and
2) the standard deviation (1,999,981,400) is larger than the mean (1,000,037,200).
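The warning signs above can be turned into quick checks. The sketch below is a set of informal heuristics, not a formal outlier test, and the cut-offs (10x and 1,000x) are arbitrary values I picked for illustration.

```python
import statistics

# Income values copied from Table 1.
incomes = [30_000, 40_000, 65_000, 51_000, 5_000_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
std_dev = statistics.pstdev(incomes)  # population form, as an assumption

# Each flag mirrors one of the signs discussed in the text.
# The multipliers are illustrative thresholds, not established rules.
signals = {
    "mean far above median": mean_income > 10 * median_income,
    "maximum far above minimum": max(incomes) > 1_000 * min(incomes),
    "standard deviation exceeds mean": std_dev > mean_income,
}

for description, flagged in signals.items():
    print(f"{description}: {flagged}")
```

All three flags come back true for this neighborhood, which is a prompt to inspect the largest value rather than proof that it is wrong.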
Before we can conclude that the $5B value is a genuine outlier, we need to determine whether it was coded in error (e.g. should it be $500K instead of $5B?). For example, if income values originated from surveys, we would check those documents to verify that the income values match those in the database. Assuming that the $5B amount is legitimate, I would use my summary statistics analysis as a launch pad for more in-depth analysis, answering questions about income inequality as well as favorable zoning and tax abatement policies.
Common Mistakes
In Table 4, you’ll notice that I didn’t produce metrics for the ID and Color variables. Based on my earlier description of the mean as well as its relationship to the median, minimum, maximum and standard deviation, summary statistics on identifiers and categorical variables offer no valuable insight. To prove my point, Table 5 shows the summary statistics for ID and Color:
Table 5 (Summary Statistics of ID and Color)
The conclusions from Table 5 would be nonsensical, to say the least. For ID, the mean is 125, which suggests this value be distributed uniformly among all 5 participants. This makes no sense because, as stated earlier, all ID values are distinct. It also shows that reporting minimum and maximum values provides no benefit, because there is no inherent limit to identifier values in either direction. For Color, the mean is roughly 2, which translates to “Blue” being the average of mixing “Red”, “Blue”, “Green” and “Yellow” together then dispersing that combination among five participants. Despite the ridiculous nature of this approach, some analysts actually do this and believe summary statistics on identifiers and categorical variables offer insight. Hint… it doesn’t!
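To see how a program will happily produce this nonsense, here is a small sketch using the color codes. The codebook comes from the article; the list of codes is an assumption that only preserves the counts described for Table 3 (two Red homes and one each of Blue, Green and Yellow), since the row order does not matter for the mean.

```python
import statistics

# Codebook from the article: numeric codes are just names for colors.
color_labels = {1: "Red", 2: "Blue", 3: "Green", 4: "Yellow"}

# Assumed codes for the five homes: two Red, one Blue, one Green, one Yellow.
color_codes = [1, 1, 2, 3, 4]

# The arithmetic works, but the result has no meaning for a categorical variable.
mean_code = statistics.mean(color_codes)
nearest_label = color_labels[round(mean_code)]

print(f"Mean color code: {mean_code}")
print(f"Nearest label:   {nearest_label}")
```

The software raises no error here, so it is up to the analyst to recognize that an "average color" says nothing about the neighborhood.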
Solutions
Although calculating the mean, median, minimum, maximum and standard deviation of identifiers and categorical variables offers no viable information, applying frequency of values DOES provide insight. It identifies duplicate values and reveals the popularity of values based on the number of occurrences.
1) Identifying duplicate values
For whatever reason, there might be unnecessary replication within the database. Running a frequency of values on an identifier allows the user to spot these duplicate records, because any ID with a count above 1 appears more than once.
2) Popularity based on number of occurrences
Table 3 shows that there are two instances of Red with one instance each for the remaining three colors. As a home developer, I might use this insight to build more red-colored houses.
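Both uses of frequency can be sketched with `collections.Counter` from Python’s standard library. The ID list below is hypothetical (it is not the ID column from Table 3) and includes one deliberate repeat; the color list only preserves the counts described above.

```python
from collections import Counter

# 1) Identifying duplicate values.
# Hypothetical ID column with one accidental repeat, for illustration only.
ids = [101, 102, 103, 102, 104]
id_counts = Counter(ids)
duplicate_ids = {value: n for value, n in id_counts.items() if n > 1}
print("Duplicate IDs:", duplicate_ids)

# 2) Popularity based on number of occurrences.
# Assumed from the article: two Red homes, one each of the other colors.
colors = ["Red", "Red", "Blue", "Green", "Yellow"]
color_counts = Counter(colors)
most_common_color, occurrences = color_counts.most_common(1)[0]
print(f"Most popular color: {most_common_color} ({occurrences} homes)")
```

Counting works on the labels themselves, so there is no need to pretend the codes are quantities.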
If performed properly, data analysis can help explain complex trends to the unfamiliar. I hope my walkthrough of summary statistics revealed silly mistakes and solutions for resolving those errors. By avoiding these pitfalls, you’ll craft a meaningful story for your audience.