Percentiles and Outliers

Definitions 1

Given quantitative data,

  1. A value that is greater than $p \%$ of the data and less than the remaining $(100-p) \%$ is called the $p$-percentile.
  2. The $100$-percentile and the $0$-percentile, that is, the largest and smallest values in the data, are called the maximum and the minimum, respectively.
    • The difference between the maximum and minimum values is called the data’s range $R$.
  3. The $25$-percentile is called the first quartile $Q_{1}$, and the $75$-percentile is called the third quartile $Q_{3}$.
    • $\left( Q_{3} - Q_{1} \right)$ is called the interquartile range $\text{IQR}$.
  4. The minimum, first quartile, median, third quartile, and maximum are together called the five-number summary. $$ \min \qquad Q_{1} \qquad \text{median} \qquad Q_{3} \qquad \max $$
  5. Empirically, an observation that falls outside the following interval is called an outlier. $$ \left[ Q_{1} - 1.5 \text{IQR} , Q_{3} + 1.5 \text{IQR} \right] $$ The lower limit is called the lower fence, and the upper limit is called the upper fence.
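
The definitions above can be turned into a small calculation. The following is a minimal sketch rather than a reference implementation: the function names `five_number_summary`, `fences`, and `outliers` and the sample values are made up for illustration. The quartiles come from Python's `statistics.quantiles` with its default `exclusive` method, which is only one of several conventions in use, so other software may return slightly different quartiles for the same data.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers."""
    # n=4 splits the data into quartiles and returns the three cut points.
    # The default method="exclusive" is an assumption, not the only convention.
    q1, median, q3 = quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)


def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(data):
    """Return the observations that fall outside the fences."""
    _, q1, _, q3, _ = five_number_summary(data)
    lower, upper = fences(q1, q3)
    return [x for x in data if x < lower or x > upper]


sample = [1, 3, 4, 5, 6, 8, 30]  # made-up values for illustration

summary = five_number_summary(sample)
print(summary)                        # (1, 3.0, 5.0, 8.0, 30)
print(summary[4] - summary[0])        # range R = 29
print(fences(summary[1], summary[3])) # (-4.5, 15.5)
print(outliers(sample))               # [30]
```

Here $\text{IQR} = 8 - 3 = 5$, so the fences are $3 - 7.5 = -4.5$ and $8 + 7.5 = 15.5$, and only the value $30$ lies outside them.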

Explanation

Second Quartile

The $50$-percentile, also known as the second quartile, is simply the median, so it does not need a separate definition in the five-number summary. Given enough data, the five-number summary gives a rough picture of how the data is distributed, so it is worth checking first, whatever data is being examined.
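
As a quick check that the $50$-percentile coincides with the median, the sketch below compares the two on made-up data. It assumes Python's `statistics.quantiles` with its default method, and `percentile` is a hypothetical helper written for this example, not a library function.

```python
from statistics import median, quantiles


def percentile(data, p):
    """Return the p-percentile for an integer p in 1..99 (hypothetical helper)."""
    # quantiles(data, n=100) returns the 99 cut points between 100 equal groups;
    # the p-percentile is the p-th of those cut points.
    return quantiles(data, n=100)[p - 1]


odd_sample = [1, 3, 4, 5, 6, 8, 30]  # made-up values
even_sample = [1, 2, 3, 4]           # made-up values

for sample in (odd_sample, even_sample):
    q1, q2, q3 = quantiles(sample, n=4)
    # The 50-percentile, the second quartile, and the median all agree here.
    print(percentile(sample, 50) == median(sample) == q2)
    # Likewise the 25- and 75-percentiles are the first and third quartiles.
    print(percentile(sample, 25) == q1, percentile(sample, 75) == q3)
```

On these two samples every comparison prints `True`; with other data, floating-point rounding could in principle make an exact `==` comparison fail, so a tolerance may be safer in practice.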

Outlier

An outlier is literally something that lies outside: an observation that falls outside the common range of the data. The lower fence $Q_{1} - 1.5 \text{IQR}$ is already a rather small value and the upper fence $Q_{3} + 1.5 \text{IQR}$ is already a rather large value, so an observation beyond either fence is called an outlier because it falls outside the expected range. Note that this is not a mathematically rigorous definition, as the words 'empirically' and 'common range' suggest.


  1. Mendenhall. (2012). Introduction to Probability and Statistics (13th Edition): pp. 60, 76, 78-80.