
What is a Box Plot? 📂Data Science


Definition

A box plot is a diagram that summarizes data by its median, first quartile, third quartile, maximum, and minimum.

Explanation

  • The third quartile, median, and first quartile are denoted as $Q3$, $Q2$, and $Q1$, respectively.
  • The difference between $Q3$ and $Q1$, $\text{IQR} = Q3 - Q1$, is called the interquartile range (IQR).
  • The maximum and minimum values are denoted as $Q4$ and $Q0$, respectively.

The rectangle in the middle is called the box, and the lines extending above and below it are called whiskers; hence the diagram is also called a box-and-whisker plot. The whiskers do not always end at the maximum and minimum of the data. To make outliers visible, with the data denoted by $x$, the whisker ends are often defined instead as $$ \begin{align*} Q4 &= \max( x \le \min(Q3 + 1.5\text{IQR}, \max(x)) ) \\ Q0 &= \min( x \ge \max(Q1 - 1.5\text{IQR}, \min(x)) ) \end{align*} $$ where $\max(x \le c)$ denotes the largest data value not exceeding $c$ and $\min(x \ge c)$ the smallest data value not less than $c$. Data values outside $[Q0, Q4]$ are then treated as outliers.

How to Draw

Suppose we are given the following data. $$ x = \begin{bmatrix}9 & 8 & 7 & 7 & 7 & 6 & 6 & 5 & 3 & 2\end{bmatrix}^{T} $$ First, find $Q3$, $Q2$, and $Q1$ of $x$. $$ \begin{align*} Q3 &= \text{third quartile} = 7.0 \\ Q2 &= \text{median} = 6.5 \\ Q1 &= \text{first quartile} = 5.25 \end{align*} $$ Draw $Q3$, $Q2$, and $Q1$ as horizontal lines and connect them to form the box.
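The quartiles above can be reproduced with a short sketch using only Python's standard library. Quartile conventions differ between tools, so the choice of `method="inclusive"` (linear interpolation between sorted data points) is an assumption made here because it matches the values in the text; other conventions can give slightly different quartiles.

```python
from statistics import quantiles

# Data from the example above.
x = [9, 8, 7, 7, 7, 6, 6, 5, 3, 2]

# Assumption: method="inclusive" interpolates linearly between the sorted
# data points, treating min(x) and max(x) as the 0th and 100th percentiles.
q1, q2, q3 = quantiles(x, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
```

With this convention the result agrees with the text: $Q1 = 5.25$, $Q2 = 6.5$, $Q3 = 7.0$, and therefore $\text{IQR} = 1.75$.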

Find the maximum value $Q4$ and the minimum value $Q0$ after removing the outliers from $x$. Here $\text{IQR} = Q3 - Q1 = 7.0 - 5.25 = 1.75$, so $1.5\text{IQR} = 2.625$.

$$ \begin{align*} \min(Q3+1.5\text{IQR}, \max(x)) &= \min(9.625, 9.0) = 9.0 \\ \implies Q4 &= \max(x \le 9.0) = 9.0 \end{align*} $$

$$ \begin{align*} \max(Q1-1.5\text{IQR}, \min(x)) &= \max(2.625, 2.0) = 2.625 \\ \implies Q0 &= \min(x \ge 2.625) = 3.0 \end{align*} $$

Mark $Q4$ and $Q0$ as the ends of the whiskers, and plot the outliers, that is, the data values outside $[Q0, Q4]$ (here $2$), as dots.
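The whole procedure can be sketched as a small function. This is a minimal illustration rather than a reference implementation: the name `box_plot_summary` and the parameter `k` are made up for this example, and the quartiles again assume the linear-interpolation ("inclusive") convention.

```python
from statistics import quantiles


def box_plot_summary(x, k=1.5):
    """Return quartiles, whisker ends, and outliers under the k * IQR rule.

    Hypothetical helper for illustration; assumes "inclusive" quartiles.
    """
    q1, q2, q3 = quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1

    # Limits beyond which a value counts as an outlier.
    upper_limit = min(q3 + k * iqr, max(x))
    lower_limit = max(q1 - k * iqr, min(x))

    # Whisker ends: the most extreme data values inside the limits.
    q4 = max(v for v in x if v <= upper_limit)
    q0 = min(v for v in x if v >= lower_limit)

    outliers = sorted(v for v in x if v < q0 or v > q4)
    return {"Q0": q0, "Q1": q1, "Q2": q2, "Q3": q3, "Q4": q4,
            "outliers": outliers}


x = [9, 8, 7, 7, 7, 6, 6, 5, 3, 2]
result = box_plot_summary(x)
print(result)
```

For the example data this gives $Q0 = 3$, $Q4 = 9$, and the single outlier $2$, matching the steps above.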

Characteristics

As described above, the box covers the middle half of the data, from the first quartile (25th percentile) to the third quartile (75th percentile). A box plot does not provide as much detail as a histogram, but it resembles a histogram viewed from directly above. Its advantage therefore shows when drawing several data sets rather than a single one. To compare the distributions of four data sets with histograms, you would need four separate charts, and overlaying them for comparison would be difficult. With box plots, they can easily be compared on a single chart.
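To illustrate the comparison on a single chart, here is a rough text-only sketch that draws several horizontal box plots on one shared axis. Data set `a` is the example from this article; `b` and `c` are hypothetical values invented for the illustration, and the helper names `summary` and `render_row` are likewise assumptions. In practice a plotting library would be used instead.

```python
from statistics import quantiles


def summary(x, k=1.5):
    """Return (Q0, Q1, Q2, Q3, Q4), assuming "inclusive" quartiles."""
    q1, q2, q3 = quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    q4 = max(v for v in x if v <= q3 + k * iqr)
    q0 = min(v for v in x if v >= q1 - k * iqr)
    return q0, q1, q2, q3, q4


def render_row(x, lo, hi, width=41):
    """Draw one horizontal box plot as text on the shared axis [lo, hi]."""
    def col(v):
        # Map a data value to a character column.
        return round((v - lo) / (hi - lo) * (width - 1))

    q0, q1, q2, q3, q4 = summary(x)
    row = [" "] * width
    for c in range(col(q0), col(q4) + 1):
        row[c] = "-"                      # whiskers
    for c in range(col(q1), col(q3) + 1):
        row[c] = "="                      # box
    row[col(q2)] = "|"                    # median
    for v in x:
        if v < q0 or v > q4:
            row[col(v)] = "o"             # outliers
    return "".join(row)


datasets = {
    "a": [9, 8, 7, 7, 7, 6, 6, 5, 3, 2],    # data from the article
    "b": [4, 5, 5, 6, 6, 7, 8, 8, 9, 10],   # hypothetical
    "c": [1, 4, 4, 5, 5, 5, 6, 6, 7, 12],   # hypothetical
}

# All rows share the same axis so they can be compared directly.
lo = min(min(x) for x in datasets.values())
hi = max(max(x) for x in datasets.values())

rows = {name: render_row(x, lo, hi) for name, x in datasets.items()}
for name, row in rows.items():
    print(f"{name} {row}")
```

Because every row uses the same axis, the positions of the boxes, medians, and outliers can be compared at a glance, which is the point made above about box plots versus separate histograms.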