What is a Box Plot?

Definition

A box plot is a diagram that summarizes data by its median, first quartile, third quartile, maximum, and minimum.

Explanation

  • The third quartile, median, and first quartile are denoted $Q3$, $Q2$, and $Q1$, respectively.
  • The difference between $Q3$ and $Q1$ is called the interquartile range (IQR), that is, $\text{IQR} = Q3 - Q1$.
  • The maximum and minimum values are denoted $Q4$ and $Q0$, respectively.

The rectangle in the middle is called the box, and the lines extending above and below it are called whiskers, so the diagram is also called a box-and-whisker plot. The ends of the whiskers are not always the raw maximum and minimum. To account for outliers, for data $x$ they are often replaced by

$$
\begin{align*} Q4 &= \max \left( x \le \min(Q3 + 1.5\text{IQR}, \max(x)) \right) \\ Q0 &= \min \left( x \ge \max(Q1 - 1.5\text{IQR}, \min(x)) \right) \end{align*}
$$

where $\max(x \le c)$ denotes the largest data value not exceeding $c$, and $\min(x \ge c)$ the smallest data value not below $c$.

How to Draw

Suppose we are given the following data.

$$
x = \begin{bmatrix}9 & 8 & 7 & 7 & 7 & 6 & 6 & 5 & 3 & 2\end{bmatrix}^{T}
$$

Find $Q3$, $Q2$, and $Q1$ of $x$. The values here come from linear interpolation between the sorted data points; other quartile conventions can give slightly different numbers.

$$
\begin{align*} Q3 &= \text{third quartile} = 7.0 \\ Q2 &= \text{median} = 6.5 \\ Q1 &= \text{first quartile} = 5.25 \end{align*}
$$

Mark $Q3$, $Q2$, and $Q1$ as horizontal lines and draw the box.
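The quartiles above can be checked with a short sketch using only the Python standard library. It assumes that the `inclusive` method of `statistics.quantiles` (linear interpolation that treats the data as containing its own minimum and maximum) is the convention behind the values in the text; the default `exclusive` method may return different numbers.

```python
import statistics

x = [9, 8, 7, 7, 7, 6, 6, 5, 3, 2]

# n=4 yields the three cut points Q1, Q2, Q3.
# method="inclusive" is an assumption chosen to match the worked values.
q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 5.25 6.5 7.0 1.75
```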

Find the maximum $Q4$ and the minimum $Q0$ after removing the outliers from $x$, using $\text{IQR} = Q3 - Q1 = 7.0 - 5.25 = 1.75$.

$$
\begin{align*} \min(Q3+1.5\text{IQR}, \max(x)) &= \min(9.625, 9.0) = 9.0 \\ \implies Q4 &= \max(x \le 9.0) = 9.0 \end{align*}
$$

$$
\begin{align*} \max(Q1-1.5\text{IQR}, \min(x)) &= \max(2.625, 2.0) = 2.625 \\ \implies Q0 &= \min(x \ge 2.625) = 3.0 \end{align*}
$$

Mark $Q4$ and $Q0$, and plot every outlier, that is, every value outside $[Q0, Q4]$ (here $2$), as a dot.
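The whisker and outlier step can be sketched in the same way. The function name `whiskers_and_outliers` and the parameter `k` (the multiplier 1.5) are hypothetical names for this example, and the quartile method is again assumed to be `inclusive`.

```python
import statistics

def whiskers_and_outliers(data, k=1.5):
    """Return (Q0, Q4, outliers) following the formulas in the text."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Fences: never reach beyond the actual extremes of the data.
    upper_fence = min(q3 + k * iqr, max(data))
    lower_fence = max(q1 - k * iqr, min(data))
    # Whisker ends are the most extreme data values inside the fences.
    q4 = max(v for v in data if v <= upper_fence)
    q0 = min(v for v in data if v >= lower_fence)
    outliers = [v for v in data if v < q0 or v > q4]
    return q0, q4, outliers

x = [9, 8, 7, 7, 7, 6, 6, 5, 3, 2]
print(whiskers_and_outliers(x))  # (3, 9, [2])
```

With these assumptions the result agrees with the hand calculation: the lower whisker ends at 3, the upper whisker at 9, and 2 is drawn as a dot.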

Characteristics

The box covers the middle half of the data, from the first quartile (the 25th percentile) to the third quartile (the 75th percentile). A box plot carries less detail than a histogram, but it reads like a histogram viewed from directly above. Its advantage therefore shows when several data sets are drawn together rather than one alone. Comparing the distributions of four data sets with histograms takes four separate charts, and overlaying them makes them hard to read. With box plots, all four can be placed side by side and compared on a single chart.
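As a rough illustration of comparing several data sets on one shared axis, the sketch below renders text-only box plots with the Python standard library. The function names `box_stats` and `text_boxplot`, the glyphs, and the sample data sets `b` and `c` are hypothetical choices for this example; only `a` is the data from the worked example, and the quartile method is assumed to be `inclusive`.

```python
import statistics

def box_stats(data, k=1.5):
    """Return (Q0, Q1, Q2, Q3, Q4) with whiskers limited to k * IQR."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    q4 = max(v for v in data if v <= q3 + k * iqr)
    q0 = min(v for v in data if v >= q1 - k * iqr)
    return q0, q1, q2, q3, q4

def text_boxplot(data, lo, hi, width=40):
    """Render one horizontal box plot on the shared axis [lo, hi] (hi > lo)."""
    def col(v):
        # Map a value to a character column on the shared axis.
        return round((v - lo) / (hi - lo) * (width - 1))

    q0, q1, q2, q3, q4 = box_stats(data)
    row = [" "] * width
    for c in range(col(q0), col(q4) + 1):
        row[c] = "-"              # whiskers
    for c in range(col(q1), col(q3) + 1):
        row[c] = "="              # box: the middle half of the data
    row[col(q2)] = "|"            # median
    for v in data:
        if v < q0 or v > q4:
            row[col(v)] = "o"     # outliers
    return "".join(row)

# "a" is the article's data; "b" and "c" are made-up samples.
samples = {
    "a": [9, 8, 7, 7, 7, 6, 6, 5, 3, 2],
    "b": [1, 4, 4, 5, 5, 5, 6, 6, 7, 12],
    "c": [3, 3, 4, 4, 5, 8, 9, 10, 10, 11],
}
lo = min(min(v) for v in samples.values())
hi = max(max(v) for v in samples.values())

for name, data in samples.items():
    print(f"{name} {text_boxplot(data, lo, hi)}")
```

Because every row uses the same `lo` and `hi`, the boxes line up on one scale, which is what makes the side-by-side comparison possible. A plotting library would normally draw this, but the idea is the same.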