
Dimensionality Reduction in Data Science

Definition

Let’s assume a data set $X \subset \mathbb{R}^{n}$ is given. For $m < n$, the following mapping is called a dimension reduction.

$$ r : X \to \mathbb{R}^{m} $$

More commonly in machine learning, any method that reduces the number of input variables while retaining as much model performance as possible is called a dimension reduction technique.
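As a minimal sketch of such a map $r : X \to \mathbb{R}^{m}$, the following uses PCA implemented with a plain SVD; the data, the dimensions $n = 5$ and $m = 2$, and the function name `r` are all illustrative choices, not part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # a data set X ⊂ R^5, so n = 5

def r(X, m):
    """One possible dimension reduction r : X -> R^m,
    projecting onto the top-m principal directions (PCA)."""
    Xc = X - X.mean(axis=0)                            # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = principal directions
    return Xc @ Vt[:m].T                               # coordinates in the top-m subspace

Z = r(X, 2)
print(Z.shape)  # each point is now mapped into R^2
```

Any other rule that sends points of $X$ into $\mathbb{R}^{m}$, such as simply dropping coordinates, would equally satisfy the definition.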

Explanation

Dimension reduction, as the name suggests, means reducing the dimensionality of vectors. It is often used to make data easier and more intuitive to understand. How the reduction is carried out varies by algorithm: it may delete certain components outright, or construct new, lower-dimensional data from the existing data according to predefined rules. The sections below describe the main purposes of dimension reduction.

Purpose

Visualization

It is practically impossible to visualize data with four or more dimensions efficiently. Even with three-dimensional data, visualization can be difficult depending on its form. Difficulty in visualization means it is hard to draw pictures that represent the features of the data well. Three-dimensional data may look completely different depending on the viewpoint, and in such cases reducing the dimensionality before drawing can make the features of the data easier to grasp. The picture below shows an example where the same data looks significantly different depending on the viewing direction: the right picture is a projection of the left data onto the $xy$-plane.
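The projection onto the $xy$-plane mentioned above amounts to simply discarding the $z$-coordinate. A minimal sketch, with made-up sample points:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.normal(size=(50, 3))   # 50 three-dimensional points (x, y, z)

# Projection onto the xy-plane: keep only the first two coordinates.
P_xy = P[:, :2]
print(P_xy.shape)              # the points now live in R^2 and can be plotted flat
```

This is the crudest possible dimension reduction; methods like PCA instead choose the projection plane that preserves as much of the data's spread as possible.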

The Iris dataset, which is four-dimensional, is often visualized in data science textbooks by splitting it into several two-dimensional scatter plots like the following.

Focus and Selection

Dimension reduction can be used to discard less important information in order to focus on the important information. “Less important information” here refers to noise or redundant information. For example, in the left table below, the first column has the same value for all rows, and the second and third columns have different values but carry essentially the same information. Dimension reduction can therefore be done by discarding the first column and either the second or third column. The right table summarizes weather information for Daegu. At first glance there seems to be no unnecessary information, but since “daily temperature range = maximum temperature − minimum temperature,” these three columns are not linearly independent, which can actually cause errors in regression analysis. In this case, deleting the fourth column to remove the multicollinearity is an example of dimension reduction.

| School | Grade | Group | Name |
| --- | --- | --- | --- |
| Hive High School | 3rd grade | fromis_9 | LEE NA GYUNG |
| Hive High School | 3rd grade | fromis_9 | BAEK JI HEON |
| Hive High School | 2nd grade | LE SSERAFIM | KIM CHAEWON |
| Hive High School | 2nd grade | LE SSERAFIM | HUH YUNJIN |
| Hive High School | 1st grade | NewJeans | HAERIN |
| Hive High School | 1st grade | NewJeans | MINJI |

| Date | High Temp | Low Temp | Daily Temp Range | Precipitation Probability |
| --- | --- | --- | --- | --- |
| 19th | 32º | 24º | 8º | 60% |
| 20th | 33º | 22º | 11º | 0% |
| 21st | 32º | 23º | 9º | 30% |
| 22nd | 30º | 21º | 9º | 60% |
| 23rd | 31º | 24º | 7º | 60% |
| 24th | 33º | 25º | 8º | 60% |
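The multicollinearity in the weather table can be checked numerically: since the range column equals high minus low, the three temperature columns have rank 2, not 3. A sketch with the table's values (the variable names are illustrative):

```python
import numpy as np

# Toy version of the weather table above.
high = np.array([32, 33, 32, 30, 31, 33])
low  = np.array([24, 22, 23, 21, 24, 25])
rng_ = high - low                        # "daily temp range" is exactly high - low
prob = np.array([60, 0, 30, 60, 60, 60])

X = np.column_stack([high, low, rng_, prob])

# The first three columns are linearly dependent: rank 2 instead of 3.
print(np.linalg.matrix_rank(X[:, :3]))

# Dimension reduction: drop the redundant range column (index 2).
X_reduced = np.delete(X, 2, axis=1)
print(X_reduced.shape)
```

After deletion no information is lost, since the dropped column can always be recomputed from the other two.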

Lightweighting

Reducing the dimensions of data means there are fewer numbers to store, thereby reducing the data’s storage size. In the case of artificial neural networks, MLPs consist of linear layers where the dimension of input data influences the number of model parameters. Dimension reduction can be used here to reduce the number of model parameters. Even in models like CNNs, where the input data’s dimensions do not affect the number of model parameters, reducing dimensions can still offer computational speed advantages.
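The effect of input dimension on a linear layer's size is easy to quantify: a layer mapping `in_dim` inputs to `out_dim` outputs has `in_dim * out_dim` weights plus `out_dim` biases. The specific dimensions below (784, 64, 128) are illustrative, not from the text:

```python
def linear_params(in_dim, out_dim):
    """Parameter count of one fully connected layer: weights + biases."""
    return in_dim * out_dim + out_dim

hidden = 128
print(linear_params(784, hidden))  # 784-dim input: 100480 parameters
print(linear_params(64, hidden))   # reduced 64-dim input: 8320 parameters
```

Reducing the input from 784 to 64 dimensions shrinks this first layer by more than a factor of ten, which illustrates the lightweighting benefit described above.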

Preventing Overfitting

Appropriate dimension reduction is known to be able to prevent overfitting to some extent.