
Dimensionality Reduction in Data Science

Definition

Let’s assume a data set $X \subset \mathbb{R}^{n}$ is given. For $m < n$, the following mapping is called a dimension reduction.

$$ r : X \to \mathbb{R}^{m} $$

More commonly in machine learning, any method that reduces the number of input variables while retaining as much model performance as possible is called a dimension reduction technique.
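As a minimal sketch of such a map $r : X \to \mathbb{R}^{m}$, the following uses PCA implemented with a plain SVD; the data, the dimensions $n = 5$ and $m = 2$, and the function name `r` are all illustrative choices, not part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # a data set X ⊂ R^5, so n = 5

def r(X, m):
    """One possible dimension reduction r : X -> R^m,
    projecting onto the top-m principal directions (PCA)."""
    Xc = X - X.mean(axis=0)                            # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = principal directions
    return Xc @ Vt[:m].T                               # coordinates in the top-m subspace

Z = r(X, 2)
print(Z.shape)  # each point is now mapped into R^2
```

Any other rule that sends points of $X$ into $\mathbb{R}^{m}$, such as simply dropping coordinates, would equally satisfy the definition.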

Explanation

Dimension reduction, as the name suggests, means reducing the dimensionality of vectors. It is often used to make data easier and more intuitive to understand. How the reduction is carried out varies by algorithm: it may delete certain components outright, or construct new, lower-dimensional data from the existing data according to predefined rules. The sections below describe the main purposes of dimension reduction.

Purpose

Visualization

It is practically impossible to visualize data with four or more dimensions efficiently. Even with three-dimensional data, visualization can be difficult depending on its form. Difficulty in visualization means it is hard to draw pictures that represent the features of the data well. Three-dimensional data may look completely different depending on the viewpoint, and in such cases reducing the dimensionality before drawing can make the features of the data easier to grasp. The picture below shows an example where the same data looks significantly different depending on the viewing direction: the right picture is a projection of the left data onto the $xy$-plane.
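The projection onto the $xy$-plane mentioned above amounts to simply discarding the $z$-coordinate. A minimal sketch, with made-up sample points:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.normal(size=(50, 3))   # 50 three-dimensional points (x, y, z)

# Projection onto the xy-plane: keep only the first two coordinates.
P_xy = P[:, :2]
print(P_xy.shape)              # the points now live in R^2 and can be plotted flat
```

This is the crudest possible dimension reduction; methods like PCA instead choose the projection plane that preserves as much of the data's spread as possible.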

The Iris dataset, which is four-dimensional, is often visualized in data science textbooks by splitting it into several two-dimensional scatter plots like the following.

Focus and Selection

Dimension reduction can be used to discard less important information in order to focus on the important information. “Less important information” here refers to noise or redundant information. For example, in the left table below, the first column has the same value for all rows, and the second and third columns have different values but carry essentially the same information. Dimension reduction can therefore be done by discarding the first column and either the second or third column. The right table summarizes weather information for Daegu. At first glance there seems to be no unnecessary information, but since “daily temperature range = maximum temperature − minimum temperature,” these three columns are not linearly independent, which can actually cause errors in regression analysis. In this case, deleting the fourth column to remove the multicollinearity is an example of dimension reduction.

| School | Grade | Group | Name |
| --- | --- | --- | --- |
| Hive High School | 3rd grade | fromis_9 | LEE NA GYUNG |
| Hive High School | 3rd grade | fromis_9 | BAEK JI HEON |
| Hive High School | 2nd grade | LE SSERAFIM | KIM CHAEWON |
| Hive High School | 2nd grade | LE SSERAFIM | HUH YUNJIN |
| Hive High School | 1st grade | NewJeans | HAERIN |
| Hive High School | 1st grade | NewJeans | MINJI |

| Date | High Temp | Low Temp | Daily Temp Range | Precipitation Probability |
| --- | --- | --- | --- | --- |
| 19th | 32º | 24º | 8º | 60% |
| 20th | 33º | 22º | 11º | 0% |
| 21st | 32º | 23º | 9º | 30% |
| 22nd | 30º | 21º | 9º | 60% |
| 23rd | 31º | 24º | 7º | 60% |
| 24th | 33º | 25º | 8º | 60% |
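The multicollinearity in the weather table can be checked numerically: since the range column equals high minus low, the three temperature columns have rank 2, not 3. A sketch with the table's values (the variable names are illustrative):

```python
import numpy as np

# Toy version of the weather table above.
high = np.array([32, 33, 32, 30, 31, 33])
low  = np.array([24, 22, 23, 21, 24, 25])
rng_ = high - low                        # "daily temp range" is exactly high - low
prob = np.array([60, 0, 30, 60, 60, 60])

X = np.column_stack([high, low, rng_, prob])

# The first three columns are linearly dependent: rank 2 instead of 3.
print(np.linalg.matrix_rank(X[:, :3]))

# Dimension reduction: drop the redundant range column (index 2).
X_reduced = np.delete(X, 2, axis=1)
print(X_reduced.shape)
```

After deletion no information is lost, since the dropped column can always be recomputed from the other two.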

Lightweighting

Reducing the dimensions of data means there are fewer numbers to store, thereby reducing the data’s storage size. In the case of artificial neural networks, MLPs consist of linear layers where the dimension of input data influences the number of model parameters. Dimension reduction can be used here to reduce the number of model parameters. Even in models like CNNs, where the input data’s dimensions do not affect the number of model parameters, reducing dimensions can still offer computational speed advantages.
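The effect of input dimension on a linear layer's size is easy to quantify: a layer mapping `in_dim` inputs to `out_dim` outputs has `in_dim * out_dim` weights plus `out_dim` biases. The specific dimensions below (784, 64, 128) are illustrative, not from the text:

```python
def linear_params(in_dim, out_dim):
    """Parameter count of one fully connected layer: weights + biases."""
    return in_dim * out_dim + out_dim

hidden = 128
print(linear_params(784, hidden))  # 784-dim input: 100480 parameters
print(linear_params(64, hidden))   # reduced 64-dim input: 8320 parameters
```

Reducing the input from 784 to 64 dimensions shrinks this first layer by more than a factor of ten, which illustrates the lightweighting benefit described above.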

Preventing Overfitting

Appropriate dimension reduction is known to be able to prevent overfitting to some extent.