Dimensionality Reduction in Data Science
Definition
Suppose a data set $X \subset \mathbb{R}^{n}$ is given. A mapping of the following form, with $m < n$, is called a dimension reduction.
$$ r : X \to \mathbb{R}^{m} $$
More commonly, in machine learning, any method that reduces the number of input variables while retaining as much of the model's performance as possible is called a dimension reduction technique.
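For example, the coordinate projection onto the first $m$ components is the simplest mapping of this kind:

$$ \pi(x_{1}, \dots, x_{n}) = (x_{1}, \dots, x_{m}) $$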
Explanation
Dimension reduction, as the name suggests, refers to reducing the dimensionality of vectors. It is often used to make data easier and more intuitive to understand. How the reduction is done varies by algorithm: it may involve outright deleting certain components, or creating new, lower-dimensional data from the existing data according to predefined rules. Well-known techniques include principal component analysis (PCA) and $t$-SNE.
Purpose
Visualization
It is practically impossible to visualize data with four or more dimensions directly. Even three-dimensional data can be difficult to visualize, depending on its form: it is hard to draw a picture that faithfully represents the features of the data, because a three-dimensional shape may look different depending on the viewpoint. In such cases, reducing the dimensionality before drawing can make the features of the data easier to grasp. The picture below shows an example where the same data looks markedly different depending on the viewing direction; the right picture is the projection of the data on the left onto the $xy$-plane.
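As a minimal sketch of this idea (the data here is synthetic, not the data from the original figure), the following Python snippet draws a 3D point cloud next to its projection onto the $xy$-plane:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 3D data: two clusters separated in x and y but overlapping in z
rng = np.random.default_rng(0)
a = rng.normal(loc=[0, 0, 0], scale=0.5, size=(100, 3))
b = rng.normal(loc=[2, 2, 0], scale=0.5, size=(100, 3))

fig = plt.figure(figsize=(10, 4))

# Left panel: the original three-dimensional scatter plot
ax3d = fig.add_subplot(1, 2, 1, projection="3d")
ax3d.scatter(*a.T, label="cluster A")
ax3d.scatter(*b.T, label="cluster B")
ax3d.set_title("3D data")

# Right panel: the same data with the z-coordinate simply dropped
ax2d = fig.add_subplot(1, 2, 2)
ax2d.scatter(a[:, 0], a[:, 1], label="cluster A")
ax2d.scatter(b[:, 0], b[:, 1], label="cluster B")
ax2d.set_title("Projection onto the xy-plane")
ax2d.legend()

plt.tight_layout()
plt.show()
```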
The Iris dataset, which is four-dimensional, is visualized in many data science textbooks by splitting it into several two-dimensional figures, as sketched below.
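Such a grid of pairwise scatter plots can be produced, for instance, with scikit-learn's bundled copy of the Iris data and pandas' `scatter_matrix` helper (a sketch, not the textbook figures themselves):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the 4-dimensional Iris data as a DataFrame
iris = load_iris(as_frame=True)
df = iris.data  # columns: sepal length/width, petal length/width

# Draw every 2D projection of the 4D data as a grid of scatter plots,
# colored by species
pd.plotting.scatter_matrix(df, c=iris.target, figsize=(8, 8), diagonal="hist")
plt.show()
```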
Focus and Selection
Dimension reduction can be used to discard less important information in order to focus on the important information, where “less important information” means noise or redundant information. For example, in the first table below, the first column has the same value for every row. The second and third columns have different values but carry essentially the same information, since each grade corresponds to exactly one group. Dimension reduction can therefore be performed by discarding the first column and either the second or the third. The second table summarizes weather information for Daegu. At first glance there seems to be no unnecessary information, but since daily temperature range = maximum temperature − minimum temperature, these three columns are linearly dependent. The resulting multicollinearity can destabilize a regression analysis, so deleting the fourth column to remove it is another example of dimension reduction. (A pandas sketch of both cleanups follows the tables.)
School | Grade | Group | Name |
---|---|---|---|
Hive High School | 3rd grade | fromis_9 | LEE NA GYUNG |
Hive High School | 3rd grade | fromis_9 | BAEK JI HEON |
Hive High School | 2nd grade | LE SSERAFIM | KIM CHAEWON |
Hive High School | 2nd grade | LE SSERAFIM | HUH YUNJIN |
Hive High School | 1st grade | NewJeans | HAERIN |
Hive High School | 1st grade | NewJeans | MINJI |
Date | High Temp | Low Temp | Daily Temp Range | Precipitation Probability |
---|---|---|---|---|
19th | 32° | 24° | 8° | 60% |
20th | 33° | 22° | 11° | 0% |
21st | 32° | 23° | 9° | 30% |
22nd | 30° | 21° | 9° | 60% |
23rd | 31° | 24° | 7° | 60% |
24th | 33° | 25° | 8° | 60% |
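Below is a minimal pandas sketch of both cleanups; the DataFrames and column names are illustrative stand-ins for the tables above:

```python
import pandas as pd

# First table: the school column is constant, and Grade determines Group
students = pd.DataFrame({
    "School": ["Hive High School"] * 4,
    "Grade": [3, 3, 2, 2],
    "Group": ["fromis_9", "fromis_9", "LE SSERAFIM", "LE SSERAFIM"],
    "Name": ["LEE NA GYUNG", "BAEK JI HEON", "KIM CHAEWON", "HUH YUNJIN"],
})

# Drop columns with only one unique value, then drop one of the two
# columns that carry the same information
students = students.loc[:, students.nunique() > 1]
students = students.drop(columns=["Group"])  # redundant with "Grade"

# Second table: Range = High - Low, so drop it to remove multicollinearity
weather = pd.DataFrame({
    "High": [32, 33, 32], "Low": [24, 22, 23], "Range": [8, 11, 9],
})
weather = weather.drop(columns=["Range"])
```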
Lightweighting
Reducing the dimension of the data means there are fewer numbers to store, which reduces the data's storage size. In artificial neural networks, an MLP consists of linear layers, so the dimension of the input data directly determines the number of parameters in the first layer; dimension reduction can therefore shrink the model, as the sketch below illustrates. Even in models such as CNNs, where the input dimensions do not affect the number of parameters, reducing the dimension still offers computational speed advantages.
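To make the effect on parameter counts concrete, here is a small sketch (the layer width of 256 and the input sizes are made up for illustration) that counts the parameters of the first fully connected layer of an MLP:

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Parameters of a fully connected layer: weights plus biases."""
    return d_in * d_out + d_out

# First hidden layer with 256 units
print(linear_params(10_000, 256))  # 2,560,256 parameters
print(linear_params(100, 256))     #    25,856 after reducing the input to 100 dims
```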
Preventing Overfitting
Appropriate dimension reduction is known to help prevent overfitting to some extent: with fewer input variables, a model has fewer degrees of freedom with which to memorize noise in the training data.
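As an illustrative sketch (the data is synthetic and the exact scores will vary with the dataset and random seed), one can compare a classifier trained on many noisy features against the same classifier trained after PCA:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 500 samples, 100 features, only 5 of which are informative
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
reduced = make_pipeline(PCA(n_components=5),
                        LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)

# A large train/test gap signals overfitting; reducing the dimension
# first often narrows that gap
for name, model in [("all 100 features", full), ("PCA to 5 dims", reduced)]:
    print(name, "train:", model.score(X_tr, y_tr), "test:", model.score(X_te, y_te))
```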