Converting Categorical Data to Numeric Data in R

Overview

This post is for readers whose numeric values have been stored by R as categorical data (a factor) and who find that converting them back to continuous numeric data does not give the values they intended. Most of the post explains the principle, so if you only need the conclusion, start from the Practical Example section below.

Note that the term “Cast” is commonly used when changing data types.

Principle

When conducting statistical analysis using R, understanding the data itself is more crucial than the technique.

$20180327\_122310.png$

This is what you see after loading the iris data, inspecting it, and checking its structure. The str() function briefly shows the type of each column and a few example values. Although its output is not as visually appealing as that of head(), it is often used because the data types stay easy to grasp as the number of variables increases.

$20180327\_124419.png$

This is where the is.X() and class() functions come in. is.X() returns a logical value indicating whether the given argument is X. It answers for the object as a whole rather than element by element, so a data frame is not numeric even if most of its columns are. The class() function returns the data type as a character string.
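The difference between checking an object as a whole and checking each of its columns can be sketched as follows. This is only an illustration on the built-in iris data; the use of sapply() and the variable names are additions, not part of the original example.

```r
# is.numeric() answers for the whole object, so a data frame is never numeric.
example <- iris
whole_check  <- is.numeric(example)           # FALSE: a data frame is a list of columns
column_check <- sapply(example, is.numeric)   # one logical value per column
column_class <- sapply(example, class)        # one class name per column

print(whole_check)
print(column_check)
print(column_class)
```

Applying the check column by column makes it easy to spot which variables were read as factors.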

$20180327\_124802.png$

On the other hand, the as.X() function converts the given argument into X as far as possible. By the same logic, it might seem that applying as.numeric() to categorical data made of numbers would suffice, but you first need to understand how categorical data works.
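What "as far as possible" means can be seen in a small sketch. The input values below are made up for illustration; the point is only that a conversion that cannot succeed yields NA (with a warning) rather than an error.

```r
# as.X() converts when it can and returns NA when it cannot.
ok  <- as.numeric("3.14")                      # a numeric-looking string converts cleanly
bad <- suppressWarnings(as.numeric("three"))   # not parseable as a number, so NA
txt <- as.character(42)                        # a number converts to a string

print(ok)
print(bad)
print(txt)
```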

Practical Example

Consider the following example:

$20180327\_130132.png$

Notice that x1 was not converted correctly, whereas x2 was converted exactly as intended. The principle is as follows.

As you can see, categorical data has no meaning as numbers; each value simply belongs to one of the classes. Therefore, as.numeric() converts by category, mapping the categories one-to-one onto the natural numbers $1, 2, 3, \cdots$. For a set of categories like $\left\{ a, 2, c , \cdots , z_{2} \right\}$, whose members are not all numbers, this is the most appropriate behavior.
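The mapping described above can be inspected directly, since a factor stores its labels and its integer codes separately. A minimal sketch, in which the letter vector f is a made-up example and not part of the original post:

```r
# A factor keeps its labels (levels) as strings and its values as integer codes.
x <- factor(c(1, 2, 3, 128, 67915))
x_levels <- levels(x)       # the labels, stored as character strings
x_codes  <- as.integer(x)   # the codes: the position of each value among the levels

# The same happens with non-numeric categories: levels are sorted, then numbered.
f <- factor(c("b", "a", "c", "a"))
f_codes <- as.numeric(f)    # "a" is level 1, "b" is level 2, "c" is level 3

print(x_levels)
print(x_codes)
print(f_codes)
```

as.numeric() on a factor returns these codes, which is why 128 and 67915 come back as 4 and 5.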

To retain the given numbers as they are, first convert the factor into strings and then into numbers. Converting categories made of numbers into strings loses no information, because the labels are kept as they are instead of being mapped to natural numbers. Converting numeric strings into numbers loses no information either, because no new numbers have to be generated.
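The two-step conversion can be sketched as below, together with an alternative that R's help page for factor describes as slightly more efficient: converting the levels once and indexing them by the factor. The vector x here is a made-up example with repeated values, chosen to show that both approaches restore every element.

```r
# Two ways to recover the original numbers from a factor with numeric labels.
x <- factor(c(10, 20, 10, 300))

via_string <- as.numeric(as.character(x))   # factor -> strings -> numbers
via_levels <- as.numeric(levels(x))[x]      # convert each level once, then index by x

print(via_string)
print(via_levels)
```

Both give the same result; the second converts only the distinct levels, which can matter when the factor is long and has few levels.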

Understanding functions and data types allows you to use various handy tricks in data handling. It’s a valuable weapon that shouldn’t be underestimated just because it’s not directly related to statistics.

Code

Below is the example code.

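# Load the built-in iris data and inspect it
example <- iris
head(example)
str(example)

# A numeric column
is.numeric(example$Sepal.Length)
is.character(example$Sepal.Length)
class(example$Sepal.Length)

# A categorical (factor) column
is.numeric(example$Species)
is.character(example$Species)
class(example$Species)

# The data frame as a whole
is.numeric(example)
is.character(example)
class(example)

# A string that looks like a number
one <- '1'
is.numeric(one)
is.character(one)
class(one)

# The same value after casting with as.numeric()
one <- as.numeric(one)
is.numeric(one)
is.character(one)
class(one)

# A factor whose levels are numbers
x <- as.factor(c(1, 2, 3, 128, 67915)); x
x1 <- as.numeric(x); x1                 # level codes 1 2 3 4 5, not the original values
is.numeric(x1)
x2 <- as.numeric(as.character(x)); x2   # original values 1 2 3 128 67915
is.numeric(x2)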