Importing External Data in R
Overview
R was built for statistics, so importing data is designed to be convenient. The most commonly used options of the basic import function are:
read.table(file, header = FALSE, sep = "", na.strings = "NA", fileEncoding = "")
Function Introduction
read.table() is a function for importing data tables, and it offers various useful options such as those shown above. There are more options available, but the ones listed here are used frequently and are essential to know. Their descriptions follow:
- (1) header: An option for how the first line is treated. Setting header = TRUE excludes the first line from the data and uses it as the column names. The default shown above is header = FALSE, which is rarely what you want, since most data files start with a header row.
- (2) sep: An option for the delimiter. Since much data is saved in CSV (comma-separated values) format, read.csv() is commonly used. read.csv() has sep = "," as its default, so there is no need to set it, which makes the code slightly cleaner. However, there are occasions when you use read.table() directly, or when the data is separated by tabs or slashes, and then sep has to be set explicitly. If stray \t (tab) characters end up stuck inside a column after import, the delimiter is almost certainly the problem.
- (3) na.strings: An option for missing values. Data that does not come from a textbook example but from somewhere else is likely to be poorly organized and full of missing values. A non-numeric string in a continuous variable is very often a marker for a missing value, and if it is not handled, the whole column is read as text instead of numbers, which complicates data processing. Declaring these markers as missing values at import time turns them neatly into NA, which is why this option is used so often. Note that na.strings = c("a", "b", "c") handles several missing-value representations at once by passing a character vector.
- (4) fileEncoding: An option for the encoding of the imported file. Computer science majors deal with ‘data’ in a more general sense in C or Python, but those who learn statistics first might overlook this aspect. Truthfully, it is not used often, but that does not mean one should not know it. Knowing it can save you from being blindsided, because when garbled characters appear, checking the data file's encoding and passing it here usually solves the issue.
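The four options can be tried together on a small file. The following is only a minimal sketch: the file contents, the column names, and the missing-value markers "-" and "unknown" are made up for illustration, and the file is written as UTF-8 purely so that fileEncoding has something to match.

```r
# Create a small tab-separated file in a temporary location (made-up data).
path <- tempfile(fileext = ".txt")
con <- file(path, open = "w", encoding = "UTF-8")
writeLines(c(
  "id\theight\tweight",
  "1\t170.2\t65",
  "2\t-\t70",
  "3\t165.0\tunknown"
), con)
close(con)

# Without header = TRUE, the header line is read as an ordinary data row
# and the columns get the automatic names V1, V2, V3.
raw <- read.table(path, sep = "\t")

# With the options discussed above:
df <- read.table(
  path,
  header       = TRUE,                # first line holds the column names
  sep          = "\t",                # fields are separated by tabs
  na.strings   = c("-", "unknown"),   # both markers become NA
  fileEncoding = "UTF-8"              # encoding the file was written in
)

str(df)
```

Because the missing-value markers are converted to NA on import, height and weight come back as numeric columns rather than text.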
Error: EOF within quoted string
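When importing large data, a warning message like the one above can appear. The size itself is not the cause: the message means a quote character (for example an apostrophe inside a text field) was opened and never closed, and the larger the file, the more likely it contains one. In such cases, simply adding quote="" to the options disables quote handling and usually solves the problem. However, a better solution is to use a better function. More often than not, the problem is the user, not the data. Remember: honestly, the problem is probably you.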
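As a rough illustration of the quote="" fix, here is a minimal sketch with a made-up file whose text column contains an apostrophe. The file name, columns, and values are assumptions for the example only.

```r
# Create a small tab-separated file with an apostrophe in a text field.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tcomment",
  "1\tdon't know",
  "2\tfine",
  "3\tok"
), path)

# With the default quote = "\"'", an unmatched ' or " may be taken as the
# start of a quoted string, so the reader can run past the end of the line.
# quote = "" turns quote handling off, and every quote character is kept
# as ordinary text.
df <- read.table(
  path,
  header = TRUE,
  sep = "\t",
  quote = "",
  stringsAsFactors = FALSE  # keep text columns as character on older R too
)

print(df)
```

The trade-off is that with quote="" a field can no longer contain the delimiter itself, since there is nothing left to protect it.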