Reading Large CSV Files in Python at Once
Code
import csv

y_test = []
with open('y_test.csv', 'r', encoding='utf-8') as y_csv:
    rdr = csv.reader(y_csv)
    for line in rdr:
        y_test.append(line[0])  # keep only the first column of each row
Normally, when reading a csv file, one would open it with the built-in Python function open
and process it line by line, as shown above. However, this approach scales poorly for big data because the pure-Python loop handles one row at a time. For instance, a file exceeding 700MB might not seem very large, but reading it row by row this way takes a very long time.
pandas package
For handling such data, it is convenient to use the pandas package.
Functions
- pandas.read_csv(filepath_or_buffer): Reads the .csv file given by filepath_or_buffer and returns it as a dataframe. [ NOTE: Pass header=None to read a file that has no header row. If it is not specified, the first row is treated as the header, so the first data row is silently lost. If the row count comes out off by exactly one, this is the likely reason; see the sketch after this list. ]
- pandas.read_excel: Reads *.xlsx files. Basic usage is similar to pandas.read_csv(filepath_or_buffer), and since Excel workbooks contain sheets, the sheet_name parameter specifies which sheet to read.
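A minimal sketch of the pandas equivalent of the csv-module loop above, assuming the same y_test.csv file (the Excel file name data.xlsx and the sheet name 'Sheet1' are hypothetical placeholders):

import pandas as pd

# header=None keeps the first line as data instead of treating it as column names
y_test = pd.read_csv('y_test.csv', header=None)

# sheet_name selects which sheet of the workbook to load
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')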
Methods
- X.to_csv(path_or_buf): Writes the dataframe X out as a .csv file at path_or_buf.
- X.to_excel(path_or_buf): Writes the dataframe X out as a .xlsx file at path_or_buf.
- .head(): Returns the first few rows (five by default). Essential for checking that the data was read correctly.
- .tail(): Returns the last few rows (five by default). A usage sketch follows this list.
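For example, writing a dataframe back out and spot-checking it might look like this (the output file names are hypothetical; index=False omits pandas' row-index column, and to_excel needs an engine such as openpyxl installed):

import pandas as pd

X = pd.read_csv('y_test.csv', header=None)

X.to_csv('y_test_copy.csv', index=False, header=False)  # plain csv, no index column
X.to_excel('y_test_copy.xlsx', index=False)             # .xlsx output

print(X.head())   # first 5 rows by default
print(X.tail(3))  # last 3 rows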
Properties
- .shape: The shape of the data. If there are $n$ rows and $p$ columns, it is the tuple (n, p).
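Checking .shape is also a quick way to spot the off-by-one header problem noted above; a minimal sketch reading the same file twice:

import pandas as pd

with_header = pd.read_csv('y_test.csv')             # first row consumed as the header
no_header = pd.read_csv('y_test.csv', header=None)  # first row kept as data

# For a file with n lines and p columns:
print(with_header.shape)  # (n - 1, p)
print(no_header.shape)    # (n, p)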