Reading CSV data into pandas

Posted in Data Processing, Python on August 24, 2022 by tzelleke ‐ 12 min read

pandas.read_csv is the workhorse for reading structured textfiles into a DataFrame.

For the purpose of this tutorial we use the io.StringIO class to create sample data from inline strings. We provide the StringIO object as file-like input to pandas.read_csv. Therefore, we import the io module from Python’s standard library besides pandas.

import io
import pandas as pd

csv = io.StringIO('\n'.join([
    'id,name,position,dob,height,weight',
    '1,Russel,Quarterback,1988-11-29,"5 ft 11 in","215 lb"',
    '2,Patrick,Quarterback,1995-09-17,"6 ft 2 in","225 lb"',
    '3,Saquon,Running back,1997-02-09,"6 ft 0 in","232 lb"',
    '4,Micah,Linebacker,1999-05-26,"6 ft 3 in","245 lb"',
]))

pd.read_csv(csv)
pandas.DataFrame
 idnamepositiondobheightweight
01RusselQuarterback1988-11-295 ft 11 in215 lb
12PatrickQuarterback1995-09-176 ft 2 in225 lb
23SaquonRunning back1997-02-096 ft 0 in232 lb
34MicahLinebacker1999-05-266 ft 3 in245 lb

We can limit the number of rows that are read with the nrows option. This is useful with larger and/or unfamiliar datasets as it allows to have a peek at the data. For instance, to exclude certain columns from the import.

We also need to reset the StringIO object with csv.seek(0) before it can be read in again.

csv.seek(0)
pd.read_csv(
    csv,
    nrows=1,
)
pandas.DataFrame
 idnamepositiondobheightweight
01RusselQuarterback1988-11-295 ft 11 in215 lb
  • sep
  • header
csv.seek(0)
pd.read_csv(csv)
pandas.DataFrame
 idnamepositiondobheightweight
01RusselQuarterback1988-11-295 ft 11 in215 lb
12PatrickQuarterback1995-09-176 ft 2 in225 lb
23SaquonRunning back1997-02-096 ft 0 in232 lb
34MicahLinebacker1999-05-266 ft 3 in245 lb

Load only required data

We can load only a subset of the data using the usecols option.

csv.seek(0)
pd.read_csv(
    csv,
    usecols=['id', 'name', 'dob'],
)
pandas.DataFrame
 idnamedob
01Russel1988-11-29
12Patrick1995-09-17
23Saquon1997-02-09
34Micah1999-05-26

Set DataFrame index

With the index_col option we can specify a column to populate the index of the DataFrame.

When used in combination with the usecols option the index column must be included in the subset of columns.

csv.seek(0)
pd.read_csv(
    csv,
    index_col='id',
    usecols=['id', 'name', 'dob'],
)
pandas.DataFrame
 namedob
id  
1Russel1988-11-29
2Patrick1995-09-17
3Saquon1997-02-09
4Micah1999-05-26

Specify column datatype

With the dtype option we can specify the datatype of columns.

csv.seek(0)
pd.read_csv(
    csv,
    index_col='id',
    dtype=dict(
        name='string',
        position='category',
    ),
)
pandas.DataFrame
 namepositiondobheightweight
id     
1RusselQuarterback1988-11-295 ft 11 in215 lb
2PatrickQuarterback1995-09-176 ft 2 in225 lb
3SaquonRunning back1997-02-096 ft 0 in232 lb
4MicahLinebacker1999-05-266 ft 3 in245 lb

Date parsing

With the index_col option we can specify a column to populate the index of the DataFrame.

When used in combination with the usecols option the index column must be included in the subset of columns.

csv.seek(0)
pd.read_csv(
    csv,
    index_col='id',
    parse_dates=['dob'],
    date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d'),
)
pandas.DataFrame
 namepositiondobheightweight
id     
1RusselQuarterback1988-11-29 00:00:005 ft 11 in215 lb
2PatrickQuarterback1995-09-17 00:00:006 ft 2 in225 lb
3SaquonRunning back1997-02-09 00:00:006 ft 0 in232 lb
4MicahLinebacker1999-05-26 00:00:006 ft 3 in245 lb

Set categorical data

With the index_col option we can specify a column to populate the index of the DataFrame.

When used in combination with the usecols option the index column must be included in the subset of columns.

csv.seek(0)
pd.read_csv(
    csv,
    index_col='id',
    dtype=dict(position='category'),
)
pandas.DataFrame
 namepositiondobheightweight
id     
1RusselQuarterback1988-11-295 ft 11 in215 lb
2PatrickQuarterback1995-09-176 ft 2 in225 lb
3SaquonRunning back1997-02-096 ft 0 in232 lb
4MicahLinebacker1999-05-266 ft 3 in245 lb

Customized data parsing

With the converters option we can individually customize data parsing for each column.

csv.seek(0)
pd.read_csv(
    csv,
    index_col='id',
    converters=dict(
        height=lambda x: [int(_) for _ in x.strip().split(' ')[::2]],
        weight=lambda x: round(
            .453592 * int(x.strip().split(' ')[0])
        ),
    ),
)
pandas.DataFrame
 namepositiondobheightweight
id     
1RusselQuarterback1988-11-29[5, 11]98
2PatrickQuarterback1995-09-17[6, 2]102
3SaquonRunning back1997-02-09[6, 0]105
4MicahLinebacker1999-05-26[6, 3]111

Complete example

csv.seek(0)
pd.read_csv(
    csv,
    index_col='id',
    dtype=dict(
        name='string',
        position='category',
    ),
    parse_dates=['dob'],
    date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d'),
    converters=dict(
        height=lambda x: x,
        weight=lambda x: x,
    ),
)
pandas.DataFrame
 namepositiondobheightweight
id     
1RusselQuarterback1988-11-29 00:00:005 ft 11 in215 lb
2PatrickQuarterback1995-09-17 00:00:006 ft 2 in225 lb
3SaquonRunning back1997-02-09 00:00:006 ft 0 in232 lb
4MicahLinebacker1999-05-26 00:00:006 ft 3 in245 lb