Reading CSV data into pandas

Posted in Data Processing, Python on August 24, 2022 by tzelleke ‐ 12 min read

pandas.read_csv is the workhorse for reading structured textfiles into a DataFrame.

For the purpose of this tutorial we use the io.StringIO class to create sample data from inline strings. We provide the StringIO object as file-like input to pandas.read_csv. Therefore, we import the io module from Python’s standard library besides pandas.

import io
import pandas as pd

csv = io.StringIO('\n'.join([
    'id,name,position,dob,height,weight',
    '1,Russel,Quarterback,1988-11-29,"5 ft 11 in","215 lb"',
    '2,Patrick,Quarterback,1995-09-17,"6 ft 2 in","225 lb"',
    '3,Saquon,Running back,1997-02-09,"6 ft 0 in","232 lb"',
    '4,Micah,Linebacker,1999-05-26,"6 ft 3 in","245 lb"',
]))

pd.read_csv(csv)

pandas.DataFrame
	id	name	position	dob	height	weight
0	1	Russel	Quarterback	1988-11-29	5 ft 11 in	215 lb
1	2	Patrick	Quarterback	1995-09-17	6 ft 2 in	225 lb
2	3	Saquon	Running back	1997-02-09	6 ft 0 in	232 lb
3	4	Micah	Linebacker	1999-05-26	6 ft 3 in	245 lb

We can limit the number of rows that are read with the nrows option. This is useful with larger and/or unfamiliar datasets as it allows to have a peek at the data. For instance, to exclude certain columns from the import.

We also need to reset the StringIO object with csv.seek(0) before it can be read in again.

csv.seek(0)
pd.read_csv(
    csv,
    nrows=1,
)

pandas.DataFrame
	id	name	position	dob	height	weight
0	1	Russel	Quarterback	1988-11-29	5 ft 11 in	215 lb

sep
header

csv.seek(0)
pd.read_csv(csv)

pandas.DataFrame
	id	name	position	dob	height	weight
0	1	Russel	Quarterback	1988-11-29	5 ft 11 in	215 lb
1	2	Patrick	Quarterback	1995-09-17	6 ft 2 in	225 lb
2	3	Saquon	Running back	1997-02-09	6 ft 0 in	232 lb
3	4	Micah	Linebacker	1999-05-26	6 ft 3 in	245 lb

Load only required data

We can load only a subset of the data using the usecols option.

csv.seek(0)
pd.read_csv(
    csv,
    usecols=['id', 'name', 'dob'],
)

pandas.DataFrame
	id	name	dob
0	1	Russel	1988-11-29
1	2	Patrick	1995-09-17
2	3	Saquon	1997-02-09
3	4	Micah	1999-05-26

Set DataFrame index

With the index_col option we can specify a column to populate the index of the DataFrame.

When used in combination with the usecols option the index column must be included in the subset of columns.

csv.seek(0)
pd.read_csv(
    csv,
    index_col='id',
    usecols=['id', 'name', 'dob'],
)

pandas.DataFrame
	name	dob
id
1	Russel	1988-11-29
2	Patrick	1995-09-17
3	Saquon	1997-02-09
4	Micah	1999-05-26

Specify column datatype

With the dtype option we can specify the datatype of columns.

csv.seek(0)
pd.read_csv(
    csv,
    index_col='id',
    dtype=dict(
        name='string',
        position='category',
    ),
)

pandas.DataFrame
	name	position	dob	height	weight
id
1	Russel	Quarterback	1988-11-29	5 ft 11 in	215 lb
2	Patrick	Quarterback	1995-09-17	6 ft 2 in	225 lb
3	Saquon	Running back	1997-02-09	6 ft 0 in	232 lb
4	Micah	Linebacker	1999-05-26	6 ft 3 in	245 lb

Date parsing

With the index_col option we can specify a column to populate the index of the DataFrame.

When used in combination with the usecols option the index column must be included in the subset of columns.

csv.seek(0)
pd.read_csv(
    csv,
    index_col='id',
    parse_dates=['dob'],
    date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d'),
)

pandas.DataFrame
	name	position	dob	height	weight
id
1	Russel	Quarterback	1988-11-29 00:00:00	5 ft 11 in	215 lb
2	Patrick	Quarterback	1995-09-17 00:00:00	6 ft 2 in	225 lb
3	Saquon	Running back	1997-02-09 00:00:00	6 ft 0 in	232 lb
4	Micah	Linebacker	1999-05-26 00:00:00	6 ft 3 in	245 lb

Set categorical data

With the index_col option we can specify a column to populate the index of the DataFrame.

When used in combination with the usecols option the index column must be included in the subset of columns.

csv.seek(0)
pd.read_csv(
    csv,
    index_col='id',
    dtype=dict(position='category'),
)

pandas.DataFrame
	name	position	dob	height	weight
id
1	Russel	Quarterback	1988-11-29	5 ft 11 in	215 lb
2	Patrick	Quarterback	1995-09-17	6 ft 2 in	225 lb
3	Saquon	Running back	1997-02-09	6 ft 0 in	232 lb
4	Micah	Linebacker	1999-05-26	6 ft 3 in	245 lb

Customized data parsing

With the converters option we can individually customize data parsing for each column.

csv.seek(0)
pd.read_csv(
    csv,
    index_col='id',
    converters=dict(
        height=lambda x: [int(_) for _ in x.strip().split(' ')[::2]],
        weight=lambda x: round(
            .453592 * int(x.strip().split(' ')[0])
        ),
    ),
)

pandas.DataFrame
	name	position	dob	height	weight
id
1	Russel	Quarterback	1988-11-29	[5, 11]	98
2	Patrick	Quarterback	1995-09-17	[6, 2]	102
3	Saquon	Running back	1997-02-09	[6, 0]	105
4	Micah	Linebacker	1999-05-26	[6, 3]	111

Complete example

csv.seek(0)
pd.read_csv(
    csv,
    index_col='id',
    dtype=dict(
        name='string',
        position='category',
    ),
    parse_dates=['dob'],
    date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d'),
    converters=dict(
        height=lambda x: x,
        weight=lambda x: x,
    ),
)

pandas.DataFrame
	name	position	dob	height	weight
id
1	Russel	Quarterback	1988-11-29 00:00:00	5 ft 11 in	215 lb
2	Patrick	Quarterback	1995-09-17 00:00:00	6 ft 2 in	225 lb
3	Saquon	Running back	1997-02-09 00:00:00	6 ft 0 in	232 lb
4	Micah	Linebacker	1999-05-26 00:00:00	6 ft 3 in	245 lb

pandas CSV

Reading CSV data into pandas

Load only required data #

Set DataFrame index #

Specify column datatype #

Date parsing #

Set categorical data #

Customized data parsing #

Complete example #

Related posts

Text processing in pandas

pydash - a Swiss Army knife for Python

Load only required data

Set DataFrame index

Specify column datatype

Date parsing

Set categorical data

Customized data parsing

Complete example