Reading CSV data into pandas
Posted in Data Processing, Python on August 24, 2022 by tzelleke ‐ 12 min read
pandas.read_csv is the workhorse for reading structured textfiles into a DataFrame.
For the purpose of this tutorial we use the io.StringIO class to create sample data from inline strings.
We provide the StringIO
object as file-like input to pandas.read_csv
.
Therefore, we import the io
module from Python’s standard library besides pandas
.
import io
import pandas as pd
csv = io.StringIO('\n'.join([
'id,name,position,dob,height,weight',
'1,Russel,Quarterback,1988-11-29,"5 ft 11 in","215 lb"',
'2,Patrick,Quarterback,1995-09-17,"6 ft 2 in","225 lb"',
'3,Saquon,Running back,1997-02-09,"6 ft 0 in","232 lb"',
'4,Micah,Linebacker,1999-05-26,"6 ft 3 in","245 lb"',
]))
pd.read_csv(csv)
id | name | position | dob | height | weight | |
---|---|---|---|---|---|---|
0 | 1 | Russel | Quarterback | 1988-11-29 | 5 ft 11 in | 215 lb |
1 | 2 | Patrick | Quarterback | 1995-09-17 | 6 ft 2 in | 225 lb |
2 | 3 | Saquon | Running back | 1997-02-09 | 6 ft 0 in | 232 lb |
3 | 4 | Micah | Linebacker | 1999-05-26 | 6 ft 3 in | 245 lb |
We can limit the number of rows that are read with the nrows
option.
This is useful with larger and/or unfamiliar datasets as it allows to have a peek at the data.
For instance, to exclude certain columns from the import.
We also need to reset the StringIO
object with csv.seek(0)
before it can be read in again.
csv.seek(0)
pd.read_csv(
csv,
nrows=1,
)
id | name | position | dob | height | weight | |
---|---|---|---|---|---|---|
0 | 1 | Russel | Quarterback | 1988-11-29 | 5 ft 11 in | 215 lb |
- sep
- header
csv.seek(0)
pd.read_csv(csv)
id | name | position | dob | height | weight | |
---|---|---|---|---|---|---|
0 | 1 | Russel | Quarterback | 1988-11-29 | 5 ft 11 in | 215 lb |
1 | 2 | Patrick | Quarterback | 1995-09-17 | 6 ft 2 in | 225 lb |
2 | 3 | Saquon | Running back | 1997-02-09 | 6 ft 0 in | 232 lb |
3 | 4 | Micah | Linebacker | 1999-05-26 | 6 ft 3 in | 245 lb |
Load only required data
We can load only a subset of the data using the usecols
option.
csv.seek(0)
pd.read_csv(
csv,
usecols=['id', 'name', 'dob'],
)
id | name | dob | |
---|---|---|---|
0 | 1 | Russel | 1988-11-29 |
1 | 2 | Patrick | 1995-09-17 |
2 | 3 | Saquon | 1997-02-09 |
3 | 4 | Micah | 1999-05-26 |
Set DataFrame index
With the index_col
option we can specify a column to populate the index of the DataFrame.
When used in combination with the usecols
option the index column must be included in the subset of columns.
csv.seek(0)
pd.read_csv(
csv,
index_col='id',
usecols=['id', 'name', 'dob'],
)
name | dob | |
---|---|---|
id | ||
1 | Russel | 1988-11-29 |
2 | Patrick | 1995-09-17 |
3 | Saquon | 1997-02-09 |
4 | Micah | 1999-05-26 |
Specify column datatype
With the dtype
option we can specify the datatype of columns.
csv.seek(0)
pd.read_csv(
csv,
index_col='id',
dtype=dict(
name='string',
position='category',
),
)
name | position | dob | height | weight | |
---|---|---|---|---|---|
id | |||||
1 | Russel | Quarterback | 1988-11-29 | 5 ft 11 in | 215 lb |
2 | Patrick | Quarterback | 1995-09-17 | 6 ft 2 in | 225 lb |
3 | Saquon | Running back | 1997-02-09 | 6 ft 0 in | 232 lb |
4 | Micah | Linebacker | 1999-05-26 | 6 ft 3 in | 245 lb |
Date parsing
With the index_col
option we can specify a column to populate the index of the DataFrame.
When used in combination with the usecols
option the index column must be included in the subset of columns.
csv.seek(0)
pd.read_csv(
csv,
index_col='id',
parse_dates=['dob'],
date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d'),
)
name | position | dob | height | weight | |
---|---|---|---|---|---|
id | |||||
1 | Russel | Quarterback | 1988-11-29 00:00:00 | 5 ft 11 in | 215 lb |
2 | Patrick | Quarterback | 1995-09-17 00:00:00 | 6 ft 2 in | 225 lb |
3 | Saquon | Running back | 1997-02-09 00:00:00 | 6 ft 0 in | 232 lb |
4 | Micah | Linebacker | 1999-05-26 00:00:00 | 6 ft 3 in | 245 lb |
Set categorical data
With the index_col
option we can specify a column to populate the index of the DataFrame.
When used in combination with the usecols
option the index column must be included in the subset of columns.
csv.seek(0)
pd.read_csv(
csv,
index_col='id',
dtype=dict(position='category'),
)
name | position | dob | height | weight | |
---|---|---|---|---|---|
id | |||||
1 | Russel | Quarterback | 1988-11-29 | 5 ft 11 in | 215 lb |
2 | Patrick | Quarterback | 1995-09-17 | 6 ft 2 in | 225 lb |
3 | Saquon | Running back | 1997-02-09 | 6 ft 0 in | 232 lb |
4 | Micah | Linebacker | 1999-05-26 | 6 ft 3 in | 245 lb |
Customized data parsing
With the converters
option we can individually customize data parsing for each column.
csv.seek(0)
pd.read_csv(
csv,
index_col='id',
converters=dict(
height=lambda x: [int(_) for _ in x.strip().split(' ')[::2]],
weight=lambda x: round(
.453592 * int(x.strip().split(' ')[0])
),
),
)
name | position | dob | height | weight | |
---|---|---|---|---|---|
id | |||||
1 | Russel | Quarterback | 1988-11-29 | [5, 11] | 98 |
2 | Patrick | Quarterback | 1995-09-17 | [6, 2] | 102 |
3 | Saquon | Running back | 1997-02-09 | [6, 0] | 105 |
4 | Micah | Linebacker | 1999-05-26 | [6, 3] | 111 |
Complete example
csv.seek(0)
pd.read_csv(
csv,
index_col='id',
dtype=dict(
name='string',
position='category',
),
parse_dates=['dob'],
date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d'),
converters=dict(
height=lambda x: x,
weight=lambda x: x,
),
)
name | position | dob | height | weight | |
---|---|---|---|---|---|
id | |||||
1 | Russel | Quarterback | 1988-11-29 00:00:00 | 5 ft 11 in | 215 lb |
2 | Patrick | Quarterback | 1995-09-17 00:00:00 | 6 ft 2 in | 225 lb |
3 | Saquon | Running back | 1997-02-09 00:00:00 | 6 ft 0 in | 232 lb |
4 | Micah | Linebacker | 1999-05-26 00:00:00 | 6 ft 3 in | 245 lb |