Text processing in pandas
Posted in Data Processing, Python on August 19, 2022 by tzelleke ‐ 15 min read
Text processing functionality in pandas is located under the Series.str
namespace.
Let us create some dummy data and load it into a DataFrame.
We import io
besides pandas
and use io.StringIO to provide file-like input to pandas.read_csv.
We specify that the id
column should populate the DataFrame’s index.
Additionally, we explicitely specify string
datatype for name
and phones
columns.
import io
import pandas as pd
csv = io.StringIO('\n'.join([
'id,name,phones',
'1,Russel,"01234567\n02345678\n03456789"',
'2,Patrick,"04123456\n05234567\n06345678"',
]))
df = pd.read_csv(
csv,
index_col='id',
dtype={
'name': 'string',
'phones': 'string',
},
)
df
name | phones | |
---|---|---|
id | ||
1 | Russel | 01234567 02345678 03456789 |
2 | Patrick | 04123456 05234567 06345678 |
Split into rows
Entries in the phones
column contain strings with multiple phone numbers separated by line break.
To make individual phone numbers accessible for further processing/analysis we need to extract phone numbers into individual rows of the DataFrame.
We can do that in two steps in pandas:
- Split each entry in the
phones
column into a list of strings usingSeries.str.split
. - Distribute list elements to individual rows using
Series.explode
.
Series.str.split splits each string in a Series on a given separator and returns a Series of lists.
Series.explode transforms list items in a Series of lists (or list-like objects) into separate entries with index labels replicated.
When we perform these steps on the phones
column we obtain a Series of individual phone numbers with index labels carried over from source entries.
df.phones.str.split('\n').explode().astype('string')
id | |
---|---|
1 | 01234567 |
1 | 02345678 |
1 | 03456789 |
2 | 04123456 |
2 | 05234567 |
2 | 06345678 |
At this point we can join the resulting Series to our initial DataFrame. Instead, we can arrive at the same result in a single expression.
df.assign(
phones=df.phones.str.split('\n')
).explode('phones').astype({'phones': 'string'})
name | phones | |
---|---|---|
id | ||
1 | Russel | 01234567 |
1 | Russel | 02345678 |
1 | Russel | 03456789 |
2 | Patrick | 04123456 |
2 | Patrick | 05234567 |
2 | Patrick | 06345678 |
First, we use DataFrame.assign to overwrite the initial phones
column with the Series returned from the split operation.
Then we call explode('phones')
on the resulting DataFrame.
DataFrame.explode works like Series.explode
yet expects a column to operate on. Lastly, we change the datatype of the new phones
column to string
.
After this operation the index id
is no longer unique.
Split into columns
Now we are going to look at the case where the phones
column contains additional information about the type of a phone number.
csv = io.StringIO('\n'.join([
'id,name,phones',
'1,Russel,"mobil: 01234567\nwork: 02345678"',
'2,Patrick,"mobil: 04123456\nprivate: 05234567"',
]))
df = pd.read_csv(
csv,
index_col='id',
dtype={'name': 'string', 'phones': 'string'},
)
df
name | phones | |
---|---|---|
id | ||
1 | Russel | mobil: 01234567 work: 02345678 |
2 | Patrick | mobil: 04123456 private: 05234567 |
As in the previous case we start out by splitting on linebreak and exploding to individual entries.
df.phones.str.split('\n').explode().astype('string')
id | |
---|---|
1 | mobil: 01234567 |
1 | work: 02345678 |
2 | mobil: 04123456 |
2 | private: 05234567 |
We need to append a second split operation, this time on the ‘:
In contrast to the previous split we like the resulting fragments expanded into separate columns in the resulting DataFrame.
(
df.phones
.str.split('\n').explode()
.str.split(': ', expand=True)
.astype('string')
.rename(columns={
0: 'phone_type',
1: 'phone_number',
})
)
phone_type | phone_number | |
---|---|---|
id | ||
1 | mobil | 01234567 |
1 | work | 02345678 |
2 | mobil | 04123456 |
2 | private | 05234567 |
When using expand=True
, the split elements are expanded into separate columns.
Thus, the resulting object is a DataFrame.
If NaN
is present, it is propagated throughout the columns during the split.
So far, we have transformed the phones
Series into a DataFrame with columns phone_type
and phone_number
.
In the following code snippet we combine the processing of the phones
column with a join to the initial DataFrame in one expression.
Lastly, we drop the then redundant phones
column from the resulting DataFrame.
df.join(
df.phones
.str.split('\n').explode()
.str.split(': ', expand=True)
.rename(columns={
0: 'phone_type',
1: 'phone_number',
})
.astype({
'phone_type': 'category',
'phone_number': 'string',
})
).drop('phones', axis='columns')
name | phone_type | phone_number | |
---|---|---|---|
id | |||
1 | Russel | mobil | 01234567 |
1 | Russel | work | 02345678 |
2 | Patrick | mobil | 04123456 |
2 | Patrick | private | 05234567 |
Reshaping to wide format
We can reshape the resulting DataFrame to wide format using the following idiom:
DataFrame.set_index('COLUMN', append=True).unstack('COLUMN')
DataFrame.set_index sets the DataFrame index from existing columns.
By specifying append=True
the column is added to the existing index instead of replacing it.
Thus, the returned DataFrame contains a hierarchical index composed of the initial index extended by the specified column.
DataFrame.unstack moves a level from the hierarchical (row) index into a level of the column index.
(
df.phones
.str.split('\n').explode()
.str.split(': ', expand=True)
.astype('string')
.rename(columns={
0: 'phone_type',
1: 'phone_number',
})
.set_index('phone_type', append=True)
.unstack('phone_type')
.droplevel(0, axis='columns')
)
phone_type | mobil | private | work |
---|---|---|---|
id | |||
1 | 01234567 | <NA> | 02345678 |
2 | 04123456 | 05234567 | <NA> |
Now, we can join the reshaped DataFrame to the main data and drop the now redundant phones
column.
df.join(
df.phones
.str.split('\n').explode()
.str.split(': ', expand=True)
.astype('string')
.rename(columns={
0: 'phone_type',
1: 'phone_number',
})
.set_index(['phone_type'], append=True)
.unstack('phone_type').droplevel(0, axis='columns')
).drop('phones', axis='columns')
name | mobil | private | work | |
---|---|---|---|---|
id | ||||
1 | Russel | 01234567 | <NA> | 02345678 |
2 | Patrick | 04123456 | 05234567 | <NA> |
Data cleaning using regular expressions
We can arrive at the same result as in the previous case with fewer lines of code by leveraging regular expressions in combination with Series.str.extract.
df.join(
df.phones
.str.split('\n').explode().astype('string')
.str.extract(r'^(?P<phone_type>\w+):\s(?P<phone_number>\d+)$')
.set_index(['phone_type'], append=True)
.unstack('phone_type').droplevel(0, axis='columns')
).drop('phones', axis='columns')
name | mobil | private | work | |
---|---|---|---|---|
id | ||||
1 | Russel | 01234567 | <NA> | 02345678 |
2 | Patrick | 04123456 | 05234567 | <NA> |
Let us review the regular expression in detail.
Syntax element | Interpretation |
---|---|
^ | anchors start of string |
(?P<phone_type>\w+) | capture group labeled “phone_type” |
:\s | matches separator, not captured |
(?P<phone_number>\d+) | capture group labeled “phone_number” |
$ | anchors end of string |
Named regular expression capture groups have the following structure in Python:(?P<NAME>PATTERN)
.