Text processing in pandas

Posted in Data Processing, Python on August 19, 2022 by tzelleke ‐ 15 min read

Text processing functionality in pandas is located under the Series.str namespace.

Let us create some dummy data and load it into a DataFrame.

We import io besides pandas and use io.StringIO to provide file-like input to pandas.read_csv.

We specify that the id column should populate the DataFrame’s index. Additionally, we explicitely specify string datatype for name and phones columns.

import io
import pandas as pd

csv = io.StringIO('\n'.join([
    'id,name,phones',
    '1,Russel,"01234567\n02345678\n03456789"',
    '2,Patrick,"04123456\n05234567\n06345678"',
]))

df = pd.read_csv(
    csv,
    index_col='id',
    dtype={
        'name': 'string',
        'phones': 'string',
    },
)
df

pandas.DataFrame
	name	phones
id
1	Russel	01234567 02345678 03456789
2	Patrick	04123456 05234567 06345678

Split into rows

Entries in the phones column contain strings with multiple phone numbers separated by line break. To make individual phone numbers accessible for further processing/analysis we need to extract phone numbers into individual rows of the DataFrame. We can do that in two steps in pandas:

Split each entry in the phones column into a list of strings using Series.str.split.
Distribute list elements to individual rows using Series.explode.

The diagram visualizes the two-step process of extracting individual text lines from multiline text cells in pandas. — Two-step process splitting multiline strings into individual rows in pandas.

Series.str.split splits each string in a Series on a given separator and returns a Series of lists.

Series.explode transforms list items in a Series of lists (or list-like objects) into separate entries with index labels replicated.

When we perform these steps on the phones column we obtain a Series of individual phone numbers with index labels carried over from source entries.

df.phones.str.split('\n').explode().astype('string')

pandas.Series
id
1	01234567
1	02345678
1	03456789
2	04123456
2	05234567
2	06345678

At this point we can join the resulting Series to our initial DataFrame. Instead, we can arrive at the same result in a single expression.

df.assign(
    phones=df.phones.str.split('\n')
).explode('phones').astype({'phones': 'string'})

pandas.DataFrame
	name	phones
id
1	Russel	01234567
1	Russel	02345678
1	Russel	03456789
2	Patrick	04123456
2	Patrick	05234567
2	Patrick	06345678

First, we use DataFrame.assign to overwrite the initial phones column with the Series returned from the split operation.

Then we call explode('phones') on the resulting DataFrame. DataFrame.explode works like Series.explode yet expects a column to operate on. Lastly, we change the datatype of the new phones column to string.

After this operation the index id is no longer unique.

Split into columns

Now we are going to look at the case where the phones column contains additional information about the type of a phone number.

csv = io.StringIO('\n'.join([
    'id,name,phones',
    '1,Russel,"mobil: 01234567\nwork: 02345678"',
    '2,Patrick,"mobil: 04123456\nprivate: 05234567"',
]))

df = pd.read_csv(
    csv,
    index_col='id',
    dtype={'name': 'string', 'phones': 'string'},
)
df

pandas.DataFrame
	name	phones
id
1	Russel	mobil: 01234567 work: 02345678
2	Patrick	mobil: 04123456 private: 05234567

As in the previous case we start out by splitting on linebreak and exploding to individual entries.

df.phones.str.split('\n').explode().astype('string')

pandas.Series
id
1	mobil: 01234567
1	work: 02345678
2	mobil: 04123456
2	private: 05234567

We need to append a second split operation, this time on the ‘:’ character combination.

In contrast to the previous split we like the resulting fragments expanded into separate columns in the resulting DataFrame.

(
    df.phones
    .str.split('\n').explode()
    .str.split(': ', expand=True)
    .astype('string')
    .rename(columns={
        0: 'phone_type',
        1: 'phone_number',
    })
)

pandas.DataFrame
	phone_type	phone_number
id
1	mobil	01234567
1	work	02345678
2	mobil	04123456
2	private	05234567

When using expand=True, the split elements are expanded into separate columns. Thus, the resulting object is a DataFrame. If NaN is present, it is propagated throughout the columns during the split.

So far, we have transformed the phones Series into a DataFrame with columns phone_type and phone_number.

In the following code snippet we combine the processing of the phones column with a join to the initial DataFrame in one expression. Lastly, we drop the then redundant phones column from the resulting DataFrame.

df.join(
    df.phones
    .str.split('\n').explode()
    .str.split(': ', expand=True)
    .rename(columns={
        0: 'phone_type',
        1: 'phone_number',
    })
    .astype({
        'phone_type': 'category',
        'phone_number': 'string',
    })
).drop('phones', axis='columns')

pandas.DataFrame
	name	phone_type	phone_number
id
1	Russel	mobil	01234567
1	Russel	work	02345678
2	Patrick	mobil	04123456
2	Patrick	private	05234567

Reshaping to wide format

We can reshape the resulting DataFrame to wide format using the following idiom:

DataFrame.set_index('COLUMN', append=True).unstack('COLUMN')

DataFrame.set_index sets the DataFrame index from existing columns. By specifying append=True the column is added to the existing index instead of replacing it. Thus, the returned DataFrame contains a hierarchical index composed of the initial index extended by the specified column.

DataFrame.unstack moves a level from the hierarchical (row) index into a level of the column index.

(
    df.phones
    .str.split('\n').explode()
    .str.split(': ', expand=True)
    .astype('string')
    .rename(columns={
        0: 'phone_type',
        1: 'phone_number',
    })
    .set_index('phone_type', append=True)
    .unstack('phone_type')
    .droplevel(0, axis='columns')
)

pandas.DataFrame
phone_type	mobil	private	work
id
1	01234567	<NA>	02345678
2	04123456	05234567	<NA>

Now, we can join the reshaped DataFrame to the main data and drop the now redundant phones column.

df.join(
    df.phones
    .str.split('\n').explode()
    .str.split(': ', expand=True)
    .astype('string')
    .rename(columns={
        0: 'phone_type',
        1: 'phone_number',
    })
    .set_index(['phone_type'], append=True)
    .unstack('phone_type').droplevel(0, axis='columns')
).drop('phones', axis='columns')

pandas.DataFrame
	name	mobil	private	work
id
1	Russel	01234567	<NA>	02345678
2	Patrick	04123456	05234567	<NA>

Data cleaning using regular expressions

We can arrive at the same result as in the previous case with fewer lines of code by leveraging regular expressions in combination with Series.str.extract.

df.join(
    df.phones
    .str.split('\n').explode().astype('string')
    .str.extract(r'^(?P<phone_type>\w+):\s(?P<phone_number>\d+)$')
    .set_index(['phone_type'], append=True)
    .unstack('phone_type').droplevel(0, axis='columns')
).drop('phones', axis='columns')

pandas.DataFrame
	name	mobil	private	work
id
1	Russel	01234567	<NA>	02345678
2	Patrick	04123456	05234567	<NA>

Let us review the regular expression in detail.

Syntax element	Interpretation
`^`	anchors start of string
`(?P<phone_type>\w+)`	capture group labeled “phone_type”
`:\s`	matches separator, not captured
`(?P<phone_number>\d+)`	capture group labeled “phone_number”
`$`	anchors end of string