Text processing in pandas

Posted in Data Processing, Python on August 19, 2022 by tzelleke ‐ 15 min read

Text processing functionality in pandas is located under the Series.str namespace.

Let us create some dummy data and load it into a DataFrame.

We import io besides pandas and use io.StringIO to provide file-like input to pandas.read_csv.

We specify that the id column should populate the DataFrame’s index. Additionally, we explicitely specify string datatype for name and phones columns.

import io
import pandas as pd

csv = io.StringIO('\n'.join([
    'id,name,phones',
    '1,Russel,"01234567\n02345678\n03456789"',
    '2,Patrick,"04123456\n05234567\n06345678"',
]))

df = pd.read_csv(
    csv,
    index_col='id',
    dtype={
        'name': 'string',
        'phones': 'string',
    },
)
df
pandas.DataFrame
 namephones
id  
1Russel01234567
02345678
03456789
2Patrick04123456
05234567
06345678

Split into rows

Entries in the phones column contain strings with multiple phone numbers separated by line break. To make individual phone numbers accessible for further processing/analysis we need to extract phone numbers into individual rows of the DataFrame. We can do that in two steps in pandas:

  1. Split each entry in the phones column into a list of strings using Series.str.split.
  2. Distribute list elements to individual rows using Series.explode.
The diagram visualizes the two-step process of extracting individual text lines from multiline text cells in pandas.
Two-step process splitting multiline strings into individual rows in pandas.

Series.str.split splits each string in a Series on a given separator and returns a Series of lists.

Series.explode transforms list items in a Series of lists (or list-like objects) into separate entries with index labels replicated.

When we perform these steps on the phones column we obtain a Series of individual phone numbers with index labels carried over from source entries.

df.phones.str.split('\n').explode().astype('string')
pandas.Series
id 
101234567
102345678
103456789
204123456
205234567
206345678

At this point we can join the resulting Series to our initial DataFrame. Instead, we can arrive at the same result in a single expression.

df.assign(
    phones=df.phones.str.split('\n')
).explode('phones').astype({'phones': 'string'})
pandas.DataFrame
 namephones
id  
1Russel01234567
1Russel02345678
1Russel03456789
2Patrick04123456
2Patrick05234567
2Patrick06345678

First, we use DataFrame.assign to overwrite the initial phones column with the Series returned from the split operation.

Then we call explode('phones') on the resulting DataFrame. DataFrame.explode works like Series.explode yet expects a column to operate on. Lastly, we change the datatype of the new phones column to string.

After this operation the index id is no longer unique.

Split into columns

Now we are going to look at the case where the phones column contains additional information about the type of a phone number.

csv = io.StringIO('\n'.join([
    'id,name,phones',
    '1,Russel,"mobil: 01234567\nwork: 02345678"',
    '2,Patrick,"mobil: 04123456\nprivate: 05234567"',
]))

df = pd.read_csv(
    csv,
    index_col='id',
    dtype={'name': 'string', 'phones': 'string'},
)
df
pandas.DataFrame
 namephones
id  
1Russelmobil: 01234567
work: 02345678
2Patrickmobil: 04123456
private: 05234567

As in the previous case we start out by splitting on linebreak and exploding to individual entries.

df.phones.str.split('\n').explode().astype('string')
pandas.Series
id 
1mobil: 01234567
1work: 02345678
2mobil: 04123456
2private: 05234567

We need to append a second split operation, this time on the ‘:’ character combination.

In contrast to the previous split we like the resulting fragments expanded into separate columns in the resulting DataFrame.

(
    df.phones
    .str.split('\n').explode()
    .str.split(': ', expand=True)
    .astype('string')
    .rename(columns={
        0: 'phone_type',
        1: 'phone_number',
    })
)
pandas.DataFrame
 phone_typephone_number
id  
1mobil01234567
1work02345678
2mobil04123456
2private05234567

When using expand=True, the split elements are expanded into separate columns. Thus, the resulting object is a DataFrame. If NaN is present, it is propagated throughout the columns during the split.

So far, we have transformed the phones Series into a DataFrame with columns phone_type and phone_number.

In the following code snippet we combine the processing of the phones column with a join to the initial DataFrame in one expression. Lastly, we drop the then redundant phones column from the resulting DataFrame.

df.join(
    df.phones
    .str.split('\n').explode()
    .str.split(': ', expand=True)
    .rename(columns={
        0: 'phone_type',
        1: 'phone_number',
    })
    .astype({
        'phone_type': 'category',
        'phone_number': 'string',
    })
).drop('phones', axis='columns')
pandas.DataFrame
 namephone_typephone_number
id   
1Russelmobil01234567
1Russelwork02345678
2Patrickmobil04123456
2Patrickprivate05234567

Reshaping to wide format

We can reshape the resulting DataFrame to wide format using the following idiom:

DataFrame.set_index('COLUMN', append=True).unstack('COLUMN')

DataFrame.set_index sets the DataFrame index from existing columns. By specifying append=True the column is added to the existing index instead of replacing it. Thus, the returned DataFrame contains a hierarchical index composed of the initial index extended by the specified column.

DataFrame.unstack moves a level from the hierarchical (row) index into a level of the column index.

(
    df.phones
    .str.split('\n').explode()
    .str.split(': ', expand=True)
    .astype('string')
    .rename(columns={
        0: 'phone_type',
        1: 'phone_number',
    })
    .set_index('phone_type', append=True)
    .unstack('phone_type')
    .droplevel(0, axis='columns')
)
pandas.DataFrame
phone_typemobilprivatework
id   
101234567<NA>02345678
20412345605234567<NA>

Now, we can join the reshaped DataFrame to the main data and drop the now redundant phones column.

df.join(
    df.phones
    .str.split('\n').explode()
    .str.split(': ', expand=True)
    .astype('string')
    .rename(columns={
        0: 'phone_type',
        1: 'phone_number',
    })
    .set_index(['phone_type'], append=True)
    .unstack('phone_type').droplevel(0, axis='columns')
).drop('phones', axis='columns')
pandas.DataFrame
 namemobilprivatework
id    
1Russel01234567<NA>02345678
2Patrick0412345605234567<NA>

Data cleaning using regular expressions

We can arrive at the same result as in the previous case with fewer lines of code by leveraging regular expressions in combination with Series.str.extract.

df.join(
    df.phones
    .str.split('\n').explode().astype('string')
    .str.extract(r'^(?P<phone_type>\w+):\s(?P<phone_number>\d+)$')
    .set_index(['phone_type'], append=True)
    .unstack('phone_type').droplevel(0, axis='columns')
).drop('phones', axis='columns')
pandas.DataFrame
 namemobilprivatework
id    
1Russel01234567<NA>02345678
2Patrick0412345605234567<NA>

Let us review the regular expression in detail.

Syntax elementInterpretation
^anchors start of string
(?P<phone_type>\w+)capture group labeled “phone_type”
:\smatches separator, not captured
(?P<phone_number>\d+)capture group labeled “phone_number”
$anchors end of string

Named regular expression capture groups have the following structure in Python:
(?P<NAME>PATTERN).