pydash - a Swiss Army knife for Python

Posted in Python, Data Processing on September 4, 2022 by tzelleke ‐ 12 min read

With more than 50K stars on GitHub, Lodash continues to be one of the most popular JavaScript libraries. Lodash provides numerous methods covering the most common needs when handling data in JavaScript. pydash is a port of Lodash to Python that lets Python developers write expressive code and reap similar productivity gains.

Let us start exploring pydash's API by answering a few questions about this dataset of Olympic men's 100m medalists.

import pandas as pd

from pydash import py_

SOURCE_URL = 'https://raw.githubusercontent.com/bokeh/bokeh/branch-3.0/src/bokeh/sampledata/_data/sprint.csv'

ol_mens_100 = pd.read_csv(
    SOURCE_URL,
    skipinitialspace=True,
    escapechar='\\'
).to_dict(orient='records')

We use pandas to download and parse the dataset. Then we export the DataFrame to a Python list of dicts using the option orient='records'. We also import the py_ object from pydash which corresponds to the _ object in Lodash and provides access to pydash's arsenal of methods.
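
To make the "list of dicts" shape concrete, here is a small standard-library sketch that produces records of the same form from CSV text. The sample text is made up for illustration and is not taken from the real file. Unlike pandas, the csv module does not infer numeric types, so every value stays a string.

```python
import csv
import io

# Hypothetical CSV text with a space after each comma.
sample = 'Name, Country, Year\nThomas Burke, USA, 1896\n'

# skipinitialspace=True drops the whitespace that follows each delimiter,
# both in the header row and in the data rows.
records = list(csv.DictReader(io.StringIO(sample), skipinitialspace=True))

print(records)
# [{'Name': 'Thomas Burke', 'Country': 'USA', 'Year': '1896'}]
```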

What is the temporal range of this dataset?

py_.over([min, max])(
    py_.pluck(ol_mens_100, 'Year')
)
[1896, 2012]

over creates a function that invokes all provided functions – here, Python's built-in min and max functions – with the arguments it receives and returns their results.

pluck retrieves the value of a specified property/key from all elements in a collection.
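
For readers new to these helpers, the following is a rough plain-Python sketch of what over and pluck compute. The names over_sketch and pluck_sketch and the three sample rows are made up for illustration; this is not how pydash implements them.

```python
def over_sketch(funcs):
    """Return a function that calls every func with the same arguments."""
    def call_all(*args, **kwargs):
        return [func(*args, **kwargs) for func in funcs]
    return call_all


def pluck_sketch(collection, key):
    """Collect the value stored under `key` from every record."""
    return [record[key] for record in collection]


# Hypothetical sample rows, not taken from the real dataset.
rows = [{'Year': 1900}, {'Year': 1896}, {'Year': 1904}]

year_range = over_sketch([min, max])(pluck_sketch(rows, 'Year'))
print(year_range)  # [1896, 1904]
```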

Who was the first Olympic champion?

(
    py_(ol_mens_100)
    .filter({'Medal': 'GOLD'})
    .min_by('Year')
    .value()
)
{'Name': 'Thomas Burke',
 'Country': 'USA',
 'Medal': 'GOLD',
 'Time': 12.0,
 'Year': 1896}
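
The dict passed to filter acts as a shorthand predicate. A minimal plain-Python sketch of that idea follows, assuming a shallow equality check only. The helper matches_sketch and the sample rows are hypothetical and are not part of pydash.

```python
def matches_sketch(source):
    """Build a predicate that is true when a record contains all items of source."""
    def predicate(record):
        return all(record.get(key) == value for key, value in source.items())
    return predicate


# Hypothetical sample rows, not taken from the real dataset.
rows = [
    {'Name': 'A', 'Medal': 'GOLD', 'Year': 1900},
    {'Name': 'B', 'Medal': 'SILVER', 'Year': 1896},
    {'Name': 'C', 'Medal': 'GOLD', 'Year': 1896},
]

is_gold = matches_sketch({'Medal': 'GOLD'})

gold = [row for row in rows if is_gold(row)]          # what filter keeps
not_gold = [row for row in rows if not is_gold(row)]  # what reject keeps
first_gold = min(gold, key=lambda row: row['Year'])   # the role of min_by('Year')

print(first_gold['Name'])  # C
```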

Here, we encounter a powerful feature of pydash, method chaining. Chained methods get their first parameter via the chain. Evaluation of the method chain is deferred (lazy) until .value() is called.

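  • filter uses the matches shorthand here: the dict {'Medal': 'GOLD'} is turned into a predicate that keeps records with matching values
  • the opposite method, reject, is also available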
py_.over([
    py_().max_by('Year'),
    py_().min_by('Year'),
])(ol_mens_100)
[{'Name': 'Usain Bolt',
  'Country': 'JAM',
  'Medal': 'GOLD',
  'Time': 9.63,
  'Year': 2012},
 {'Name': 'Thomas Burke',
  'Country': 'USA',
  'Medal': 'GOLD',
  'Time': 12.0,
  'Year': 1896}]
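
The deferred evaluation of a method chain mentioned earlier can be pictured with a toy class. This is only a sketch of the general idea: LazyChain, then, keep_even and take are invented names, and pydash's real chaining machinery is more elaborate.

```python
class LazyChain:
    """Toy chain: records steps and runs them only when value() is called."""

    def __init__(self, root, steps=()):
        self._root = root
        self._steps = tuple(steps)

    def then(self, func, *args):
        # Return a new chain with one more pending step; nothing runs yet.
        return LazyChain(self._root, self._steps + ((func, args),))

    def value(self):
        result = self._root
        for func, args in self._steps:
            # The chained value is passed as the first argument of each step.
            result = func(result, *args)
        return result


calls = []  # records which steps have actually executed


def keep_even(numbers):
    calls.append('keep_even')
    return [n for n in numbers if n % 2 == 0]


def take(numbers, count):
    calls.append('take')
    return numbers[:count]


chain = LazyChain([1, 2, 3, 4, 5, 6]).then(keep_even).then(take, 2)
pending = list(calls)   # still empty: building the chain ran nothing
result = chain.value()  # both steps run now

print(pending, result)  # [] [2, 4]
```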

Who was a medalist at multiple Olympics?

(
    py_(ol_mens_100)
    .group_by('Name')
    .omit_by(lambda group: len(group) == 1)
    .keys()
    .value()
)
['Usain Bolt',
 'Justin Gatlin',
 'Maurice Greene',
 'Ato Boldon',
 'Frankie Fredericks',
 'Linford Christie',
 'Carl Lewis',
 'Valery Borzov',
 'Lennox Miller',
 'Ralph Metcalfe',
 'Charles "Archie" Hahn']
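  • pick_by is the counterpart of omit_by: it keeps the entries that satisfy the predicate instead of dropping them
  • swapping keys for values returns the grouped records instead of the athlete names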
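
As a cross-check of what this chain computes, here is a rough standard-library equivalent. The sample rows are invented, and the comments only map each step loosely onto the pydash method it mirrors.

```python
from collections import defaultdict

# Hypothetical sample rows, not taken from the real dataset.
rows = [
    {'Name': 'A', 'Year': 1996},
    {'Name': 'B', 'Year': 1996},
    {'Name': 'A', 'Year': 2000},
]

# Roughly group_by('Name'): map each name to its list of records.
groups = defaultdict(list)
for row in rows:
    groups[row['Name']].append(row)

# Roughly omit_by(len == 1): drop the single-medal groups.
repeat = {name: group for name, group in groups.items() if len(group) > 1}

# Roughly pick_by with the same predicate: keep only the single-medal groups.
single = {name: group for name, group in groups.items() if len(group) == 1}

names = list(repeat.keys())      # the athlete names
records = list(repeat.values())  # the grouped records

print(names)  # ['A']
```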

What is the medal count by country?

Let us sort the countries by their medal count in descending order.

(
    py_(ol_mens_100)
    .count_by('Country')
    .map(lambda count, country: dict(Country=country, Count=count))
    .sort_by('Count', reverse=True)
    .take(3)
    .value()
)
[{'Country': 'USA', 'Count': 40},
 {'Country': 'GBR', 'Count': 8},
 {'Country': 'JAM', 'Count': 7}]
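
For comparison, the same count-then-sort pattern can be sketched with collections.Counter from the standard library. The rows below are invented, so the counts differ from the real result.

```python
from collections import Counter

# Hypothetical sample rows, not taken from the real dataset.
rows = [
    {'Country': 'USA'}, {'Country': 'JAM'}, {'Country': 'USA'},
    {'Country': 'GBR'}, {'Country': 'USA'}, {'Country': 'JAM'},
]

# Count records per country, then take the two largest counts.
counts = Counter(row['Country'] for row in rows)
top = [
    {'Country': country, 'Count': count}
    for country, count in counts.most_common(2)
]

print(top)
# [{'Country': 'USA', 'Count': 3}, {'Country': 'JAM', 'Count': 2}]
```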

Which countries most recently produced their first Olympic medalist?

(
    py_(ol_mens_100)
    .group_by('Country')
    .flat_map(lambda group: py_.min_by(group, 'Year'))
    .sort_by('Year', reverse=True)
    .take(3)
    .value()
)
[{'Name': 'Francis Obikwelu',
  'Country': 'POR',
  'Medal': 'SILVER',
  'Time': 9.86,
  'Year': 2004},
 {'Name': 'Obadele Thompson',
  'Country': 'BAR',
  'Medal': 'BRONZE',
  'Time': 10.04,
  'Year': 2000},
 {'Name': 'Frankie Fredericks',
  'Country': 'NAM',
  'Medal': 'SILVER',
  'Time': 10.02,
  'Year': 1992}]
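
The logic of this chain is: find each country's earliest medal, then sort those records from newest to oldest. A plain-Python sketch of the same steps, using invented rows, might look like this.

```python
# Hypothetical sample rows, not taken from the real dataset.
rows = [
    {'Country': 'USA', 'Year': 1896},
    {'Country': 'USA', 'Year': 1900},
    {'Country': 'JAM', 'Year': 1948},
    {'Country': 'NAM', 'Year': 1992},
    {'Country': 'JAM', 'Year': 2008},
]

# Keep the earliest record seen for each country.
first_by_country = {}
for row in rows:
    best = first_by_country.get(row['Country'])
    if best is None or row['Year'] < best['Year']:
        first_by_country[row['Country']] = row

# Sort the first medals from most recent to oldest and keep the top two.
latest = sorted(
    first_by_country.values(),
    key=lambda row: row['Year'],
    reverse=True,
)[:2]

print([row['Country'] for row in latest])  # ['NAM', 'JAM']
```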

Thus far, we have used pydash for searching and manipulating collections of data. In the following section, we compose higher-order functions that operate on pandas DataFrames.

Using pydash with pandas

SOURCE_URL = 'https://raw.githubusercontent.com/datahq/dataflows/master/data/academy.csv'

oscars = pd.read_csv(
    SOURCE_URL,
    dtype=dict(Winner='boolean'),
    usecols=['Year', 'Winner', 'Award', 'Name']
).rename(columns=py_.slugify)
oscars.sample()
      year  award      winner  name
7860  1998  Directing  <NA>    The Thin Red Line

We can filter this DataFrame on multi-column criteria using the following syntax.

oscars.loc[
    oscars.winner
    & (oscars.year == '2001')
    & (oscars.award == 'Best Picture')
]
      year  award         winner  name
8237  2001  Best Picture  True    A Beautiful Mind

This is fine for interactive data exploration. But what if we want to expose this filter functionality in a configurable manner, for instance by consuming filter conditions that originate from an API request? Since loc also accepts a callable, we can use pydash to compose a function that flexibly filters any DataFrame.

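def _filter(*filters):
    # Turn each (column, method, param) triple into a callable chain that,
    # given a DataFrame, looks up the column and calls the named method.
    filters = [
        py_().get(col).invoke(method, param)
        for col, method, param in filters
    ]

    # Apply every condition to the DataFrame, then AND the resulting
    # boolean Series into a single mask.
    return (
        py_()
        .thru(py_.over(filters))
        .reduce(lambda s1, s2: s1 & s2)
    )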

Here, we encounter yet another pydash feature: method chaining with late value passing. When we initialize a method chain without a root value, i.e., py_().<method chain>, we obtain a callable instance of the method chain. Late value passing thus allows us to create ad-hoc functions via the chaining syntax and reuse them with different root values.
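
The idea of supplying the root value late can be sketched in plain Python as ordinary function composition. The helper name pipeline is made up for this illustration; it only mimics the behaviour described above and is not pydash code.

```python
from functools import reduce


def pipeline(*steps):
    """Compose steps left to right into one reusable function."""
    def run(root):
        # Feed the root through each step in order.
        return reduce(lambda value, step: step(value), steps, root)
    return run


# No root value yet: we only describe the steps.
first_two_sorted = pipeline(sorted, lambda items: items[:2])

# The root value is supplied at call time, and the function is reusable.
from_list = first_two_sorted([3, 1, 2])
from_text = first_two_sorted('pydash')

print(from_list, from_text)  # [1, 2] ['a', 'd']
```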

Let us test it out.

oscars.loc[_filter(
    ('winner', 'eq', True),
    ('year', 'gt', '2010'),
    ('award', 'eq', 'Best Picture'),
)]
      year  award         winner  name
9409  2011  Best Picture  True    The Artist
9532  2012  Best Picture  True    Argo
9665  2013  Best Picture  True    12 Years a Slave
9787  2014  Best Picture  True    Birdman or (The Unexpected Virtue of Ignorance)
9920  2015  Best Picture  True    Spotlight
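
If the filter conditions arrive from an API request, they need to be turned into (column, method, param) triples first. The sketch below assumes a hypothetical JSON payload made of {"col", "op", "value"} objects. The payload shape, the parse_filters helper and both allow-lists are assumptions for illustration. Validating against allow-lists seems prudent here, because the method name is later invoked on the column as-is.

```python
import json

# Assumed allow-lists: comparison methods and the columns of the oscars frame.
ALLOWED_METHODS = {'eq', 'ne', 'gt', 'ge', 'lt', 'le'}
ALLOWED_COLUMNS = {'year', 'award', 'winner', 'name'}


def parse_filters(payload):
    """Turn a JSON list of filter objects into (col, method, param) tuples."""
    conditions = []
    for item in json.loads(payload):
        col, method, param = item['col'], item['op'], item['value']
        if col not in ALLOWED_COLUMNS or method not in ALLOWED_METHODS:
            raise ValueError(f'unsupported filter: {col!r} {method!r}')
        conditions.append((col, method, param))
    return conditions


# Hypothetical request body.
payload = (
    '[{"col": "winner", "op": "eq", "value": true},'
    ' {"col": "year", "op": "gt", "value": "2010"}]'
)
conditions = parse_filters(payload)

print(conditions)
# [('winner', 'eq', True), ('year', 'gt', '2010')]
```

The resulting list could then be unpacked into the filter function, for example oscars.loc[_filter(*conditions)].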