pydash - a Swiss Army knife for Python

Posted in Python, Data Processing on September 4, 2022 by tzelleke ‐ 12 min read

With more than 50K stars on GitHub, Lodash continues to be one of the most popular JavaScript libraries. Lodash provides numerous methods covering the most common needs when handling data in JavaScript. pydash is a Lodash port to Python that allows Python developers to write expressive code and reap similar productivity gains.

Let us start exploring pydash's API by answering a few questions about this dataset of Olympic men's 100m medalists.

import pandas as pd

from pydash import py_

SOURCE_URL = 'https://raw.githubusercontent.com/bokeh/bokeh/branch-3.0/src/bokeh/sampledata/_data/sprint.csv'

ol_mens_100 = pd.read_csv(
    SOURCE_URL,
    skipinitialspace=True,
    escapechar='\\'
).to_dict(orient='records')

We use pandas to download and parse the dataset. Then we export the DataFrame to a Python list of dicts using the option orient='records'. We also import the py_ object from pydash which corresponds to the _ object in Lodash and provides access to pydash's arsenal of methods.

What is the temporal range of this dataset?

py_.over([min, max])(
    py_.pluck(ol_mens_100, 'Year')
)

[1896, 2012]

over creates a function that invokes all provided functions (here, Python's built-in min and max functions) with the arguments it receives and returns their results as a list.

pluck retrieves the value of a specified property/key from all elements in a collection.
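For readers new to these helpers, the following plain-Python sketch shows roughly what over and pluck do. The names pluck_sketch and over_sketch are hypothetical and cover only the simple case used here; pydash's own functions accept more input shapes. The three sample records are copied from results shown in this post and are not the full dataset.

```python
def pluck_sketch(collection, key):
    """Collect the value stored under `key` from every record."""
    return [record[key] for record in collection]


def over_sketch(funcs):
    """Build a function that calls every func with the same arguments."""
    def apply_all(*args, **kwargs):
        return [func(*args, **kwargs) for func in funcs]
    return apply_all


# Three records copied from outputs in this post (not the full dataset).
sample = [
    {'Name': 'Usain Bolt', 'Year': 2012},
    {'Name': 'Francis Obikwelu', 'Year': 2004},
    {'Name': 'Thomas Burke', 'Year': 1896},
]

year_range = over_sketch([min, max])(pluck_sketch(sample, 'Year'))
print(year_range)  # [1896, 2012]
```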

Who was the first Olympic champion?

(
    py_(ol_mens_100)
    .filter({'Medal': 'GOLD'})
    .min_by('Year')
    .value()
)

{'Name': 'Thomas Burke',
 'Country': 'USA',
 'Medal': 'GOLD',
 'Time': 12.0,
 'Year': 1896}

Here, we encounter a powerful feature of pydash: method chaining. Each chained method receives its first parameter from the chain, that is, the result of the preceding step. Evaluation of the method chain is deferred (lazy) until .value() is called.

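filter uses the matches shorthand here: passing a dict keeps only the elements whose properties equal the given values. Its opposite, reject, is also available and drops those elements instead.

The next snippet uses over again, this time to fetch the records with the latest and the earliest year in a single call.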

py_.over([
    py_().max_by('Year'),
    py_().min_by('Year'),
])(ol_mens_100)

[{'Name': 'Usain Bolt',
  'Country': 'JAM',
  'Medal': 'GOLD',
  'Time': 9.63,
  'Year': 2012},
 {'Name': 'Thomas Burke',
  'Country': 'USA',
  'Medal': 'GOLD',
  'Time': 12.0,
  'Year': 1896}]
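The matches shorthand passed to filter earlier can be read as an ordinary predicate. The sketch below is a plain-Python approximation under that reading; the helper name matches_sketch is hypothetical, and the sample rows are copied from outputs in this post.

```python
def matches_sketch(criteria):
    """Return a predicate that is true when a record has all given key/value pairs."""
    def predicate(record):
        return all(
            key in record and record[key] == value
            for key, value in criteria.items()
        )
    return predicate


# Rows copied from outputs in this post (not the full dataset).
sample = [
    {'Name': 'Usain Bolt', 'Medal': 'GOLD', 'Year': 2012},
    {'Name': 'Francis Obikwelu', 'Medal': 'SILVER', 'Year': 2004},
    {'Name': 'Thomas Burke', 'Medal': 'GOLD', 'Year': 1896},
]

is_gold = matches_sketch({'Medal': 'GOLD'})

gold = [r for r in sample if is_gold(r)]          # what filter would keep
not_gold = [r for r in sample if not is_gold(r)]  # what reject would keep

first_champion = min(gold, key=lambda r: r['Year'])
print(first_champion['Name'])  # Thomas Burke
```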

Who was a medalist at multiple Olympics?

(
    py_(ol_mens_100)
    .group_by('Name')
    .omit_by(lambda group: len(group) == 1)
    .keys()
    .value()
)

['Usain Bolt',
 'Justin Gatlin',
 'Maurice Greene',
 'Ato Boldon',
 'Frankie Fredericks',
 'Linford Christie',
 'Carl Lewis',
 'Valery Borzov',
 'Lennox Miller',
 'Ralph Metcalfe',
 'Charles "Archie" Hahn']

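omit_by removes the entries for which the predicate returns true; its counterpart pick_by keeps exactly those entries instead. Similarly, replacing keys with values would return the grouped records rather than the names.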
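To make the grouping step concrete, here is a plain-Python sketch of grouping by name and then omitting or picking groups by size. It is only an approximation of what the chain does, and the four sample rows are hand-written for illustration rather than read from the dataset.

```python
from collections import defaultdict

# Hand-written sample rows for illustration (not the full dataset).
sample = [
    {'Name': 'Usain Bolt', 'Year': 2012},
    {'Name': 'Usain Bolt', 'Year': 2008},
    {'Name': 'Francis Obikwelu', 'Year': 2004},
    {'Name': 'Thomas Burke', 'Year': 1896},
]

# group_by('Name'): map each name to the list of its records.
groups = defaultdict(list)
for record in sample:
    groups[record['Name']].append(record)

# omit_by drops the groups matching the predicate, pick_by keeps them.
omitted = {name: g for name, g in groups.items() if not len(g) == 1}
picked = {name: g for name, g in groups.items() if len(g) == 1}

print(list(omitted.keys()))  # ['Usain Bolt']
print(list(picked.keys()))   # ['Francis Obikwelu', 'Thomas Burke']
```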

What is the medal count by country?

Let us sort the countries by their medal count in descending order and keep the top three.

(
    py_(ol_mens_100)
    .count_by('Country')
    .map(lambda count, country: dict(Country=country, Count=count))
    .sort_by('Count', reverse=True)
    .take(3)
    .value()
)

[{'Country': 'USA', 'Count': 40},
 {'Country': 'GBR', 'Count': 8},
 {'Country': 'JAM', 'Count': 7}]
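For comparison, the same count, sort and take pattern can be sketched with the standard library's collections.Counter. The country codes below are hand-written for illustration and do not reproduce the dataset's real counts.

```python
from collections import Counter

# Hand-written country codes for illustration, not the dataset's real counts.
countries = ['USA', 'JAM', 'USA', 'GBR', 'USA', 'JAM']

# Counter plays the role of count_by; most_common sorts descending and truncates.
counts = Counter(countries)

top_two = [
    {'Country': country, 'Count': count}
    for country, count in counts.most_common(2)
]
print(top_two)
# [{'Country': 'USA', 'Count': 3}, {'Country': 'JAM', 'Count': 2}]
```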

Which are the latest countries to have spawned an Olympic medalist?

(
    py_(ol_mens_100)
    .group_by('Country')
    .flat_map(lambda group: py_.min_by(group, 'Year'))
    .sort_by('Year', reverse=True)
    .take(3)
    .value()
)

[{'Name': 'Francis Obikwelu',
  'Country': 'POR',
  'Medal': 'SILVER',
  'Time': 9.86,
  'Year': 2004},
 {'Name': 'Obadele Thompson',
  'Country': 'BAR',
  'Medal': 'BRONZE',
  'Time': 10.04,
  'Year': 2000},
 {'Name': 'Frankie Fredericks',
  'Country': 'NAM',
  'Medal': 'SILVER',
  'Time': 10.02,
  'Year': 1992}]

Thus far, we have used pydash for searching and manipulating collections of data. In the following section, we are going to compose higher-order functionality to operate on pandas DataFrames.

Using pydash with pandas

SOURCE_URL = 'https://raw.githubusercontent.com/datahq/dataflows/master/data/academy.csv'

oscars = pd.read_csv(
    SOURCE_URL,
    dtype=dict(Winner='boolean'),
    usecols=['Year', 'Winner', 'Award', 'Name']
).rename(columns=py_.slugify)
oscars.sample()

pandas.DataFrame
	year	award	winner	name
7860	1998	Directing	<NA>	The Thin Red Line

We can filter this DataFrame on multi-column criteria using the following syntax.

oscars.loc[
    oscars.winner
    & (oscars.year == '2001')
    & (oscars.award == 'Best Picture')
]

pandas.DataFrame
	year	award	winner	name
8237	2001	Best Picture	True	A Beautiful Mind

This is fine for interactive data exploration. But what if we would like to expose this filter functionality in a configurable manner, for instance by consuming filter conditions that originate from an API request? Since loc also accepts a callable, we can use pydash to compose a function that flexibly filters any DataFrame.

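def _filter(*filters):
    # Turn each (column, method, param) triple into a deferred chain that
    # selects the column and calls the named method on it, e.g. .gt('2010').
    filters = [
        py_().get(col).invoke(method, param)
        for col, method, param in filters
    ]

    return (
        py_()
        # Apply every deferred chain to the DataFrame: one boolean mask each.
        .thru(py_.over(filters))
        # Combine the masks with a logical AND.
        .reduce(lambda s1, s2: s1 & s2)
    )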

Here, we encounter yet another feature of pydash: method chaining with late value passing. When we initialize a method chain without a root value, i.e. py_().<method chain>, we obtain a callable instance of the method chain. Thus, late value passing allows for the creation of ad-hoc functions via the chaining syntax, which can be reused with different root values.
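As a mental model only, late value passing can be imitated with a small class that records steps and runs them once a root value arrives. The LazyChain class and the take helper below are hypothetical toys written for this illustration; they are not how pydash implements its chains.

```python
class LazyChain:
    """Toy model of a chain without a root value (not pydash's implementation)."""

    def __init__(self, steps=()):
        self._steps = tuple(steps)

    def pipe(self, func, *args):
        # Record the step instead of running it and return a new chain.
        return LazyChain(self._steps + ((func, args),))

    def __call__(self, root):
        # The root value arrives late: only now are the recorded steps run.
        value = root
        for func, args in self._steps:
            value = func(value, *args)
        return value


def take(items, n):
    """Return the first n items as a list."""
    return list(items)[:n]


# Build the function once, then reuse it with different root values.
two_smallest = LazyChain().pipe(sorted).pipe(take, 2)

print(two_smallest([2012, 1896, 2004]))  # [1896, 2004]
print(two_smallest((9.63, 12.0, 9.86)))  # [9.63, 9.86]
```

The _filter function defined earlier relies on the same idea, with pydash's chain doing the bookkeeping.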

Let us test it out.

oscars.loc[_filter(
    ('winner', 'eq', True),
    ('year', 'gt', '2010'),
    ('award', 'eq', 'Best Picture'),
)]

pandas.DataFrame
	year	award	winner	name
9409	2011	Best Picture	True	The Artist
9532	2012	Best Picture	True	Argo
9665	2013	Best Picture	True	12 Years a Slave
9787	2014	Best Picture	True	Birdman or (The Unexpected Virtue of Ignorance)
9920	2015	Best Picture	True	Spotlight
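To see what the composed filter computes, here is a rough pandas-only equivalent built with functools.reduce and operator.and_. The helper name make_filter is hypothetical, and the four-row DataFrame is assembled by hand from rows shown in this post rather than loaded from the source file.

```python
from functools import reduce
import operator

import pandas as pd


def make_filter(*conditions):
    """Return a callable for DataFrame.loc built from (column, method, param) triples."""
    def mask(df):
        # One boolean mask per condition, e.g. df['year'].gt('2010').
        masks = [
            getattr(df[col], method)(param)
            for col, method, param in conditions
        ]
        # Combine all masks with a logical AND.
        return reduce(operator.and_, masks)
    return mask


# Hand-built sample using rows shown in this post (not the full dataset).
oscars = pd.DataFrame({
    'year': ['1998', '2001', '2011', '2012'],
    'award': ['Directing', 'Best Picture', 'Best Picture', 'Best Picture'],
    'winner': pd.array([None, True, True, True], dtype='boolean'),
    'name': ['The Thin Red Line', 'A Beautiful Mind', 'The Artist', 'Argo'],
})

result = oscars.loc[make_filter(
    ('winner', 'eq', True),
    ('year', 'gt', '2010'),
    ('award', 'eq', 'Best Picture'),
)]
print(result['name'].tolist())  # ['The Artist', 'Argo']
```

As in the original example, year is compared as a string, so 'gt' performs a lexicographic comparison.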
