
Update: Okay, after trying to use this for a while, I think it's probably a bad idea. Please use (lambda x: x["a"] + x["b"])(df) if really necessary, or use df.assign(c=lambda x: x["a"] + x["b"]) (with CoW enabled, for performance reasons), which supports chaining!
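
For example, a chained version of what this update recommends could look like the following sketch (assuming pandas 2.x, where Copy-on-Write can be enabled globally; the column names are just illustrative):

import pandas as pd

pd.options.mode.copy_on_write = True  # pandas >= 2.0: assign no longer pays for defensive copies

df = (pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
        .assign(c=lambda x: x["a"] + x["b"])   # derived column...
        .assign(d=lambda x: x["c"] * 2))       # ...usable in the next chained step
print(df)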

I've written a syntactic-sugar hack that makes it easier to create and temporarily use derived columns of a DataFrame by applying a function to its columns, and I welcome any comments. Here is the code:

import pandas as pd
from typing import Callable

def pda(df: pd.DataFrame, f: Callable, numpy: bool = True):
    return (f(*(df[col].values for col in f.__code__.co_varnames)) if numpy
            else f(*(df[col] for col in f.__code__.co_varnames)))

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
df["c"] = pda(df, lambda a, b: a + b)
print(df)

This results in:

   a  b  c
0  1  2  3
1  2  3  5
2  3  4  7
3  4  5  9

Advantages:

  • Python prettifying and syntax highlighting on function code (as compared to df["c"] = df.eval("a + b"))
  • No need to repeat DataFrame variable name (as compared to df["c"] = df["a"] + df["b"])
  • Possible to create temporary numpy arrays, and probably better performance (as compared to df = df.assign(c=lambda x: x["a"] + x["b"]))
asked Oct 9, 2023 at 10:15
  • So, if we had, let's say, 8 columns, we could use df['i'] = pda(df, lambda _, __, c, ___, ____, f, _____, ______: c + f) instead of df['i'] = df.c + df.f. Is that right? Commented Oct 9, 2023 at 11:07
  • @301_Moved_Permanently nope, you're supposed to just use df['i'] = pda(df, lambda c, f: c + f) Commented Oct 10, 2023 at 3:23
  • Welcome to Code Review! Incorporating advice from an answer into the question violates the question-and-answer nature of this site. You could post improved code as a new question, as an answer, or as a link to an external site, as described in "I improved my code based on the reviews. What next?". I have rolled back the edit, so the answers make sense again. Commented Oct 10, 2023 at 6:55
  • @TobySpeight got it, I just didn't want people to pick up my slightly incorrect version of the code and use it. Commented Oct 10, 2023 at 8:18

3 Answers


Starting broadly: this relies on reflection, which is not unheard of in the data-analytics ecosystem (see e.g. curve_fit's use of argspec), so it wouldn't be entirely without precedent. Still, it is not very Pythonic (PEP 20: "explicit is better than implicit"): it relies very much on magical, implicit behaviour, and for that reason alone it isn't a wonderful idea.

"Python prettifying and syntax highlighting" is less important than the related, but fairly different, concern of static analysis. Your approach is only better in terms of static analysis if you jettison the lambda and write an actual function with good type hints; otherwise, it's only marginally better than eval.

"Possible to create temporary numpy arrays, and probably better performance" is dubious, and I will not place any belief in it unless I see a benchmark.

Crucially, __code__.co_varnames is wrong; read the docs:

tuple of names of arguments and local variables

If you have a local variable whose name matches a column of the DataFrame, you'll attempt to pass it in as well, and then: explosions. Use inspect.signature instead.
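
For illustration, a minimal sketch of pda rebuilt around inspect.signature (the helper and variable names here are mine; the local variable in diff is only there to show what co_varnames would have tripped over):

import inspect
import pandas as pd
from typing import Callable

def pda(df: pd.DataFrame, f: Callable, numpy: bool = True):
    # signature() reports only the parameters of f, never its locals,
    # so a local variable named like a column cannot sneak in.
    names = inspect.signature(f).parameters
    return f(*((df[name].values if numpy else df[name]) for name in names))

def diff(c, a):
    tmp = c - a  # a local: co_varnames would report ('c', 'a', 'tmp')
    return tmp

df = pd.DataFrame({"a": [2, 3, 4, 5], "c": [3, 4, 5, 6]})
df["e"] = pda(df, diff)
print(df)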

A much simpler technique that I think does cross the line into "worth doing, sometimes" relies on the fact that a DataFrame is already a map-like:

import pandas as pd

def add(a: pd.Series, b: pd.Series) -> pd.Series:
    c = a + b
    return c

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
df["c"] = add(**df)
print(df)
answered Oct 9, 2023 at 21:45
  • Thanks for pointing out that problem with __code__.co_varnames. I think it is still guaranteed to start with the argument names in the order they appear, so slicing it works. Commented Oct 10, 2023 at 3:54
  • Your idea of using the map-like property of the DataFrame stops working as soon as the DataFrame has more columns than the function needs; you then have to slice the DataFrame, which means repeating the argument names again. ((lambda a, b: a + b)(**{"a": 10, "b": 20, "c": 30}) throws an error.) Commented Oct 10, 2023 at 4:01
  • @user1537366 that's deliberate, but if you don't like it, just add a **kwargs (see the sketch below). Commented Oct 10, 2023 at 11:49
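
For illustration, a small sketch of that **kwargs escape hatch (assuming the extra columns should simply be ignored):

import pandas as pd

def add(a: pd.Series, b: pd.Series, **_ignored) -> pd.Series:
    # **_ignored swallows whatever other columns add(**df) passes in,
    # so the map-like trick keeps working on wider DataFrames.
    return a + b

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4], "extra": [9, 9, 9]})
df["c"] = add(**df)
print(df)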

I agree with @Reinderien.

docstring

pda lacks a docstring, and it absolutely needs one.

Consider using doctest notation at the end of it.

one function or two

def pda( ... , numpy: bool = True):

Thank you for the type hinting.

It's not clear that a "numpy" parameter is a win, here. Consider offering a pair of functions instead, perhaps pda and pda_numpy.
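
A rough sketch of what that pair could look like (using inspect.signature for the parameter lookup, as the other answer suggests):

import inspect
import pandas as pd
from typing import Callable

def pda(df: pd.DataFrame, f: Callable):
    # hand f the pandas Series named by its parameters
    return f(*(df[name] for name in inspect.signature(f).parameters))

def pda_numpy(df: pd.DataFrame, f: Callable):
    # same lookup, but hand f the underlying NumPy arrays
    return f(*(df[name].values for name in inspect.signature(f).parameters))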

conditional

 ... if numpy else ...

Sandwiching an if between large expressions is not helping readability.

Prefer

if numpy:
    return ...
else:
    return ...

Readability might be improved if we DRY this up a bit. Consider assigning df[col].values or df[col] to a temp var, and then working with that.

(Since you're keen on automagic, perhaps use getattr to probe for a "values" attribute, and then we don't need a numpy flag? But it's possible we get a spurious "values" hit. Maybe consult isinstance?)
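
A hedged sketch of the DRY version, keeping the explicit flag rather than a getattr probe (which risks exactly the spurious hit mentioned above):

import pandas as pd
from typing import Callable

def pda(df: pd.DataFrame, f: Callable, numpy: bool = True):
    # fetch each referenced column exactly once...
    columns = [df[name] for name in f.__code__.co_varnames[:f.__code__.co_argcount]]
    # ...and convert in one place instead of two near-duplicate expressions
    if numpy:
        columns = [col.values for col in columns]
    return f(*columns)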

answered Oct 9, 2023 at 23:32
  • Thanks! Edited and incorporated many of your suggestions. Commented Oct 10, 2023 at 3:53

As the original poster, I have revised the code based on the many answers as follows:

  • Added a docstring

  • Used doctest for unit testing

  • Kept the numpy parameter for now, but renamed it to use_numpy for clarity (a separate function would probably be better)

  • Separated the if-else expression into an if/else block

  • Used slicing to extract the part of co_varnames that corresponds to the argument names. The docs seem to imply that this works:

    co_varnames

    Returns a tuple containing the names of the local variables (starting with the argument names).

Using inspect.signature instead of co_varnames causes a performance hit (compare pda2 in the timings below), so I reverted to using co_varnames.

import pandas as pd
from typing import Callable


def pda(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    """Performs a function `f` on columns of DataFrame `df`,
    as NumPy arrays or as Pandas' Series.

    Function `f` will be performed on the columns of `df`
    corresponding to the argument names of `f`.

    Args:
        df (pd.DataFrame): input DataFrame
        f (Callable): function to be performed
        use_numpy (bool, optional, defaults to True): use NumPy arrays instead of Series

    Returns:
        resulting NumPy array if `use_numpy` else resulting Series

    Example:
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df["e"] = pda(df, lambda c, a: c - a)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    if use_numpy:
        return f(*(df[f.__code__.co_varnames[i]].values
                   for i in range(f.__code__.co_argcount)))
    else:
        return f(*(df[f.__code__.co_varnames[i]]
                   for i in range(f.__code__.co_argcount)))


if __name__ == "__main__":
    import doctest
    doctest.testmod()

I also did some timing comparisons between the methods.

#!/usr/bin/env python3
import inspect
import random
from collections import defaultdict
from typing import Callable

import numpy as np
import pandas as pd


def main():
    import doctest
    doctest.testmod()
    import timeit
    df = pd.DataFrame({
        "d": np.random.random(100000),
        "a": np.random.random(100000),
        "c": np.random.random(100000),
        "b": np.random.random(100000)
    })
    tests = [
        test_pda, test_pda_series, test_pda2, test_lambda, test_eval,
        test_index, test_dot, test_assign
    ]
    timings = defaultdict(float)
    for i in range(1000):
        random.shuffle(tests)
        for test in tests:
            timings[test.__name__] += timeit.timeit("test(df)",
                                                    number=1,
                                                    globals={
                                                        "test": test,
                                                        "df": df
                                                    })
    for test_name, timing in timings.items():
        print(test_name, timing)


def test_pda(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_pda(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = pda(df, lambda c, a: c - a)
    return df


def test_pda_series(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_pda_series(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = pda(df, lambda c, a: c - a, False)
    return df


def test_pda2(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_pda2(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = pda2(df, lambda c, a: c - a)
    return df


def test_lambda(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_lambda(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = (lambda x: x["c"].values - x["a"].values)(df)
    return df


def test_eval(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_eval(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = df.eval("c - a")
    return df


def test_index(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_index(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = df["c"].values - df["a"].values
    return df


def test_dot(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_dot(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = df.c.values - df.a.values
    return df


def test_assign(df: pd.DataFrame):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_assign(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    return df.assign(e=lambda x: x["c"].values - x["a"].values)


def pda(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    """Performs a function `f` on columns of DataFrame `df`,
    as NumPy arrays or as Pandas' Series.

    Function `f` will be performed on the columns of `df`
    corresponding to the argument names of `f`.

    Args:
        df (pd.DataFrame): input DataFrame
        f (Callable): function to be performed
        use_numpy (bool, optional, defaults to True): use NumPy arrays instead of Series

    Returns:
        resulting NumPy array if `use_numpy` else resulting Series

    Example:
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df["e"] = pda(df, lambda c, a: c - a)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    if use_numpy:
        return f(*(df[f.__code__.co_varnames[i]].values
                   for i in range(f.__code__.co_argcount)))
    else:
        return f(*(df[f.__code__.co_varnames[i]]
                   for i in range(f.__code__.co_argcount)))


def pda2(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    if use_numpy:
        return f(*(df[param.name].values
                   for param in inspect.signature(f).parameters.values()))
    else:
        return f(*(df[param.name]
                   for param in inspect.signature(f).parameters.values()))


if __name__ == "__main__":
    main()

The results for Python 3.11.2, Pandas 2.1.1 and NumPy 1.26.0 show that pda is, surprisingly, on par with the best of the other methods (indexing and attribute access). As expected, .assign performs terribly because it copies the entire DataFrame.

Timings (lower is better):

test_index 0.16944104398862692
test_assign 2.891109986925585
test_pda 0.1570397199393483
test_eval 0.8307543109549442
test_pda2 0.18781333995138993
test_lambda 0.1599503229081165
test_dot 0.16240537503472297
test_pda_series 0.2198283309226099
answered Oct 10, 2023 at 8:16
  • Maybe consider adding an arg_names = f.__code__.co_varnames[:f.__code__.co_argcount] before the if to reduce line length and ease overall comprehension. Commented Oct 10, 2023 at 9:30
  • Instead of the argcount band-aid on what is still the incorrect var name metavariable, you really should just call the better API (inspect.signature) - or, really, not do any of this. Commented Oct 10, 2023 at 14:10
  • @SᴀᴍOnᴇᴌᴀ I've made edits to my post Commented Oct 11, 2023 at 5:56
  • @301_Moved_Permanently I did what you suggested, but I'm slightly concerned adding a new variable might spoil bytecode optimisation Commented Oct 11, 2023 at 6:02
  • @Reinderien it is not the "incorrect var name metavariable". The docs guarantee this. Commented Oct 11, 2023 at 6:04
