Syntactic sugar for derived variables from Pandas DataFrame columns

Question 1

Update: Okay, after trying to use this for a while, I think it's probably a bad idea. Please use `(lambda x: x["a"] + x["b"])(df)` if really necessary or use `df.assign(c=lambda x: x["a"] + x["b"])` (with CoW enabled for performance reasons) which supports chaining!
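A minimal sketch of the two alternatives named in the update, using the same toy frame as the question below. The second derived column `d` is made up for illustration, and Copy-on-Write is deliberately not switched on here because how it is enabled depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})

# Immediately-invoked lambda: the frame is named once and a Series comes back.
total = (lambda x: x["a"] + x["b"])(df)

# assign returns a new DataFrame, so calls can be chained.
# `d` is a hypothetical second derived column that builds on `c`.
result = (
    df.assign(c=lambda x: x["a"] + x["b"])
      .assign(d=lambda x: x["c"] * 2)
)

print(result)
```

Because `assign` returns a new frame, `df` itself still has only the columns `a` and `b` afterwards.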

I've a syntactic sugar hack to make it easier to create and temporarily use derived columns from DataFrames by applying a function on the columns, and I welcome any comments! Here is the code:

import pandas as pd
from typing import Callable


def pda(df: pd.DataFrame, f: Callable, numpy: bool = True):
    return (f(*(df[col].values
                for col in f.__code__.co_varnames)) if numpy else f(
                    *(df[col] for col in f.__code__.co_varnames)))


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
df["c"] = pda(df, lambda a, b: a + b)
print(df)

This results in:

   a  b  c
0  1  2  3
1  2  3  5
2  3  4  7
3  4  5  9

Advantages:

Python prettifying and syntax highlighting on function code (as compared to df["c"] = df.eval("a + b"))
No need to repeat DataFrame variable name (as compared to df["c"] = df["a"] + df["b"])
Possible to create temporary numpy arrays, and probably better performance (as compared to df = df.assign(c=lambda x: x["a"] + x["b"]))

So, if we had, let's say, 8 columns, we could use df['i'] = pda(df, lambda _, __, c, ___, ____, f, _____, ______: c + f) instead of df['i'] = df.c + df.f. Is that right?
@301_Moved_Permanently nope, you're supposed to just use df['i'] = pda(df, lambda c, f: c + f)
Welcome to Code Review! Incorporating advice from an answer into the question violates the question-and-answer nature of this site. You could post improved code as a new question, as an answer, or as a link to an external site, as described in "I improved my code based on the reviews. What next?". I have rolled back the edit, so the answers make sense again.
@TobySpeight got it, I just didn't want people to take my slightly incorrect version of the code to use.


Answer 1 · Reinderien · 2023-10-09 21:45:03Z

Starting broadly: this relies on reflection, which is not unheard of in the data analytics ecosystem (see e.g.: curve_fit's use of argspec). So it wouldn't be entirely without precedent, but it's still in a broad sense not very Pythonic (PEP20's "explicit is better than implicit"). This very much relies on magical, implicit behaviour, and for that reason alone it isn't a wonderful idea.

Python prettifying and syntax highlighting is less important than the related, but fairly different, static analysis. Your approach is only better in terms of static analysis if you jettison the lambda and write an actual function with good typehints; otherwise, it's only marginally better than eval.

Possible to create temporary numpy arrays, and probably better performance is dubious, and I will not place any belief in this unless I see a benchmark.

Crucially, __code__.co_varnames is wrong; read the docs:

tuple of names of arguments and local variables

If you have a local variable defined to be the same name as a column from the dataframe, you'll attempt to pass it in and then explosions. Use inspect.signature instead.
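The difference can be seen with nothing but the standard library. This is a small illustration, not code from the question; the function `diff` and its local variable `b` are made up for the purpose.

```python
import inspect


def diff(c, a):
    # `b` is a local variable, not a parameter; a DataFrame could
    # easily have a column with the same name.
    b = c - a
    return b


# co_varnames lists the parameters first and then every other local name.
all_names = diff.__code__.co_varnames

# inspect.signature reports the parameters only.
sig_names = tuple(inspect.signature(diff).parameters)

print(all_names)  # ('c', 'a', 'b')
print(sig_names)  # ('c', 'a')
```

Iterating over `all_names` as the question's `pda` does would pass three columns to a two-parameter function.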

A much simpler technique that I think does cross the line into "worth doing, sometimes" relies on the fact that a DataFrame is already a map-like:

import pandas as pd


def add(a: pd.Series, b: pd.Series) -> pd.Series:
    c = a + b
    return c


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
df["c"] = add(**df)
print(df)

Thanks for pointing out that problem with __code__.co_varnames. I think it is still guaranteed to start with the argument names in the order it appeared, so slicing it works.
Your idea about using the map-like property of the DataFrame does not work as soon as it has more columns than you need in the function, and then you will need to slice the DataFrame and this requires repeating the argument names again. ((lambda a, b: a + b)(**{"a": 10, "b": 20, "c": 30}) throws an error)
@user1537366 that's deliberate, but if you don't like it, just add a **kwargs.
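The exchange above can be reproduced with a plain dict standing in for the DataFrame, as in the comment's own example. The names `add_strict` and `_ignored` are illustrative only.

```python
def add_strict(a, b):
    return a + b


def add(a, b, **_ignored):
    # **_ignored swallows every key that is not a named parameter.
    return a + b


row = {"a": 10, "b": 20, "c": 30}

print(add(**row))  # 30

try:
    add_strict(**row)
except TypeError as exc:
    # The extra key "c" has nowhere to go.
    print(f"TypeError: {exc}")
```

The strict version fails loudly when the mapping has unexpected keys, which is the deliberate behaviour referred to above; the `**_ignored` version trades that check for convenience.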

Answer 2 · J_H · 2023-10-09 23:32:17Z

I agree with @Reinderien.

docstring

pda lacks a docstring, and it absolutely needs one.

Consider using doctest notation at the end of it.

one function or two

def pda( ... , numpy: bool = True):

Thank you for the type hinting.

It's not clear that a "numpy" parameter is a win, here. Consider offering a pair of functions instead, perhaps pda and pda_numpy.

conditional

 ... if numpy else ...

Sandwiching an if between large expressions is not helping readability.

Prefer

    if numpy:
        return ...
    else:
        return ...

Readability might be improved if we DRY this up a bit. Consider assigning df[col].values or df[col] to a temp var, and then work with that.

(Since you're keen on automagic, perhaps use getattr to probe for a "values" attribute, and then we don't need a numpy flag? But it's possible we get a spurious "values" hit. Maybe consult isinstance?)
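One way the DRY suggestion could look is sketched below. This is not the author's code: `pda_sketch` is a hypothetical name, it keeps an explicit flag instead of probing with getattr or isinstance, and it takes the parameter names from inspect.signature as recommended in the first answer.

```python
import inspect
from typing import Callable

import pandas as pd


def pda_sketch(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    """Call `f` with the columns of `df` named by its parameters."""
    names = inspect.signature(f).parameters  # parameter names, in order
    columns = [df[name] for name in names]
    if use_numpy:
        # Convert once here, so there is a single call to f below.
        columns = [col.values for col in columns]
    return f(*columns)


df = pd.DataFrame({
    "d": [1, 2, 3, 4],
    "a": [2, 3, 4, 5],
    "c": [3, 4, 5, 6],
    "b": [4, 5, 6, 7],
})
df["e"] = pda_sketch(df, lambda c, a: c - a)
print(df)
```

The column lookup is written once, and the flag only decides whether each Series is converted before the call.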

Thanks! Edited and incorporated many of your suggestions. – user1537366, Oct 10, 2023 at 3:53

Answer 3 · user1537366 · 2023-10-10 08:16:34Z

As the original poster, I have revised the code based on the many answers as follows:

Add a docstring
Use the magic doctest for unit testing
Removed the numpy parameter (a separate function would probably be better)
Separate the if-else expression into an if-else block
Renamed the numpy variable to use_numpy for clarity
Use slicing to extract the correct part of co_varnames which correspond to the argument names. The docs seem to imply that this works:

co_varnames

Returns a tuple containing the names of the local variables (starting with the argument names).

Using inspect.signature instead of co_varnames causes a performance hit, so I reverted to using co_varnames.
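The cost of the name lookup can be isolated from the DataFrame work with a rough standard-library sketch like the one below. The helper names are hypothetical and the numbers will vary by machine and Python version, so no output is shown.

```python
import inspect
import timeit


def diff(c, a):
    return c - a


def names_from_code(f):
    # Slice off the leading positional parameter names.
    code = f.__code__
    return code.co_varnames[:code.co_argcount]


def names_from_signature(f):
    return tuple(inspect.signature(f).parameters)


# Both approaches agree for a function with plain positional parameters.
assert names_from_code(diff) == names_from_signature(diff) == ("c", "a")

n = 10_000
t_code = timeit.timeit(lambda: names_from_code(diff), number=n)
t_sig = timeit.timeit(lambda: names_from_signature(diff), number=n)
print(f"co_varnames slice: {t_code / n * 1e6:.2f} microseconds per call")
print(f"inspect.signature: {t_sig / n * 1e6:.2f} microseconds per call")
```

Either lookup is a fixed per-call overhead that does not grow with the number of rows, so its share of the total shrinks as the columns get longer.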

import pandas as pd
from typing import Callable


def pda(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    """Performs a function `f` on columns of DataFrame `df`,
    as NumPy arrays or as Pandas' Series.

    Function `f` will be performed on the columns of `df`
    corresponding to the argument names of `f`.
    Args:
        df (pd.DataFrame): input DataFrame
        f (Callable): function to be performed
        use_numpy (bool, optional, defaults to True): use NumPy arrays instead of Series
    Returns:
        resulting numpy array if `use_numpy` else resulting Series
    Example:
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df["e"] = pda(df, lambda c, a: c - a)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    if use_numpy:
        return f(*(df[f.__code__.co_varnames[i]].values
                   for i in range(f.__code__.co_argcount)))
    else:
        return f(*(df[f.__code__.co_varnames[i]]
                   for i in range(f.__code__.co_argcount)))


if __name__ == "__main__":
    import doctest
    doctest.testmod()

I also did some timing comparisons between the methods.

#!/usr/bin/env python3
import inspect
import random
from collections import defaultdict
from typing import Callable

import numpy as np
import pandas as pd


def main():
    import doctest
    doctest.testmod()
    import timeit
    df = pd.DataFrame({
        "d": np.random.random(100000),
        "a": np.random.random(100000),
        "c": np.random.random(100000),
        "b": np.random.random(100000)
    })
    tests = [
        test_pda, test_pda_series, test_pda2, test_lambda, test_eval,
        test_index, test_dot, test_assign
    ]
    timings = defaultdict(float)
    for i in range(1000):
        random.shuffle(tests)
        for test in tests:
            timings[test.__name__] += timeit.timeit("test(df)",
                                                    number=1,
                                                    globals={
                                                        "test": test,
                                                        "df": df
                                                    })
    for test_name, timing in timings.items():
        print(test_name, timing)


def test_pda(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_pda(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = pda(df, lambda c, a: c - a)
    return df


def test_pda_series(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_pda_series(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = pda(df, lambda c, a: c - a, False)
    return df


def test_pda2(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_pda2(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = pda2(df, lambda c, a: c - a)
    return df


def test_lambda(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_lambda(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = (lambda x: x["c"].values - x["a"].values)(df)
    return df


def test_eval(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_eval(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = df.eval("c - a")
    return df


def test_index(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_index(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = df["c"].values - df["a"].values
    return df


def test_dot(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_dot(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = df.c.values - df.a.values
    return df


def test_assign(df: pd.DataFrame):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_assign(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    return df.assign(e=lambda x: x["c"].values - x["a"].values)


def pda(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    """Performs a function `f` on columns of DataFrame `df`,
    as NumPy arrays or as Pandas' Series.

    Function `f` will be performed on the columns of `df`
    corresponding to the argument names of `f`.
    Args:
        df (pd.DataFrame): input DataFrame
        f (Callable): function to be performed
        use_numpy (bool, optional, defaults to True): use NumPy arrays instead of Series
    Returns:
        resulting numpy array if `use_numpy` else resulting Series
    Example:
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df["e"] = pda(df, lambda c, a: c - a)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    if use_numpy:
        return f(*(df[f.__code__.co_varnames[i]].values
                   for i in range(f.__code__.co_argcount)))
    else:
        return f(*(df[f.__code__.co_varnames[i]]
                   for i in range(f.__code__.co_argcount)))


def pda2(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    if use_numpy:
        return f(*(df[param.name].values
                   for param in inspect.signature(f).parameters.values()))
    else:
        return f(*(df[param.name]
                   for param in inspect.signature(f).parameters.values()))


if __name__ == "__main__":
    main()

The results for Python 3.11.2, Pandas 2.1.1 and NumPy 1.26.0 show that pda is, surprisingly, on par with the best of the other methods (indexing and member access). As expected, .assign has terrible performance because it is copying the entire DataFrame.

Timings (total seconds over 1000 runs of each test; lower is better):

test_index       0.16944104398862692
test_assign      2.891109986925585
test_pda         0.1570397199393483
test_eval        0.8307543109549442
test_pda2        0.18781333995138993
test_lambda      0.1599503229081165
test_dot         0.16240537503472297
test_pda_series  0.2198283309226099

Maybe consider adding an arg_names = f.__code__.co_varnames[:f.__code__.co_argcount] before the if to reduce line length and ease overall comprehension.
Instead of the argcount band-aid on what is still the incorrect var name metavariable, you really should just call the better API (inspect.signature) - or, really, not do any of this.
@301_Moved_Permanently I did what you suggested, but I'm slightly concerned adding a new variable might spoil bytecode optimisation
@Reinderien it is not the "incorrect var name metavariable". The docs guarantee this.
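For plain positional parameters the slice and inspect.signature return the same names, but they are not interchangeable in general. The sketch below uses a made-up function `scaled` with a keyword-only parameter to show one case where they differ; it is an illustration, not part of either answer.

```python
import inspect


def scaled(c, a, *, factor=1):
    return (c - a) * factor


code = scaled.__code__

# co_argcount counts positional parameters only, so the slice stops
# before the keyword-only `factor`.
print(code.co_varnames[:code.co_argcount])          # ('c', 'a')

# inspect.signature lists every parameter, keyword-only ones included.
print(tuple(inspect.signature(scaled).parameters))  # ('c', 'a', 'factor')
```

With the code above, `pda` would leave `factor` at its default, while `pda2` would try to look up a column named `factor`.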
