Update: Okay, after trying to use this for a while, I think it's probably a bad idea. If really necessary, please use (lambda x: x["a"] + x["b"])(df), or use df.assign(c=lambda x: x["a"] + x["b"]) (with CoW enabled for performance reasons), which supports chaining!
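As a minimal sketch of the two alternatives recommended in the update (the column name d and the variable names result and total are made up for illustration; Copy-on-Write is left at whatever the installed pandas defaults to, since how it is enabled depends on the pandas version):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})

# Each assign call returns a new DataFrame, so derived columns can be chained.
# The lambda receives the intermediate frame, so "d" can build on "c".
result = (
    df
    .assign(c=lambda x: x["a"] + x["b"])
    .assign(d=lambda x: x["c"] * 2)
)

# One-off expression on the frame without repeating its variable name.
total = (lambda x: x["a"] + x["b"])(df)

print(result)
print(total)
```

The original df is left untouched by assign, which is what makes the chained form safe to use in the middle of a longer expression.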
I've a syntactic sugar hack to make it easier to create and temporarily use derived columns from DataFrames by applying a function on the columns, and I welcome any comments! Here is the code:
import pandas as pd
from typing import Callable

def pda(df: pd.DataFrame, f: Callable, numpy: bool = True):
    return (f(*(df[col].values
                for col in f.__code__.co_varnames)) if numpy else f(
                    *(df[col] for col in f.__code__.co_varnames)))

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
df["c"] = pda(df, lambda a, b: a + b)
print(df)
This results in:
   a  b  c
0  1  2  3
1  2  3  5
2  3  4  7
3  4  5  9
Advantages:
- Python prettifying and syntax highlighting on function code (as compared to df["c"] = df.eval("a + b"))
- No need to repeat DataFrame variable name (as compared to df["c"] = df["a"] + df["b"])
- Possible to create temporary numpy arrays, and probably better performance (as compared to df = df.assign(c=lambda x: x["a"] + x["b"]))
3 Answers
Starting broadly: this relies on reflection, which is not unheard of in the data analytics ecosystem (see e.g. curve_fit's use of argspec). So it wouldn't be entirely without precedent, but it's still in a broad sense not very Pythonic (PEP 20's "explicit is better than implicit"). This very much relies on magical, implicit behaviour, and for that reason alone it isn't a wonderful idea.
"Python prettifying and syntax highlighting" is less important than the related, but fairly different, static analysis. Your approach is only better in terms of static analysis if you jettison the lambda and write an actual function with good type hints; otherwise, it's only marginally better than eval.
"Possible to create temporary numpy arrays, and probably better performance" is dubious, and I will not place any belief in this unless I see a benchmark.
Crucially, __code__.co_varnames is wrong; read the docs:
"tuple of names of arguments and local variables"
If you have a local variable with the same name as a column of the DataFrame, you'll attempt to pass that column in as well, and then explosions. Use inspect.signature instead.
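To make the difference concrete, here is a small standard-library sketch (the function add and its local variable total are made up for illustration) comparing what co_varnames and inspect.signature report:

```python
import inspect


def add(a, b):
    total = a + b  # a local variable, not a parameter
    return total


# co_varnames lists the parameters first, followed by every local variable.
all_names = add.__code__.co_varnames

# inspect.signature reports the parameters only.
from_signature = tuple(inspect.signature(add).parameters)

print(all_names)       # ('a', 'b', 'total')
print(from_signature)  # ('a', 'b')
```

With the unsliced co_varnames, a helper like pda would also try to look up a column called total.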
A much simpler technique that I think does cross the line into "worth doing, sometimes" relies on the fact that a DataFrame is already a map-like:
import pandas as pd

def add(a: pd.Series, b: pd.Series) -> pd.Series:
    c = a + b
    return c

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
df["c"] = add(**df)
print(df)
- Thanks for pointing out that problem with __code__.co_varnames. I think it is still guaranteed to start with the argument names in the order they appear, so slicing it works. – user1537366, Oct 10, 2023 at 3:54
- Your idea about using the map-like property of the DataFrame does not work as soon as it has more columns than you need in the function, and then you will need to slice the DataFrame and this requires repeating the argument names again. ((lambda a, b: a + b)(**{"a": 10, "b": 20, "c": 30}) throws an error) – user1537366, Oct 10, 2023 at 4:01
- @user1537366 that's deliberate, but if you don't like it, just add a **kwargs. – Reinderien, Oct 10, 2023 at 11:49
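The exchange above can be sketched as follows. The extra column z, the catch-all name _unused and the function strict_add are invented for illustration; the sketch assumes every column label is a string that is valid as a keyword argument name:

```python
import pandas as pd


def add(a: pd.Series, b: pd.Series, **_unused: pd.Series) -> pd.Series:
    # **_unused swallows any columns this function does not need.
    return a + b


def strict_add(a: pd.Series, b: pd.Series) -> pd.Series:
    # No catch-all: any extra column becomes an unexpected keyword argument.
    return a + b


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5], "z": [0, 0, 0, 0]})

# ** unpacking works because a DataFrame provides keys() (the column labels)
# and item lookup by label.
df["c"] = add(**df)

error = None
try:
    strict_add(**df)
except TypeError as exc:
    error = str(exc)

print(df)
print(error)
```

Whether the strict failure or the permissive catch-all is preferable depends on whether a surprise column should be treated as a bug.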
I agree with @Reinderien.
docstring
pda lacks a docstring, and it absolutely needs one. Consider using doctest notation at the end of it.
one function or two
def pda( ... , numpy: bool = True):
Thank you for the type hinting. It's not clear that a "numpy" parameter is a win, here. Consider offering a pair of functions instead, perhaps pda and pda_numpy.
conditional
... if numpy else ...
Sandwiching an if between large expressions is not helping readability. Prefer
if numpy:
    return ...
else:
    return ...
Readability might be improved if we DRY this up a bit. Consider assigning df[col].values or df[col] to a temp var, and then work with that.
(Since you're keen on automagic, perhaps use getattr to probe for a "values" attribute, and then we don't need a numpy flag? But it's possible we get a spurious "values" hit. Maybe consult isinstance?)
- Thanks! Edited and incorporated many of your suggestions. – user1537366, Oct 10, 2023 at 3:53
As the original poster, I have revised the code based on the many answers as follows:
- Add a docstring
- Use the magic doctest for unit testing
- Removed the numpy parameter (a separate function would probably be better)
- Separate the if-else expression into an if-else block
- Renamed the numpy variable to use_numpy for clarity
- Use slicing to extract the correct part of co_varnames which corresponds to the argument names. The docs seem to imply that this works:
co_varnames
"Returns a tuple containing the names of the local variables (starting with the argument names)."
Using inspect.signature instead of co_varnames causes a performance hit, so I reverted to using co_varnames.
import pandas as pd
from typing import Callable


def pda(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    """Performs a function `f` on columns of DataFrame `df`,
    as NumPy arrays or as Pandas' Series.

    Function `f` will be performed on the columns of `df`
    corresponding to the argument names of `f`.

    Args:
        df (pd.DataFrame): input DataFrame
        f (Callable): function to be performed
        use_numpy (bool, optional, defaults to True): use NumPy arrays instead of Series

    Returns:
        resulting numpy array if `use_numpy` else resulting Series

    Example:
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df["e"] = pda(df, lambda c, a: c - a)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    if use_numpy:
        return f(*(df[f.__code__.co_varnames[i]].values
                   for i in range(f.__code__.co_argcount)))
    else:
        return f(*(df[f.__code__.co_varnames[i]]
                   for i in range(f.__code__.co_argcount)))


if __name__ == "__main__":
    import doctest
    doctest.testmod()
I also did some timing comparisons between the methods.
#!/usr/bin/env python3
import inspect
import random
from collections import defaultdict
from typing import Callable

import numpy as np
import pandas as pd


def main():
    import doctest
    doctest.testmod()
    import timeit
    df = pd.DataFrame({
        "d": np.random.random(100000),
        "a": np.random.random(100000),
        "c": np.random.random(100000),
        "b": np.random.random(100000)
    })
    tests = [
        test_pda, test_pda_series, test_pda2, test_lambda, test_eval,
        test_index, test_dot, test_assign
    ]
    # Accumulated wall-clock seconds per test, summed over all rounds.
    timings = defaultdict(float)
    for i in range(1000):
        # Randomise the order each round so no test always runs in the same position.
        random.shuffle(tests)
        for test in tests:
            timings[test.__name__] += timeit.timeit("test(df)",
                                                    number=1,
                                                    globals={
                                                        "test": test,
                                                        "df": df
                                                    })
    for test_name, timing in timings.items():
        print(test_name, timing)


def test_pda(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_pda(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = pda(df, lambda c, a: c - a)
    return df


def test_pda_series(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_pda_series(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = pda(df, lambda c, a: c - a, False)
    return df


def test_pda2(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_pda2(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = pda2(df, lambda c, a: c - a)
    return df


def test_lambda(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_lambda(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = (lambda x: x["c"].values - x["a"].values)(df)
    return df


def test_eval(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_eval(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = df.eval("c - a")
    return df


def test_index(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_index(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = df["c"].values - df["a"].values
    return df


def test_dot(df):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_dot(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    df["e"] = df.c.values - df.a.values
    return df


def test_assign(df: pd.DataFrame):
    """
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df = test_assign(df)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    return df.assign(e=lambda x: x["c"].values - x["a"].values)


def pda(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    """Performs a function `f` on columns of DataFrame `df`,
    as NumPy arrays or as Pandas' Series.

    Function `f` will be performed on the columns of `df`
    corresponding to the argument names of `f`.

    Args:
        df (pd.DataFrame): input DataFrame
        f (Callable): function to be performed
        use_numpy (bool, optional, defaults to True): use NumPy arrays instead of Series

    Returns:
        resulting numpy array if `use_numpy` else resulting Series

    Example:
    ```
    >>> df = pd.DataFrame({
    ...     "d": [1, 2, 3, 4],
    ...     "a": [2, 3, 4, 5],
    ...     "c": [3, 4, 5, 6],
    ...     "b": [4, 5, 6, 7]
    ... })
    >>> df["e"] = pda(df, lambda c, a: c - a)
    >>> print(df)
       d  a  c  b  e
    0  1  2  3  4  1
    1  2  3  4  5  1
    2  3  4  5  6  1
    3  4  5  6  7  1

    ```
    """
    if use_numpy:
        return f(*(df[f.__code__.co_varnames[i]].values
                   for i in range(f.__code__.co_argcount)))
    else:
        return f(*(df[f.__code__.co_varnames[i]]
                   for i in range(f.__code__.co_argcount)))


def pda2(df: pd.DataFrame, f: Callable, use_numpy: bool = True):
    # Same as pda, but reads the parameter names via inspect.signature.
    if use_numpy:
        return f(*(df[param.name].values
                   for param in inspect.signature(f).parameters.values()))
    else:
        return f(*(df[param.name]
                   for param in inspect.signature(f).parameters.values()))


if __name__ == "__main__":
    main()
The results for Python 3.11.2, Pandas 2.1.1 and NumPy 1.26.0 show that pda is, surprisingly, on par with the best of the other methods (indexing and member access) in terms of performance. As expected, .assign has terrible performance because it is copying the entire DataFrame.
Timings (total seconds over 1000 runs of each test; lower is better):
test_index       0.16944104398862692
test_assign      2.891109986925585
test_pda         0.1570397199393483
test_eval        0.8307543109549442
test_pda2        0.18781333995138993
test_lambda      0.1599503229081165
test_dot         0.16240537503472297
test_pda_series  0.2198283309226099
- Maybe consider adding an arg_names = f.__code__.co_varnames[:f.__code__.co_argcount] before the if to reduce line length and ease overall comprehension. – 301_Moved_Permanently, Oct 10, 2023 at 9:30
- Instead of the argcount band-aid on what is still the incorrect var name metavariable, you really should just call the better API (inspect.signature) - or, really, not do any of this. – Reinderien, Oct 10, 2023 at 14:10
- @SᴀᴍOnᴇᴌᴀ I've made edits to my post – user1537366, Oct 11, 2023 at 5:56
- @301_Moved_Permanently I did what you suggested, but I'm slightly concerned adding a new variable might spoil bytecode optimisation – user1537366, Oct 11, 2023 at 6:02
- @Reinderien it is not the "incorrect var name metavariable". The docs guarantee this. – user1537366, Oct 11, 2023 at 6:04
- df['i'] = pda(df, lambda _, __, c, ___, ____, f, _____, ______: c + f) instead of df['i'] = df.c + df.f. Is that right?
- df['i'] = pda(df, lambda c, f: c + f)
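On the slicing of co_varnames by co_argcount debated in the comments above, one detail may be worth checking before relying on it: co_argcount counts positional parameters only, so keyword-only parameters are not covered by the slice. A small standard-library sketch (the function f and its parameter scale are made up for illustration):

```python
import inspect


def f(a, b, *, scale):
    tmp = (a + b) * scale  # local variable
    return tmp


code = f.__code__

# Parameters come first in co_varnames, then the local variables.
print(code.co_varnames)  # ('a', 'b', 'scale', 'tmp')

# Slicing by co_argcount keeps the positional parameters only.
positional = code.co_varnames[:code.co_argcount]

# Keyword-only parameters are counted separately in co_kwonlyargcount.
with_kwonly = code.co_varnames[:code.co_argcount + code.co_kwonlyargcount]

# inspect.signature lists all parameters and no locals.
from_signature = tuple(inspect.signature(f).parameters)

print(positional)      # ('a', 'b')
print(with_kwonly)     # ('a', 'b', 'scale')
print(from_signature)  # ('a', 'b', 'scale')
```

For plain lambdas with positional parameters, as used throughout this thread, the slice and the signature agree.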