I have a pandas dataframe (called base_mortality) with 1 column and n rows, which has the following form:
age | death_prob
---------------------------
60 | 0.005925
61 | 0.006656
62 | 0.007474
63 | 0.008387
64 | 0.009405
65 | 0.010539
66 | 0.0118
67 | 0.013201
68 | 0.014756
69 | 0.016477
age is the index and death_prob is the probability that a person of a given age will die in the next year. I want to use these death probabilities to project the expected annuity payment that would be paid to an annuitant over the next t years.
Suppose I have 3 annuitants, whose names and ages are contained in a dictionary:
policy_holders = {'John' : 65, 'Mike': 67, 'Alan': 71}
Then I would want to construct a new dataframe whose index is time (rather than age) which has 3 columns (one for each annuitant) and t rows (one for each time step). Each column should specify the probability of death for each policy holder at that time step. For example:
John Mike Alan
0 0.010539 0.013201 0.020486
1 0.011800 0.014756 0.022807
2 0.013201 0.016477 0.025365
3 0.014756 0.018382 0.028179
4 0.016477 0.020486 0.031269
.. ... ... ...
96 1.000000 1.000000 1.000000
97 1.000000 1.000000 1.000000
98 1.000000 1.000000 1.000000
99 1.000000 1.000000 1.000000
100 1.000000 1.000000 1.000000
At present, my code for doing this is as follows:
import pandas as pd
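# The index column is assumed to be named 'age', matching the table above
base_mortality = pd.read_csv('/Users/joshchapman/PycharmProjects/VectorisedAnnuityModel/venv/assumptions/base_mortality.csv', index_col=['age'])
policy_holders = {'John' : 65, 'Mike': 67, 'Alan': 71}
out = pd.DataFrame(index=range(0,101))
for name, age in policy_holders.items():
    # Slice from the holder's current age onwards, then re-index from 0 so rows represent time steps
    out[name] = base_mortality.loc[age:].reset_index()['death_prob']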
out = out.fillna(1)
print(out)
However, my aim is to remove this loop and achieve this using vector operations (i.e. pandas and/or numpy functions). Any suggestions on how I might improve my code to work in this way would be great!
1 Answer
Enter pandas.cut. It returns the bin in which each event lies. You can even pass the labels directly. This way you can reduce it to a Python loop over the people:
import pandas as pd
import numpy as np
age_bins = range(59, 70) # one more than the probabilities
death_prob = [0.005925, 0.006656, 0.007474, 0.008387, 0.009405, 0.010539, 0.0118,
              0.013201, 0.014756, 0.016477]
policy_holders = {'John' : 65, 'Mike': 67, 'Alan': 71}
values = {name: pd.cut(range(age, age + 101), age_bins, labels=death_prob)
          for name, age in policy_holders.items()}
out = pd.DataFrame(values, dtype=np.float64).fillna(1)
print(out)
# John Mike Alan
# 0 0.010539 0.013201 1.0
# 1 0.011800 0.014756 1.0
# 2 0.013201 0.016477 1.0
# 3 0.014756 1.000000 1.0
# 4 0.016477 1.000000 1.0
# .. ... ... ...
# 96 1.000000 1.000000 1.0
# 97 1.000000 1.000000 1.0
# 98 1.000000 1.000000 1.0
# 99 1.000000 1.000000 1.0
# 100 1.000000 1.000000 1.0
#
# [101 rows x 3 columns]
Note that there must be one more bin edge than there are labels, and the edges start at 59 rather than 60, because the bins are interpreted as (59, 60], (60, 61], ..., i.e. including the right edge.
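To make the right-inclusive behaviour concrete, here is a minimal sketch on a toy set of edges. The names `edges`, `ages` and `binned` are only illustrative and are not part of the answer's code.

```python
import pandas as pd

# Four edges define three bins: (59, 60], (60, 61], (61, 62]
edges = [59, 60, 61, 62]
ages = [59, 60, 61, 62, 63]

# With the defaults, each bin excludes its left edge and includes its right edge
binned = pd.cut(ages, edges)
print(binned)

# Integer position of the bin for each age; -1 marks "in no bin" (shown as NaN).
# 59 sits on the open left edge of the first bin and 63 lies past the last edge,
# so both end up outside every bin.
print(list(binned.codes))
```

This is why the answer starts its edges one below the youngest age in the table: age 60 has to land inside the first bin rather than on its open left edge.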
Comments:
Thanks for your help on this one! Quick question though: what if the probabilities are not unique? I've tried replacing the last probability with the second to last and this gives the error "Categorical categories must be unique" from pd.cut. – J R Chapman, Jul 13, 2020 at 10:18
@JRChapman In that case you will have to pass labels=False (or None, not quite sure atm) and use the resulting indices to index into pd.Series(death_prob). See also the first revision of my answer for that. – Graipher, Jul 13, 2020 at 10:27
@JRChapman: It is False, and here is the direct link to that revision: codereview.stackexchange.com/revisions/245225/1 – Graipher, Jul 13, 2020 at 13:30