Construct a numpy array by repeating a 1 dimensional array sliced at different indices

Question 1

I have a pandas dataframe (called base_mortality) with 1 column and n rows, which is of the following form:

 age | death_prob
-----|-----------
  60 | 0.005925
  61 | 0.006656
  62 | 0.007474
  63 | 0.008387
  64 | 0.009405
  65 | 0.010539
  66 | 0.011800
  67 | 0.013201
  68 | 0.014756
  69 | 0.016477

age is the index and death_prob is the probability that a person who is a given age will die in the next year. I want to use these death probabilities to project the expected annuity payment that would be paid to an annuitant over the next t years.

Suppose I have 3 annuitants, whose names and ages are contained in a dictionary:

policy_holders = {'John' : 65, 'Mike': 67, 'Alan': 71}

I then want to construct a new dataframe, indexed by time rather than age, with 3 columns (one per annuitant) and t rows (one per time step). Each column should give the probability of death for that policy holder at each time step. For example:

         John      Mike      Alan
0    0.010539  0.013201  0.020486
1    0.011800  0.014756  0.022807
2    0.013201  0.016477  0.025365
3    0.014756  0.018382  0.028179
4    0.016477  0.020486  0.031269
..        ...       ...       ...
96   1.000000  1.000000  1.000000
97   1.000000  1.000000  1.000000
98   1.000000  1.000000  1.000000
99   1.000000  1.000000  1.000000
100  1.000000  1.000000  1.000000

At present, my code for doing this is as follows:

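import pandas as pd

# Column names follow the table above: 'age' is the index, 'death_prob' the values
base_mortality = pd.read_csv('/Users/joshchapman/PycharmProjects/VectorisedAnnuityModel/venv/assumptions/base_mortality.csv', index_col='age')
policy_holders = {'John': 65, 'Mike': 67, 'Alan': 71}
out = pd.DataFrame(index=range(0, 101))
for name, age in policy_holders.items():
    # Take the probabilities from the holder's current age onwards, re-indexed from t = 0
    out[name] = base_mortality.loc[age:].reset_index()['death_prob']
out = out.fillna(1)  # past the end of the table, death is treated as certain
print(out)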

However, my aim is to remove this loop and achieve this using vector operations (i.e. pandas and/or numpy functions). Any suggestions on how I might improve my code to work in this way would be great!

Graipher · Accepted Answer · 2020-07-09 12:51:13Z

Enter pandas.cut. It returns the bin in which each event lies. You can even pass the labels directly. This way you can reduce it to a Python loop over the people:

import pandas as pd
import numpy as np

age_bins = range(59, 70)  # one more than the probabilities
death_prob = [0.005925, 0.006656, 0.007474, 0.008387, 0.009405, 0.010539, 0.0118,
              0.013201, 0.014756, 0.016477]
policy_holders = {'John': 65, 'Mike': 67, 'Alan': 71}
values = {name: pd.cut(range(age, age + 101), age_bins, labels=death_prob)
          for name, age in policy_holders.items()}
out = pd.DataFrame(values, dtype=np.float64).fillna(1)
print(out)
#          John      Mike  Alan
# 0    0.010539  0.013201   1.0
# 1    0.011800  0.014756   1.0
# 2    0.013201  0.016477   1.0
# 3    0.014756  1.000000   1.0
# 4    0.016477  1.000000   1.0
# ..        ...       ...   ...
# 96   1.000000  1.000000   1.0
# 97   1.000000  1.000000   1.0
# 98   1.000000  1.000000   1.0
# 99   1.000000  1.000000   1.0
# 100  1.000000  1.000000   1.0
#
# [101 rows x 3 columns]

Note that there needs to be one more bin edge than there are labels, because the edges are interpreted as the intervals (59, 60], (60, 61], ..., i.e. each bin includes its right edge.
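To see the right-inclusive intervals concretely, here is a minimal sketch. The shortened three-bin table and the list of test ages are assumptions chosen for illustration; they are not part of the data above.

```python
import pandas as pd

# Illustrative, shortened inputs: 4 edges define 3 right-closed intervals.
age_bins = [59, 60, 61, 62]
death_prob = [0.005925, 0.006656, 0.007474]  # one label per interval
ages = [59, 60, 61, 62, 63]

# Without labels, pd.cut reports the interval each age falls into.
intervals = pd.cut(ages, age_bins)
print(intervals.categories)      # the intervals (59, 60], (60, 61], (61, 62]
print(intervals.codes.tolist())  # [-1, 0, 1, 2, -1]; -1 marks "in no interval"

# With labels, each interval is replaced by its probability. Ages 59 and 63
# lie outside every interval and come back as NaN, which fillna(1) would fill.
labelled = pd.cut(ages, age_bins, labels=death_prob)
print(list(labelled))
```

Age 60 lands in the first interval because the right edge is included, while age 59 lands nowhere because the left edge is excluded. That is why the bins in the answer start at 59 rather than 60.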

Thanks for your help on this one! Quick question though: what if the probabilities are not unique? I've tried replacing the last probability with the second to last, and pd.cut then raises the error "Categorical categories must be unique".
@JRChapman In that case you will have to pass labels=False (or None, not quite sure atm) and use the resulting indices to index into pd.Series(death_prob). See also the first revision of my answer for that.
@JRChapman: It is False, and here is the direct link to that revision: codereview.stackexchange.com/revisions/245225/1
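
A minimal sketch of the labels=False idea from the comments above, for the case where the probabilities are not unique. This is not a copy of the linked revision: the helper name death_probs and its horizon parameter are hypothetical, and it indexes into a NumPy array rather than pd.Series(death_prob).

```python
import numpy as np
import pandas as pd

age_bins = np.arange(59, 70)  # 11 edges -> 10 right-closed intervals
# Last two probabilities are deliberately equal to mimic the non-unique case.
death_prob = np.array([0.005925, 0.006656, 0.007474, 0.008387, 0.009405,
                       0.010539, 0.0118, 0.013201, 0.014756, 0.014756])
policy_holders = {'John': 65, 'Mike': 67, 'Alan': 71}


def death_probs(age, horizon=101):
    """Hypothetical helper: death probability at each time step for one person."""
    ages = np.arange(age, age + horizon)
    # labels=False yields the integer bin index per age (NaN if out of range).
    idx = pd.cut(ages, age_bins, labels=False)
    probs = np.ones(horizon)  # default of 1 beyond the end of the table
    in_range = ~np.isnan(idx)
    probs[in_range] = death_prob[idx[in_range].astype(int)]
    return probs


out = pd.DataFrame({name: death_probs(age) for name, age in policy_holders.items()})
print(out)
```

Because the bin indices are plain integers rather than categorical labels, repeated probabilities cause no error. There is still one Python-level iteration per policy holder, as in the answer.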
