Code Review

edited title
Graipher

"Speed up this python function on concordance" → "Concordance index calculation"

deleted 492 characters in body

I am trying to calculate a customized concordance index for survival analysis. Below is my code. It runs well on a small input DataFrame but is extremely slow on a DataFrame with one million rows (>30 min).

import pandas as pd

def c_index1(y_pred, events, times):
    # build one DataFrame from the three input columns
    df = pd.DataFrame(data={'proba': y_pred, 'event': events, 'time': times})
    n_total_correct = 0
    n_total_comparable = 0
    df = df.sort_values(by=['time'])
    for i, row in df.iterrows():
        if row['event'] == 1:
            # censored rows with a strictly larger time are comparable
            comparable_rows = df[(df['event'] == 0) & (df['time'] > row['time'])]
            n_correct_rows = len(comparable_rows[comparable_rows['proba'] < row['proba']])
            n_total_correct += n_correct_rows
            n_total_comparable += len(comparable_rows)
    return n_total_correct / n_total_comparable if n_total_comparable else None

c = c_index1([0.1, 0.3, 0.67, 0.45, 0.56], [1.0, 0.0, 1.0, 0.0, 1.0], [3.1, 4.5, 6.7, 5.2, 3.4])
print(c)  # prints 0.5

For each row (in case it matters...):

  • If the event of the row is 1: retrieve all comparable rows whose
  1. index is larger (avoid duplicate calculation),
  2. event is 0, and
  3. time is larger than the time of the current row. Out of the comparable rows, the rows whose probability is less than the current row's are correct predictions.
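
For instance, with the sample data above the event rows are indices 0, 2 and 4. Row 0 (time 3.1, proba 0.1) is comparable with censored rows 1 and 3 (times 4.5 and 5.2) but neither is a correct prediction; row 2 (time 6.7) has no censored row with a larger time; and row 4 (time 3.4, proba 0.56) is comparable with rows 1 and 3, both correct (0.3 < 0.56 and 0.45 < 0.56). That gives 2 correct out of 4 comparable pairs, i.e. 0.5.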

I guess it is slow because of the for loop. How should I speed it up?
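
The quadratic pairwise scan is indeed the bottleneck: every pair here consists of one event == 1 row and one censored row with a strictly larger time, and a pair counts as correct when the event row's proba is strictly larger. That structure allows a single sweep over the rows in decreasing time order, keeping the probabilities of the censored rows seen so far (all of which have a strictly larger time) in a sorted container. Below is a minimal sketch of that idea; it assumes the third-party sortedcontainers package, and the function name c_index_fast and the grouping by tied times are my own choices, not part of the original code. On the sample data it reproduces 0.5.

import numpy as np
from sortedcontainers import SortedList  # third-party: pip install sortedcontainers

def c_index_fast(y_pred, events, times):
    # hypothetical O(n log n) rewrite of c_index1 (name invented here)
    proba = np.asarray(y_pred, dtype=float)
    event = np.asarray(events, dtype=int)
    time = np.asarray(times, dtype=float)

    order = np.argsort(-time)        # row indices by decreasing time
    censored_probas = SortedList()   # probas of censored rows with strictly larger time
    n_correct = 0
    n_comparable = 0

    i, n = 0, len(order)
    while i < n:
        # group rows sharing the same time so tied times are never compared
        j = i
        while j < n and time[order[j]] == time[order[i]]:
            j += 1
        group = order[i:j]
        # query first: the container holds only strictly later censored rows
        for k in group:
            if event[k] == 1:
                n_comparable += len(censored_probas)
                # bisect_left counts stored probas strictly below proba[k]
                n_correct += censored_probas.bisect_left(proba[k])
        # then admit this group's censored rows for the smaller times to come
        for k in group:
            if event[k] == 0:
                censored_probas.add(proba[k])
        i = j
    return n_correct / n_comparable if n_comparable else None

c = c_index_fast([0.1, 0.3, 0.67, 0.45, 0.56], [1.0, 0.0, 1.0, 0.0, 1.0], [3.1, 4.5, 6.7, 5.2, 3.4])
print(c)  # 0.5, same as c_index1

Each row is inserted and queried at most once, so the sweep costs O(n log n) overall, which should bring a million rows down from tens of minutes to seconds. As an aside, the lifelines package provides a concordance_index utility, but it implements the standard C-index (which also counts pairs of two event subjects), so it would not match this customized variant exactly.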

update list formatting

I am trying to calculate a customized concordance index for survival analysis. Below is my code. It runs well on a small input DataFrame but is extremely slow on a DataFrame with one million rows (>30 min).

import pandas as pd

def c_index(y_pred, events, times):
    df = pd.DataFrame(data={'proba': y_pred, 'event': events, 'time': times})
    n_total_correct = 0
    n_total_comparable = 0
    for i, row in df.iterrows():
        if row['event'] == 0:
            # later-indexed event rows with a strictly smaller time
            comparable_rows = df[(df.index > i) & (df['event'] == 1) & (df['time'] < row['time'])]
            n_correct_rows = len(comparable_rows[comparable_rows['proba'] > row['proba']])
        else:
            # later-indexed censored rows with a strictly larger time
            comparable_rows = df[(df.index > i) & (df['event'] == 0) & (df['time'] > row['time'])]
            n_correct_rows = len(comparable_rows[comparable_rows['proba'] < row['proba']])
        n_total_correct += n_correct_rows
        n_total_comparable += len(comparable_rows)
    return n_total_correct / n_total_comparable if n_total_comparable else None

c = c_index([0.1, 0.3, 0.67, 0.45, 0.56], [1.0, 0.0, 1.0, 0.0, 1.0], [3.1, 4.5, 6.7, 5.2, 3.4])
print(c)  # prints 0.5

For each row (in case it matters...):

  • If the event of the row is 0: retrieve all comparable rows whose

    1. index is larger (avoid duplicate calculation),
    2. event is 1, and
    3. time is less than the time of the current row. Out of the comparable rows, the rows whose probability is more than the current row's are correct predictions.
  • If the event of the row is 1: retrieve all comparable rows whose

    1. index is larger (avoid duplicate calculation),
    2. event is 0, and
    3. time is larger than the time of the current row. Out of the comparable rows, the rows whose probability is less than the current row's are correct predictions.

I guess it is slow because of the for loop. How should I speed it up?
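
Taken together, the two branches count each (event, censored) pair exactly once: a pair is comparable when the censored subject has a strictly larger time than the event subject, and correct when the event subject's proba is strictly larger. One way to see (and test) that equivalence is to vectorize the whole computation with NumPy outer comparisons. This is a hypothetical sketch (c_index_vec is a name invented here); it still does O(n²) pairwise work, so it only suits small-to-medium inputs for validating results, not a million rows.

import numpy as np

def c_index_vec(y_pred, events, times):
    # vectorized restatement of c_index: every comparable pair has one
    # event == 1 subject and one censored subject with a strictly larger
    # time; the pair is correct when the event subject's proba is larger
    proba = np.asarray(y_pred, dtype=float)
    event = np.asarray(events, dtype=float) == 1
    time = np.asarray(times, dtype=float)

    p_ev, t_ev = proba[event], time[event]      # event subjects
    p_cs, t_cs = proba[~event], time[~event]    # censored subjects

    comparable = t_cs[None, :] > t_ev[:, None]  # censored outlives event
    correct = comparable & (p_cs[None, :] < p_ev[:, None])

    n_comparable = comparable.sum()
    return correct.sum() / n_comparable if n_comparable else None

print(c_index_vec([0.1, 0.3, 0.67, 0.45, 0.56], [1.0, 0.0, 1.0, 0.0, 1.0], [3.1, 4.5, 6.7, 5.2, 3.4]))  # 0.5

For the full-size problem the pairwise work itself has to go, e.g. via the time-sorted sweep sketched under the latest revision above.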
