Creating a list from combinations of the original data

Question 1

I am trying to create a list based on some data, but the code I am using is very slow when I run it on large data. So I suspect I am not using all of the Python power for this task. Is there a more efficient and faster way of doing this in Python?

Here an explanantion of the code:

You can think of this problem as a list of games (list_type) each with a list of participating teams and the scores for each team in the game (list_xx).For each of the pairs in the current game it first calculate the sum of the differences in score from the previous competitions (win_comp_past_difs); including only the pairs in the current game. Then it update each pair in the current game with the difference in scores. Using a defaultdict keeps track of the scores for each pair in each game and update this score as each game is played.

In the example below, based on some data, there are for-loops used to create a new variable list_zz.

The data and the for-loop code:

import pandas as pd
import numpy as np
from collections import defaultdict
from itertools import permutations
list_type = [['A', 'B'], ['B'], ['A', 'B', 'C', 'D', 'E'], ['B'], ['A', 'B', 'C'], ['A'], ['B', 'C'], ['A', 'B'], ['C', 'A', 'B'], ['A'], ['B', 'C']]
list_xx = [[1.0, 5.0], [3.0], [2.0, 7.0, 3.0, 1.0, 6.0], [3.0], [5.0, 2.0, 3.0], [1.0], [9.0, 3.0], [2.0, 7.0], [3.0, 6.0, 8.0], [2.0], [7.0, 9.0]]
list_zz= []
#for-loop
wd = defaultdict(float)
for i, x in zip(list_type, list_xx):
 # staff 1
 if len(i) == 1:
 #print('NaN')
 list_zz.append(np.nan)
 continue
 # Pairs and difference generator for current game (i)
 pairs = list(permutations(i, 2))
 dgen = (value[0] - value[1] for value in permutations(x, 2))
 # Sum of differences from previous games incluiding only pair of teams in the current game
 for team, result in zip(i, x):
 win_comp_past_difs = sum(wd[key] for key in pairs if key[0] == team)
 #print(win_comp_past_difs)
 list_zz.append(win_comp_past_difs)
 # Update pair differences for current game
 for pair, diff in zip(pairs, dgen):
 wd[pair] += diff
print(list_zz)

Which looks like this:

[0.0,
 0.0,
 nan,
 -4.0,
 4.0,
 0.0,
 0.0,
 0.0,
 nan,
 -10.0,
 13.0,
 -3.0,
 nan,
 3.0,
 -3.0,
 -6.0,
 6.0,
 -10.0,
 -10.0,
 20.0,
 nan,
 14.0,
 -14.0]

If you could elaborate on the code to make it more efficient and execute faster, I would really appreciate it.

Question 2

Why import pandas? You don’t use it. Why import numpy just for nan, when you could simply use math.nan?

Question 3

I usually put it in my code in case I try Pandas or Numpy codes. This time I used numpy, but I also tried things using Pandas.

Question 4

Welcome to Code Review! Please edit your question so that the title describes the purpose of the code, rather than its mechanism. We really need to understand the motivational context to give good reviews. Thanks!

Question 5

I don't really understand what your output is describing - can you elaborate a little bit more on what those values are representing, please?

Question 6

These are games, where each team has score in a competition. The competition can be with many different players or even only one. What my code does is to create an indicator of the difference of the score in the past with the players that are playing.

Question 7

The main way to do this faster is to use fewer loops and to do more in each loop. (AFAICT)

loops in there...

A (for event)
A.1 (make team pairs)
A.2 (make dgen)
A.3 (for team, result in event)
A.3.1.1 (for key in pairs)
A.3.1.2 (sum)
A.4 (for pair diff)

You have a lot more flexibility if you use the for loop instead of mostly comprehensions and permutations. And, permutations is wasteful, here, (as discussed, below) because it prevents a few optimisations.

A key change is we can do A vs B and B vs A in the same pass of the loop. So, we can change our loop from "for each team go through all teams" to "for each team go through all following teams". It's a big improvement.

For the line-up A, B, C, D, our iterations look like:

AB & BA
AC & CA
AD & DA

.

BC & CB
BD & DB

.

CD & DC
(done)

So, the algorithm might end up something like (in pseudo code):

past_score_differences = defaultDict(float)
for each event:
 team_outcomes = defaultDict(float)
 teams_len = len(teams in event)
 for i in range(teams_len - 1):
 home_team = teams[i]
 home_score = scores[i]
 for j in range(i+1, teams_len):
 away_team = teams[j]
 away_score = scores[j]
 team_outcomes[home_team] += past_score_differences[(home_team, away_team)]
 team_outcomes[away_team] += past_score_differences[(away_team, home_team)]
 past_score_differences[(home_team, away_team)] += home_score - away_score
 past_score_differences[(away_team, home_team)] += away_score - home_score
 list_zz.append(team_outcomes[home_team])
 # the last team doesn't go through the outer loop
 list_zz.append(team_outcomes[teams[-1]])

Still a lot of loops, but far fewer and with fewer iterations for the innermost loop.

If there is a lot of consistency in the team IDs, then you might consider not using a tuple as the key but instead using a dict within a dict, e.g.

example_past_differences = {
 'A': {
 'B': 10,
 'C': 11,
 'Etc...': 99
 }
}
# later, when accessing
home_team = ...
home_score = ...
home_past_differences = past_score_differences[home_team]
for ...
 away_past_differences = past_score_differences[away_team]
 ...
 team_outcomes[home_team] += home_past_differences[away_team]
 team_outcomes[away_team] += away_past_differences[home_team]

One last thing, your variable names are not easy to understand. I made mine way too long for accesibility to understand, but generally I think the guidance in these slides is very useful: https://talks.golang.org/2014/names.slide#1

Question 8

Thank you for your answer tiffon. Is it possible that you edit your answer with the names i have in my question. I get a little confused. Thanks!

Question 9

Took a guess at better variable names, but no idea what list_zz or wd should be.

Switched to combinations() instead of permutations() and process each player pair both ways (e.g. ('A','B') and ('B','A')).

Do combinations of player,score tuples, so there is only one call to combinations().

Use nested dicts instead of dicts keyed by a tuple of players.

Counter() is handy because .update() adds values rather than replacing them.

The code should be a function (or method).

from collections import Counter, defaultdict
from itertools import combinations
import math
# test data
games = [['A', 'B'], ['B'], ['A', 'B', 'C', 'D', 'E'], ['B'], ['A', 'B', 'C'], ['A'], ['B', 'C'], ['A', 'B'], ['C', 'A', 'B'], ['A'], ['B', 'C']]
gamescores = [[1.0, 5.0], [3.0], [2.0, 7.0, 3.0, 1.0, 6.0], [3.0], [5.0, 2.0, 3.0], [1.0], [9.0, 3.0], [2.0, 7.0], [3.0, 6.0, 8.0], [2.0], [7.0, 9.0]]
list_zz= []
wd = defaultdict(Counter)
past_diffs = defaultdict(float)
this_diff = defaultdict(Counter)
for players, scores in zip(games, gamescores):
 if len(players) == 1:
 list_zz.append(math.nan)
 continue
 past_diffs.clear()
 this_diff.clear()
 for (player1, score1), (player2, score2) in combinations(zip(players, scores), 2):
 past_diffs[player1] += wd[player1][player2]
 past_diffs[player2] += wd[player2][player1]
 this_diff[player1][player2] = score1 - score2
 this_diff[player2][player1] = score2 - score1
 list_zz.extend(past_diffs[p] for p in players)
 for player in players:
 wd[player].update(this_diff[player])
print(list_zz)

tiffon tiffon 6414 silver badges9 bronze badges · Answer 1 · 2020-02-20 10:51:59Z

The main way to do this faster is to use fewer loops and to do more in each loop. (AFAICT)

loops in there...

A (for event)
A.1 (make team pairs)
A.2 (make dgen)
A.3 (for team, result in event)
A.3.1.1 (for key in pairs)
A.3.1.2 (sum)
A.4 (for pair diff)

You have a lot more flexibility if you use the for loop instead of mostly comprehensions and permutations. And, permutations is wasteful, here, (as discussed, below) because it prevents a few optimisations.

A key change is we can do A vs B and B vs A in the same pass of the loop. So, we can change our loop from "for each team go through all teams" to "for each team go through all following teams". It's a big improvement.

For the line-up A, B, C, D, our iterations look like:

AB & BA
AC & CA
AD & DA

.

BC & CB
BD & DB

.

CD & DC
(done)

So, the algorithm might end up something like (in pseudo code):

past_score_differences = defaultDict(float)
for each event:
 team_outcomes = defaultDict(float)
 teams_len = len(teams in event)
 for i in range(teams_len - 1):
 home_team = teams[i]
 home_score = scores[i]
 for j in range(i+1, teams_len):
 away_team = teams[j]
 away_score = scores[j]
 team_outcomes[home_team] += past_score_differences[(home_team, away_team)]
 team_outcomes[away_team] += past_score_differences[(away_team, home_team)]
 past_score_differences[(home_team, away_team)] += home_score - away_score
 past_score_differences[(away_team, home_team)] += away_score - home_score
 list_zz.append(team_outcomes[home_team])
 # the last team doesn't go through the outer loop
 list_zz.append(team_outcomes[teams[-1]])

Still a lot of loops, but far fewer and with fewer iterations for the innermost loop.

If there is a lot of consistency in the team IDs, then you might consider not using a tuple as the key but instead using a dict within a dict, e.g.

example_past_differences = {
 'A': {
 'B': 10,
 'C': 11,
 'Etc...': 99
 }
}
# later, when accessing
home_team = ...
home_score = ...
home_past_differences = past_score_differences[home_team]
for ...
 away_past_differences = past_score_differences[away_team]
 ...
 team_outcomes[home_team] += home_past_differences[away_team]
 team_outcomes[away_team] += away_past_differences[home_team]

One last thing, your variable names are not easy to understand. I made mine way too long for accesibility to understand, but generally I think the guidance in these slides is very useful: https://talks.golang.org/2014/names.slide#1

Thank you for your answer tiffon. Is it possible that you edit your answer with the names i have in my question. I get a little confused. Thanks!

RootTwo RootTwo 10.6k1 gold badge14 silver badges30 bronze badges · Answer 2 · 2020-02-20 16:11:16Z

Took a guess at better variable names, but no idea what list_zz or wd should be.

Switched to combinations() instead of permutations() and process each player pair both ways (e.g. ('A','B') and ('B','A')).

Do combinations of player,score tuples, so there is only one call to combinations().

Use nested dicts instead of dicts keyed by a tuple of players.

Counter() is handy because .update() adds values rather than replacing them.

The code should be a function (or method).

from collections import Counter, defaultdict
from itertools import combinations
import math
# test data
games = [['A', 'B'], ['B'], ['A', 'B', 'C', 'D', 'E'], ['B'], ['A', 'B', 'C'], ['A'], ['B', 'C'], ['A', 'B'], ['C', 'A', 'B'], ['A'], ['B', 'C']]
gamescores = [[1.0, 5.0], [3.0], [2.0, 7.0, 3.0, 1.0, 6.0], [3.0], [5.0, 2.0, 3.0], [1.0], [9.0, 3.0], [2.0, 7.0], [3.0, 6.0, 8.0], [2.0], [7.0, 9.0]]
list_zz= []
wd = defaultdict(Counter)
past_diffs = defaultdict(float)
this_diff = defaultdict(Counter)
for players, scores in zip(games, gamescores):
 if len(players) == 1:
 list_zz.append(math.nan)
 continue
 past_diffs.clear()
 this_diff.clear()
 for (player1, score1), (player2, score2) in combinations(zip(players, scores), 2):
 past_diffs[player1] += wd[player1][player2]
 past_diffs[player2] += wd[player2][player1]
 this_diff[player1][player2] = score1 - score2
 this_diff[player2][player1] = score2 - score1
 list_zz.extend(past_diffs[p] for p in players)
 for player in players:
 wd[player].update(this_diff[player])
print(list_zz)

Stack Exchange Network

Creating a list from combinations of the original data

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Creating a list from combinations of the original data

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions