Convert list of dictionaries to iterable list

Question 1

Due to another module coupled to my function, I can only receive the input to my part in the form of a JSON object structured roughly like this:

[{'id':0, 'y':4, 'value':25},{'id':0, 'y':2, 'value':254}]

I know that the data will always arrive in exactly that format. I need to convert it to lists that I can pass to the constructor of scipy.sparse.coo_matrix(). Because calls come in frequently, I want this conversion to be as fast as possible. Below are three approaches that commonly come up for this type of question. The conversion itself has been addressed many times on Stack Overflow, in some cases with respect to performance (mostly in terms of list comprehensions), but I could not find anything that gives me an optimal solution.

To quickly address the three different methods I use:

Evaluate each entry as a tuple of the dictionary values. Comparable in speed to something like [[el['id'], el['y'], el['value']] for el in x].
Let pandas do the conversion for you. Extremely slow.
Build three separate lists. Since allocating lists inside a list comprehension is much slower (compare [[el['id']] for el in x] to [el['id'] for el in x]), this currently seems to be the best-performing solution.

According to the articles I found, list comprehensions outperform any Python-native method using .append(), but I might add an example and timing for that later.

The benchmarks are as follows:

import timeit as ti
# 2.107 seconds
print(ti.timeit("z = [tuple(el.values()) for el in x]",
    setup="import random; x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*60000",
    number=100))
# 8.93 seconds
print(ti.timeit("z = pd.DataFrame(x)",
    setup="import pandas as pd; import random; x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*60000",
    number=100))
# 0.717 seconds
print(ti.timeit("z1 = [el['id'] for el in x]; z2 = [el['y'] for el in x]; z3 = [el['value'] for el in x]",
    setup="import random; x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*60000",
    number=100))
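
The claim that list comprehensions outperform .append() could be checked with something like the following. This is only a minimal sketch, assuming test data of the same shape as in the question; the helper names ids_with_append and ids_with_comprehension are made up for illustration, and the timings will vary by machine and Python version.

```python
import random
import timeit

# Test data shaped like the question's input (one dict repeated many times).
x = [{'id': random.randint(0, 5),
      'y': random.randint(0, 5),
      'value': random.randint(0, 500)}] * 60000


def ids_with_append(rows):
    """Collect the 'id' field with an explicit loop and .append()."""
    out = []
    for el in rows:
        out.append(el['id'])
    return out


def ids_with_comprehension(rows):
    """Collect the 'id' field with a list comprehension."""
    return [el['id'] for el in rows]


if __name__ == "__main__":
    # Time both hypothetical helpers on the same data.
    for fn in (ids_with_append, ids_with_comprehension):
        elapsed = timeit.timeit(lambda: fn(x), number=100)
        print(f"{fn.__name__}: {elapsed:.3f} s")
```

Both helpers return the same list, so any difference in the printed times comes from how the list is built.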

I also include the "raw code" for the three snippets:

import random
import pandas as pd
if __name__ == "__main__":
    # ignore the fact that this actually isn't random for individual values
    x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*60000
    # first method
    z1 = [tuple(el.values()) for el in x]
    # second method
    z2 = pd.DataFrame(x)
    # third method
    z_3a = [el['id'] for el in x]
    z_3b = [el['y'] for el in x]
    z_3c = [el['value'] for el in x]

The question is whether there is any way to significantly improve on this (maybe with a specialized library I don't know of, or some trick with numpy). I'm currently assuming that, following the 80/20 principle, I'm unlikely to get more performance out of this without spending a lot more effort on it...

Question 2

Hi, it looks like you're asking for a comparative review, but it's hard to tell this from the code that you've posted. Would you be willing to post the three methods you're using as a Python script, as opposed to three timeits?

Question 3

Your example is missing import statements.

Question 4

I added the relevant parts as an "explicit code fragment" as well.

Question 5

~~Does the scipy.sparse.coo_matrix() function accept three different lists as parameters?~~

~~Maybe I'm missing something but your "fastest" method is not actually the fastest, since you'd still need some conversion to parse it to the function anyhow.~~

I think your timing is off

For this one z = [tuple(el.values()) for el in x] you say:

Evaluate each entry as a tuple of the dictionary values. Comparable in speed to something like [[el['id'], el['y'], el['value']] for el in x].

I tested this myself, and the second suggestion seems to be about twice as fast as the first method.
```
Setup: x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*100000
z = [tuple(el.values()) for el in x]
3.070572477
z1 = [[el['id'], el['y'], el['value']] for el in x]
1.5449943620000006
```

This could also be a generator

def gen_vals(d):
    # Yield one [id, y, value] list per dictionary.
    for el in d:
        yield [el['id'], el['y'], el['value']]

This did not improve speed, but I include it just to bounce some ideas around:

Setup: x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*100000
z1 = [el['id'] for el in x]; z2 = [el['y'] for el in x]; z3 = [el['value'] for el in x]
1.244481049000001
z1 = [[el['id'], el['y'], el['value']] for el in x]
1.5449943620000006
z1 = list(gen_vals(x))
1.8347137390000015
z1 = [a for a in gen_vals(x)]
2.018111815000001
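
Another idea in the same spirit, not included in the timings above, is to pull all three fields per dictionary with operator.itemgetter and transpose the result with zip. This is only a sketch: the helper name split_columns is hypothetical, it returns tuples rather than lists, and whether it beats three separate comprehensions would need to be measured.

```python
from operator import itemgetter


def split_columns(rows):
    """Return (ids, ys, values) as three tuples from a list of dicts.

    Assumes rows is non-empty; with an empty list the unpacking below
    raises ValueError.
    """
    get_fields = itemgetter('id', 'y', 'value')
    # map() yields one (id, y, value) tuple per dict; zip(*...) transposes them.
    ids, ys, values = zip(*map(get_fields, rows))
    return ids, ys, values


sample = [{'id': 0, 'y': 4, 'value': 25}, {'id': 0, 'y': 2, 'value': 254}]
ids, ys, values = split_columns(sample)
print(ids, ys, values)  # (0, 0) (4, 2) (25, 254)
```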

Timing code for reference

import timeit as ti
import random

def gen_vals(d):
    # Yield one [id, y, value] list per dictionary.
    for el in d:
        yield [el['id'], el['y'], el['value']]

# SETUP
x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*100000
print("Setup: x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*100000")

# Each statement is timed over 100 runs; x (and gen_vals) come from this script.
print("z = [tuple(el.values()) for el in x]")
print(ti.timeit("z = [tuple(el.values()) for el in x]",
    setup="from __main__ import x",
    number=100))
print("z = pd.DataFrame(x)")
print(ti.timeit("z = pd.DataFrame(x)",
    setup="import pandas as pd; from __main__ import x",
    number=100))
print("z1 = [el['id'] for el in x]; z2 = [el['y'] for el in x]; z3 = [el['value'] for el in x]")
print(ti.timeit("z1 = [el['id'] for el in x]; z2 = [el['y'] for el in x]; z3 = [el['value'] for el in x]",
    setup="from __main__ import x",
    number=100))
print("z1 = [[el['id'], el['y'], el['value']] for el in x]")
print(ti.timeit("z1 = [[el['id'], el['y'], el['value']] for el in x]",
    setup="from __main__ import x",
    number=100))
print("z1 = list(gen_vals(x))")
print(ti.timeit("z1 = list(gen_vals(x))",
    setup="from __main__ import x, gen_vals",
    number=100))
print("z1 = [a for a in gen_vals(x)]")
print(ti.timeit("z1 = [a for a in gen_vals(x)]",
    setup="from __main__ import x, gen_vals",
    number=100))

Question 6

The constructor for coo_matrix() (for a "from scratch" creation) accepts the form of (<value_list>, (<x_coordinate_list>, <y_coordinate_list>)), so those are generally three separate lists. Also thanks for the detailed comparison, including the idea with the generator!
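
To make that argument form concrete, here is a minimal sketch that arranges the three lists into the nested tuple described above. The helper name to_coo_args is made up, and treating 'id' as the first coordinate and 'y' as the second is an assumption, since the thread does not say which key maps to which axis.

```python
def to_coo_args(rows):
    """Build (values, (first_coords, second_coords)) from a list of dicts.

    Assumption: 'id' is the first coordinate and 'y' the second.
    """
    values = [el['value'] for el in rows]
    first_coords = [el['id'] for el in rows]
    second_coords = [el['y'] for el in rows]
    return values, (first_coords, second_coords)


sample = [{'id': 0, 'y': 4, 'value': 25}, {'id': 0, 'y': 2, 'value': 254}]
coo_args = to_coo_args(sample)
print(coo_args)  # ([25, 254], ([0, 0], [4, 2]))
```

With scipy installed, the resulting tuple could then be passed on as scipy.sparse.coo_matrix(coo_args), optionally with an explicit shape argument.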

Ludisposed · Accepted Answer · 2018-09-05 08:46:41Z

