Due to another module coupled to my function, I can only receive the input to my part in the form of a JSON object structured roughly like this:
[{'id':0, 'y':4, 'value':25},{'id':0, 'y':2, 'value':254}]
Note that I do know that the data will arrive in exactly that format. Now, I need to convert this into lists that I can pass to the constructor of scipy.sparse.coo_matrix().
Since I have frequent incoming calls, I want to perform this conversion as quickly as possible, which is why I am concerned with optimizing this operation. Below are three different approaches that come up with this type of question. Note that the conversion itself has been addressed frequently on Stack Overflow, in some cases even with respect to performance (mostly in terms of list comprehensions), but I could not find anything that would give me an optimal solution.
To quickly address the three different methods I use:
- Evaluate each line as a tuple of the dictionary values. Comparable in speed with something like [[el['id'], el['y'], el['value']] for el in x].
- Let pandas do the casting for you. Extremely slow.
- Cast three separate lists. Since the allocation of lists in list comprehensions is way slower (compare [[el['id']] for el in x] to [el['id'] for el in x]), this seems to be the currently best-performing solution.
According to the articles I found, list comprehensions outperform any Python-native method using .append(), but I might add an example and timing for that later.
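As a rough sketch of the .append() comparison mentioned above (not the author's code, and no timings are claimed here), the following puts a single-pass append loop next to the three list comprehensions. The function names with_append and with_comprehensions and the sample data are made up for illustration; results will depend on the Python version and machine.

```python
import random
import timeit

# Hypothetical sample data: one distinct dictionary per row.
records = [
    {'id': random.randint(0, 5), 'y': random.randint(0, 5), 'value': random.randint(0, 500)}
    for _ in range(60000)
]


def with_append(rows):
    """Collect the three columns in a single pass using list.append()."""
    ids, ys, values = [], [], []
    for el in rows:
        ids.append(el['id'])
        ys.append(el['y'])
        values.append(el['value'])
    return ids, ys, values


def with_comprehensions(rows):
    """Collect the three columns with one list comprehension per key."""
    return ([el['id'] for el in rows],
            [el['y'] for el in rows],
            [el['value'] for el in rows])


if __name__ == "__main__":
    # Print the measured time for each variant; numbers vary per machine.
    for fn in (with_append, with_comprehensions):
        print(fn.__name__, timeit.timeit(lambda: fn(records), number=20))
```

Both functions return the same three lists, so only the measured time differs between them.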
The benchmarks are as follows:
import timeit as ti

# 2.107 seconds
print(ti.timeit("z = [tuple(el.values()) for el in x]",
                setup="import random; x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*60000",
                number=100))

# 8.93 seconds
print(ti.timeit("z = pd.DataFrame(x)",
                setup="import pandas as pd; import random; x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*60000",
                number=100))

# 0.717 seconds
print(ti.timeit("z1 = [el['id'] for el in x]; z2 = [el['y'] for el in x]; z3 = [el['value'] for el in x]",
                setup="import random; x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*60000",
                number=100))
I also include the "raw code" for the three snippets:
import random
import pandas as pd

if __name__ == "__main__":
    # ignore the fact that this actually isn't random for individual values
    x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*60000

    # first method
    z1 = [tuple(el.values()) for el in x]

    # second method
    z2 = pd.DataFrame(x)

    # third method
    z_3a = [el['id'] for el in x]
    z_3b = [el['y'] for el in x]
    z_3c = [el['value'] for el in x]
The question is whether there is any significant improvement on this (maybe by using a specialized library I don't know of, or some trick with numpy, etc.) that would easily improve the speed. I'm currently assuming that, following the 80/20 principle, it is unlikely I'll get more performance out of this without spending a lot more effort on it...
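One further pure-Python variant that could be added to the benchmark is to transpose the records with operator.itemgetter and zip. This is only a sketch of the idea, not a measured improvement; the helper name split_columns is made up for illustration, and it yields tuples rather than lists.

```python
from operator import itemgetter


def split_columns(records, keys=('id', 'y', 'value')):
    """Transpose a list of dicts into one tuple per key."""
    if not records:
        # zip(*[]) would yield nothing, so return empty columns explicitly.
        return tuple(() for _ in keys)
    getter = itemgetter(*keys)  # returns a tuple (id, y, value) per record
    return tuple(zip(*map(getter, records)))


sample = [{'id': 0, 'y': 4, 'value': 25}, {'id': 0, 'y': 2, 'value': 254}]
ids, ys, values = split_columns(sample)
print(ids, ys, values)  # (0, 0) (4, 2) (25, 254)
```

Whether this beats three separate comprehensions would have to be timed on the real data, for example with the same timeit setup as above.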
- Hi, it looks like you're asking for a comparative review, however it's hard to tell this from the code that you've posted. Would you be willing to post the three different methods you're using as if it were a Python script, as opposed to three timeits? – Peilonrayz, Sep 4, 2018 at 12:25
- Your example is missing import statements. – Mast, Sep 4, 2018 at 12:25
- I added the relevant parts as an "explicit code fragment" as well. – dennlinger, Sep 4, 2018 at 12:45
1 Answer
[Deleted by the author: Does the scipy.sparse.coo_matrix() function accept three different lists as parameters? Maybe I'm missing something, but your "fastest" method is not actually the fastest, since you'd still need some conversion to pass it to the function anyhow.]
I think your timing is off
For this one
z = [tuple(el.values()) for el in x]
you say: "evaluate each line as a tuple of the dictionary values. Comparable in speed with something like [[el['id'], el['y'], el['value']] for el in x]."
I have tested this myself, and it seems the second suggestion is about twice as fast as the first method.
Setup: x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*100000
z = [tuple(el.values()) for el in x]
3.070572477
z1 = [[el['id'], el['y'], el['value']] for el in x]
1.5449943620000006
This could also be a generator
def gen_vals(d):
    for el in d:
        yield [el['id'], el['y'], el['value']]
However, this did not improve the speed; I mention it just to bounce some ideas around.
Setup: x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*100000
z1 = [el['id'] for el in x]; z2 = [el['y'] for el in x]; z3 = [el['value'] for el in x]
1.244481049000001
z1 = [[el['id'], el['y'], el['value']] for el in x]
1.5449943620000006
z1 = list(gen_vals(x))
1.8347137390000015
z1 = [a for a in gen_vals(x)]
2.018111815000001
Timing code for reference
import timeit as ti
import random
def gen_vals(d):
    for el in d:
        yield [el['id'], el['y'], el['value']]

# SETUP
x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*100000
print("Setup: x = [{'id':random.randint(0,5),'y':random.randint(0,5), 'value':random.randint(0,500)}]*100000")

print("z = [tuple(el.values()) for el in x]")
print(ti.timeit("z = [tuple(el.values()) for el in x]",
                setup="from __main__ import x",
                number=100))

print("z = pd.DataFrame(x)")
print(ti.timeit("z = pd.DataFrame(x)",
                setup="import pandas as pd; from __main__ import x",
                number=100))

print("z1 = [el['id'] for el in x]; z2 = [el['y'] for el in x]; z3 = [el['value'] for el in x]")
print(ti.timeit("z1 = [el['id'] for el in x]; z2 = [el['y'] for el in x]; z3 = [el['value'] for el in x]",
                setup="from __main__ import x",
                number=100))

print("z1 = [[el['id'], el['y'], el['value']] for el in x]")
print(ti.timeit("z1 = [[el['id'], el['y'], el['value']] for el in x]",
                setup="from __main__ import x",
                number=100))

print("z1 = list(gen_vals(x))")
print(ti.timeit("z1 = list(gen_vals(x))",
                setup="from __main__ import x, gen_vals",
                number=100))

print("z1 = [a for a in gen_vals(x)]")
print(ti.timeit("z1 = [a for a in gen_vals(x)]",
                setup="from __main__ import x, gen_vals",
                number=100))
- The constructor for coo_matrix() (for a "from scratch" creation) accepts the form (<value_list>, (<x_coordinate_list>, <y_coordinate_list>)), so those are generally three separate lists. Also, thanks for the detailed comparison, including the idea with the generator! – dennlinger, Sep 5, 2018 at 8:53
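To make the argument shape from the comment above concrete, here is a minimal sketch that builds the (values, (rows, cols)) tuple from the incoming records. The helper name to_coo_args is hypothetical, and treating 'id' as the row index and 'y' as the column index is an assumption; the thread does not say which key maps to which axis.

```python
def to_coo_args(records):
    """Build the (data, (row, col)) tuple that coo_matrix accepts."""
    values = [el['value'] for el in records]
    rows = [el['id'] for el in records]   # assumption: 'id' is the row index
    cols = [el['y'] for el in records]    # assumption: 'y' is the column index
    return values, (rows, cols)


sample = [{'id': 0, 'y': 4, 'value': 25}, {'id': 0, 'y': 2, 'value': 254}]
args = to_coo_args(sample)
print(args)  # ([25, 254], ([0, 0], [4, 2]))

# With SciPy installed, the tuple can be handed over directly:
# from scipy.sparse import coo_matrix
# matrix = coo_matrix(args)
```

Keeping the three lists separate avoids any regrouping step before the constructor call, which is why the third method in the question fits this use case.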