I have a dataframe (read from a CSV that was exported from MySQL) with several columns, and one of them consists of a string that is the representation of a JSON object. The data looks like:
    id  email_id           provider  raw_data                     ts
    1   [email protected]  A         {'a':'A', 'b':'B', 'c':'C'}  2019-23-08 00:00:00
And my desired output is:
    email_id           a  b  c
    [email protected]  A  B  C
What I have coded so far is the following:
    import pandas as pd
    import ast

    df = pd.read_csv('data.csv')
    df1 = pd.DataFrame()
    for i in range(len(df)):
        dict_1 = ast.literal_eval(df['raw_data'][i])
        df1 = df1.append(pd.Series(dict_1), ignore_index=True)
    pd.concat([df['email_id'], df1], axis=1)
This works, but it has a very big problem: it is extremely slow (it takes hours for 100k rows). How could I make this operation faster?
1 Answer
Finally I got an amazing improvement thanks to Stack Overflow, regarding two things: collecting the rows in a plain list and building the DataFrame once at the end, and using df.at instead of df.loc for single-cell access:

https://stackoverflow.com/questions/10715965/add-one-row-to-pandas-dataframe
https://stackoverflow.com/questions/37757844/pandas-df-locz-x-y-how-to-improve-speed
Also, as hpaulj pointed out, changing to json.loads slightly increases the performance.
The run time went from 16 hours to 30 seconds.
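One caveat worth flagging here (my aside, not part of the original answer): json.loads only accepts strict JSON with double-quoted strings, while ast.literal_eval also accepts Python-style single quotes like the sample above, so the switch only works if the raw data is actually valid JSON. A minimal illustration:

    import ast
    import json

    py_style = "{'a': 'A', 'b': 'B'}"  # Python literal, single quotes
    strict = '{"a": "A", "b": "B"}'    # strict JSON, double quotes

    ast.literal_eval(py_style)  # -> {'a': 'A', 'b': 'B'}
    json.loads(strict)          # -> {'a': 'A', 'b': 'B'}
    # json.loads(py_style) raises json.JSONDecodeError, because strings
    # in JSON must be double-quoted.

With that caveat noted, the rewritten loop: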
    import json
    import pandas as pd

    df = pd.read_csv('data.csv')

    # Collect the parsed dicts in a plain list and build the DataFrame
    # once at the end, instead of appending inside the loop.
    row_list = []
    for i in range(len(df)):
        # df.at is much faster than df.loc for single-cell access.
        row_list.append(json.loads(df.at[i, 'raw_data']))
    df1 = pd.DataFrame(row_list)
    df2 = pd.concat([df['email_id'], df1], axis=1)
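A further step in the same direction (my sketch, not from the original answer, assuming the raw_data strings are valid JSON): skip the explicit index loop entirely by mapping json.loads over the column and expanding the resulting dicts in one call:

    import json
    import pandas as pd

    df = pd.read_csv('data.csv')

    # Parse every JSON string in a single pass over the column; this
    # drops the per-row df.at indexing from Python-level code.
    parsed = df['raw_data'].map(json.loads)

    # Expand the list of dicts into columns and attach email_id.
    df2 = pd.concat([df['email_id'], pd.DataFrame(parsed.tolist())], axis=1)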
- "got an amazing improvement thanks to stack overflow" Do you have a link to that answer? (comment, Aug 28, 2019)
- Yes, I have edited the answer to add those links. (Javier Lopez Tomas, Aug 28, 2019)
- Would json.loads be any faster? With a more limited structure it might. Another thing to consider is doing a string join on all those raw_data strings, and calling loads once. I haven't worked with json enough to know what's fast or slow in its parsing. (hpaulj)
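That join-once suggestion could look like the following (a sketch under the assumption that every raw_data value is a single valid JSON object):

    import json
    import pandas as pd

    df = pd.read_csv('data.csv')

    # Concatenate the per-row JSON objects into one JSON array string,
    # so the parser is invoked once instead of once per row.
    records = json.loads('[' + ','.join(df['raw_data']) + ']')

    df2 = pd.concat([df['email_id'], pd.DataFrame(records)], axis=1)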