I have a dataframe (read from a CSV that was exported from MySQL) with several columns, and one of them consists of a string that is the representation of a JSON object. The data looks like:
    id  email_id           provider  raw_data                     ts
    1   [email protected]  A         {'a':'A', 'b':'B', 'c':'C'}  2019-23-08 00:00:00
And my desired output is:
    email_id           a  b  c
    [email protected]  A  B  C
What I have coded so far is the following:
    import pandas as pd
    import ast

    df = pd.read_csv('data.csv')
    df1 = pd.DataFrame()
    for i in range(len(df)):
        dict_1 = ast.literal_eval(df['raw_data'][i])
        df1 = df1.append(pd.Series(dict_1), ignore_index=True)
    pd.concat([df['email_id'], df1], axis=1)
This works, but it has a very big problem: it is extremely slow (it takes hours for 100k rows). How could I make this operation faster?
1 Answer
Finally I got an amazing improvement thanks to Stack Overflow, regarding two things: collecting the rows in a plain list and building the DataFrame once at the end, and using df.at instead of df.loc for single-cell access:

https://stackoverflow.com/questions/10715965/add-one-row-to-pandas-dataframe
https://stackoverflow.com/questions/37757844/pandas-df-locz-x-y-how-to-improve-speed
Also, as hpaulj pointed out, changing to json.loads slightly increases the performance.
The run time went from 16 hours to 30 seconds.
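One caveat worth flagging here (my aside, not part of the original answer): json.loads only accepts strict JSON with double-quoted strings, while ast.literal_eval also accepts Python-style single quotes like the sample above, so the switch only works if the raw data is actually valid JSON. A minimal illustration:

    import ast
    import json

    py_style = "{'a': 'A', 'b': 'B'}"  # Python literal, single quotes
    strict = '{"a": "A", "b": "B"}'    # strict JSON, double quotes

    ast.literal_eval(py_style)  # -> {'a': 'A', 'b': 'B'}
    json.loads(strict)          # -> {'a': 'A', 'b': 'B'}
    # json.loads(py_style) raises json.JSONDecodeError, because strings
    # in JSON must be double-quoted.

With that caveat noted, the rewritten loop: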
    import json
    import pandas as pd

    df = pd.read_csv('data.csv')

    # Collect the parsed dicts in a plain list and build the DataFrame
    # once at the end, instead of appending inside the loop.
    row_list = []
    for i in range(len(df)):
        # df.at is much faster than df.loc for single-cell access.
        row_list.append(json.loads(df.at[i, 'raw_data']))
    df1 = pd.DataFrame(row_list)
    df2 = pd.concat([df['email_id'], df1], axis=1)
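A further step in the same direction (my sketch, not from the original answer, assuming the raw_data strings are valid JSON): skip the explicit index loop entirely by mapping json.loads over the column and expanding the resulting dicts in one call:

    import json
    import pandas as pd

    df = pd.read_csv('data.csv')

    # Parse every JSON string in a single pass over the column; this
    # drops the per-row df.at indexing from Python-level code.
    parsed = df['raw_data'].map(json.loads)

    # Expand the list of dicts into columns and attach email_id.
    df2 = pd.concat([df['email_id'], pd.DataFrame(parsed.tolist())], axis=1)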
- "got an amazing improvement thanks to stack overflow" Do you have a link to that answer? (comment, Aug 28, 2019)
- Yes, I have edited the answer to add those links. (Javier Lopez Tomas, Aug 28, 2019)
- Would json.loads be any faster? With a more limited structure it might. Another thing to consider is doing a string join on all those raw_data strings, and calling loads once. I haven't worked with json enough to know what's fast or slow in its parsing. (hpaulj)
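That join-once suggestion could look like the following (a sketch under the assumption that every raw_data value is a single valid JSON object):

    import json
    import pandas as pd

    df = pd.read_csv('data.csv')

    # Concatenate the per-row JSON objects into one JSON array string,
    # so the parser is invoked once instead of once per row.
    records = json.loads('[' + ','.join(df['raw_data']) + ']')

    df2 = pd.concat([df['email_id'], pd.DataFrame(records)], axis=1)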