I am using the following code to import a database table into a DataFrame:
def import_db_table(chunk_size, offset):
    dfs_ct = []
    j = 0
    start = dt.datetime.now()
    df = pd.DataFrame()
    while True:
        sql_ct = "SELECT * FROM my_table limit %d offset %d" % (chunk_size, offset)
        dfs_ct.append(psql.read_sql_query(sql_ct, connection))
        offset += chunk_size
        if len(dfs_ct[-1]) < chunk_size:
            break
        df = pd.concat(dfs_ct)
        # Convert columns to datetime
        columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6',
                   'col7', 'col8', 'col9', 'col10', 'col11', 'col12',
                   'col13', 'col14', 'col15']
        for column in columns:
            df[column] = pd.to_datetime(df[column], errors='coerce')
        # Remove the uninteresting columns
        columns_remove = ['col42', 'col43', 'col67', 'col52', 'col39', 'col48', 'col49', 'col50', 'col60', 'col61', 'col62', 'col63', 'col64', 'col75', 'col80']
        for c in df.columns:
            if c not in columns_remove:
                df = df.drop(c, axis=1)
        j += 1
        print('{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j * chunk_size))
    return df
I am calling it with:
df = import_db_table(100000, 0)
This seems to be very slow: it starts by importing 100000 rows in about 7 seconds, but after the first million rows each additional chunk takes 40-50 seconds or more. Could this be improved somehow? I am using PostgreSQL and Python 3.5. The output looks like this:
7 seconds: completed 100000 rows
17 seconds: completed 200000 rows
30 seconds: completed 300000 rows
47 seconds: completed 400000 rows
69 seconds: completed 500000 rows
92 seconds: completed 600000 rows
121 seconds: completed 700000 rows
153 seconds: completed 800000 rows
188 seconds: completed 900000 rows
228 seconds: completed 1000000 rows
271 seconds: completed 1100000 rows
318 seconds: completed 1200000 rows
368 seconds: completed 1300000 rows
422 seconds: completed 1400000 rows
480 seconds: completed 1500000 rows
540 seconds: completed 1600000 rows
605 seconds: completed 1700000 rows
674 seconds: completed 1800000 rows
746 seconds: completed 1900000 rows
1 Answer
Your code
def import_db_table(chunk_size, offset):
It doesn't look like you need to pass offset to this function: all it does is give you the ability to read from a given row down to the bottom of the table. I would omit it, or at least give it a default value of 0. It also looks like connection needs to be one of the parameters.
    dfs_ct = []
    j = 0
    start = dt.datetime.now()
    df = pd.DataFrame()
    while True:
        sql_ct = "SELECT * FROM my_table limit %d offset %d" % (chunk_size, offset)
        dfs_ct.append(psql.read_sql_query(sql_ct, connection))
        offset += chunk_size
        if len(dfs_ct[-1]) < chunk_size:
            break
As written, the body of the while loop should stop here. You can also get better performance by producing the query results as a generator instead of collecting them in a list. For example:
Code suggestions
def generate_df_pieces(connection, chunk_size, offset=0):
    while True:
        sql_ct = "SELECT * FROM my_table limit %d offset %d" % (chunk_size, offset)
        df_piece = psql.read_sql_query(sql_ct, connection)
        # don't yield an empty data frame
        if not df_piece.shape[0]:
            break
        yield df_piece
        # don't make an unnecessary database query
        if df_piece.shape[0] < chunk_size:
            break
        offset += chunk_size
Then you can call:
df = pd.concat(generate_df_pieces(connection, chunk_size, offset=offset))
The function pd.concat can take any sequence or iterable of dataframes. Making that sequence a generator like this is more efficient than growing a list first, as you don't need to keep more than one df_piece in memory until you actually combine them into the final, larger frame.
Back to your code
        df = pd.concat(dfs_ct)
You're resetting the entire dataframe each time and rebuilding it anew from the whole list! If this were outside of the loop it would make sense.
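For illustration, here is a minimal sketch of that restructuring: it keeps your list-based approach but fetches all the chunks first and builds the frame only once, after the loop finishes (connection is assumed to be an open database connection, as in your question):

import pandas as pd
import pandas.io.sql as psql

def import_db_table(connection, chunk_size):
    dfs_ct = []
    offset = 0
    while True:
        sql_ct = "SELECT * FROM my_table limit %d offset %d" % (chunk_size, offset)
        dfs_ct.append(psql.read_sql_query(sql_ct, connection))
        offset += chunk_size
        if len(dfs_ct[-1]) < chunk_size:
            break
    # The full dataframe is assembled exactly once, after all chunks are fetched
    df = pd.concat(dfs_ct)
    return df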
        # Convert columns to datetime
        columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6',
                   'col7', 'col8', 'col9', 'col10', 'col11', 'col12',
                   'col13', 'col14', 'col15']
        for column in columns:
            df[column] = pd.to_datetime(df[column], errors='coerce')
        # Remove the uninteresting columns
        columns_remove = ['col42', 'col43', 'col67', 'col52', 'col39', 'col48', 'col49', 'col50', 'col60', 'col61', 'col62', 'col63', 'col64', 'col75', 'col80']
        for c in df.columns:
            if c not in columns_remove:
                df = df.drop(c, axis=1)
This part could be done either inside the loop / generator function or outside it. Dropping columns is a good thing to place inside, since then the big dataframe you build never needs to be wider than you want. If you're able to request only the columns you want in the SQL query itself, that would be even better, since less data has to be sent over the connection.
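As a sketch of that last idea, here is a variant of the generator above that asks the database only for the columns you actually need; the names in wanted are made up for illustration:

def generate_df_pieces(connection, chunk_size, offset=0,
                       wanted=('col1', 'col2', 'col3')):
    # 'wanted' is a hypothetical whitelist of the columns you really need
    col_list = ', '.join(wanted)
    while True:
        sql_ct = "SELECT %s FROM my_table limit %d offset %d" % (col_list, chunk_size, offset)
        df_piece = psql.read_sql_query(sql_ct, connection)
        if not df_piece.shape[0]:
            break
        yield df_piece
        if df_piece.shape[0] < chunk_size:
            break
        offset += chunk_size

This way the unwanted columns never leave the database, and each per-chunk frame is already as narrow as the final result.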
Another point to make about df.drop is that by default it returns a new dataframe. Use inplace=True so you don't copy your huge dataframe. It also accepts a list of columns to drop:
Code suggestions
df.drop(columns_remove, inplace = True, axis = 1)
removes those columns in one call, without looping and without copying df over and over. (Note that your loop actually does the opposite of its comment: it keeps only the columns listed in columns_remove and drops everything else, so make sure the list and the condition match your intent.) You can also use:
columns_remove_numbers = [ ... ] # list the column numbers
columns_remove = df.columns[columns_remove_numbers]
So you don't have to type all those strings.
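For example (the positions below are made up for illustration):

columns_remove_numbers = [3, 7, 12]                  # hypothetical column positions
columns_remove = df.columns[columns_remove_numbers]  # look up the matching labels
df.drop(columns_remove, axis=1, inplace=True)        # drop them in one call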
Back to your code
        j += 1
        print('{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j * chunk_size))
If you use the generator function version of this, you could put this inside that function to keep track of the performance.
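A sketch of what that could look like, reusing the generator above and the dt/start timing from your original code:

import datetime as dt

def generate_df_pieces(connection, chunk_size, offset=0):
    start = dt.datetime.now()
    rows_done = 0
    while True:
        sql_ct = "SELECT * FROM my_table limit %d offset %d" % (chunk_size, offset)
        df_piece = psql.read_sql_query(sql_ct, connection)
        if not df_piece.shape[0]:
            break
        rows_done += df_piece.shape[0]
        # report progress as each chunk arrives
        print('{} seconds: completed {} rows'.format(
            (dt.datetime.now() - start).seconds, rows_done))
        yield df_piece
        if df_piece.shape[0] < chunk_size:
            break
        offset += chunk_size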
- Great, thanks for this. I am not really sure where I should call df = pd.concat(generate_df_pieces(connection, chunk_size, offset=offset)): inside the generate_df_pieces function or outside? If inside, isn't it a recursive function? Also, if I do that with generators, when I try to apply some pandas operations on the generated dataframe, I get errors that the functions don't exist, since I am dealing with a generator rather than a pandas dataframe. – user137913, May 8, 2017
- Put that outside the function; you don't want it to be recursive. I tried to line up the indentation of the code so it would fit together. As for the error, what version of pandas do you have? It looks like support for generators in pd.concat was added in 0.15.2. – aes, May 9, 2017
- Actually, it is fine now; I don't know why I had that error earlier. Thank you, it works much better now! – user137913, May 9, 2017
Comments on the question:
- Don't you need to empty dfs_ct every iteration of the while loop? Otherwise it looks like you add all previously added entries as well as the next chunk. This would explain why it gets slower and slower...
- You could do df_chunk = psql.read_sql_query(sql_ct, connection); # check for abort condition; df = pd.concat([df, df_chunk]) inside the loop. Doing it outside the loop will be faster (but will have a list of all chunk data frames in memory, just like your current code). Doing it inside the loop has the added overhead of calling the function every time, but only ever has one chunk in memory (and the total dataframe).
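As a minimal sketch of the inside-the-loop variant that comment describes (assuming pd, psql, connection and chunk_size are set up as in the question):

df = pd.DataFrame()
offset = 0
while True:
    sql_ct = "SELECT * FROM my_table limit %d offset %d" % (chunk_size, offset)
    df_chunk = psql.read_sql_query(sql_ct, connection)
    if not df_chunk.shape[0]:           # abort condition: no more rows
        break
    df = pd.concat([df, df_chunk])      # grow the running result one chunk at a time
    if df_chunk.shape[0] < chunk_size:  # last, short chunk reached
        break
    offset += chunk_size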