A custom Pandas dataframe to_string method

Question 1

Oftentimes I find myself converting pandas.DataFrame objects to lists of formatted row strings, so I can print the rows into, e.g. a tkinter.Listbox. To do this, I have been utilizing pandas.DataFrame.to_string. There is a lot of nice functionality built into the method, but when the number of dataframe rows/columns gets relatively large, to_string starts to tank.

Below I implement a custom pandas.DataFrame class with a few added methods for returning formatted row lines. I am looking to improve upon the get_lines_fast_struct method.

# Written for Python 2, where map() returns a list.
import pandas
class DataFrame2(pandas.DataFrame):
    def __init__( self, *args, **kwargs ):
        pandas.DataFrame.__init__(self, *args, **kwargs)
    def get_lines_standard(self):
        """standard way to convert pandas dataframe
        to lines with formatted column spacing"""
        lines = self.to_string(index=False).split('\n')
        return lines
    def get_lines_fast_unstruct(self):
        """ lighter version of pandas.DataFrame.to_string()
        with no special spacing format"""
        df_recs = self.to_records(index=False)
        col_titles = [' '.join(list(self))]
        col_data = map(lambda rec:' '.join( map(str,rec) ),
                       df_recs.tolist())
        lines = col_titles + col_data
        return lines
    def get_lines_fast_struct(self,col_space=1):
        """ lighter version of pandas.DataFrame.to_string()
        with special spacing format"""
        df_recs = self.to_records(index=False) # convert dataframe to array of records
        str_data = map(lambda rec: map(str,rec), df_recs ) # map each element to string
        self.space = map(lambda x:len(max(x,key=len))+col_space, # returns the max string length in each column as a list
                         zip(*str_data))
        col_titles = [self._format_line(list(self))]
        col_data = [self._format_line(row) for row in str_data ]
        lines = col_titles + col_data
        return lines
    def _format_line(self, row_vals):
        """row_vals: list of strings.
        Adds variable amount of white space to each
        list entry and returns a single string"""
        line_val_gen = ( ('{0: >%d}'%self.space[i]).format(entry) for i,entry in enumerate(row_vals) ) # takes dataframe row entries and adds white spaces based on a format
        line = ''.join(line_val_gen)
        return line

Here I make some test data

import random
import numpy
#SOME TEST DATA
df = DataFrame2({'A':numpy.random.randint(0,1000,1000),
                 'B':numpy.random.random(1000),
                 'C':[random.choice(['EYE', '<3', 'PANDAS', '0.16'])
                      for _ in range(1000)]})

Method outputs

df.get_lines_standard()
#[u' A B C',
# u' 504 0.924385 <3',
# u' 388 0.285854 0.16',
# u' 984 0.254156 0.16',
# u' 446 0.472621 PANDAS']
# ...
df.get_lines_fast_struct()
#[' A B C',
# ' 504 0.9243853594 <3',
# ' 388 0.285854082778 0.16',
# ' 984 0.254155910401 0.16',
# ' 446 0.472621088021 PANDAS']
# ...
df.get_lines_fast_unstruct()
#['A B C',
# '504 0.9243853594 <3',
# '388 0.285854082778 0.16',
# '984 0.254155910401 0.16',
# '446 0.472621088021 PANDAS']
# ...

Timing results

In [262]: %timeit df.get_lines_standard()
10 loops, best of 3: 70.3 ms per loop
In [263]: %timeit df.get_lines_fast_struct()
100 loops, best of 3: 15.4 ms per loop
In [264]: %timeit df.get_lines_fast_unstruct()
100 loops, best of 3: 2.3 ms per loop

Answer

import pandas
np = pandas.np

Here you are using the numpy that pandas happens to import, which can lead to confusion. There is a widely agreed convention for importing pandas and numpy:

import pandas as pd
import numpy as np

Importing numpy yourself does not load the module twice, because imports are cached. Since pandas has already imported numpy, your own import only costs a lookup in sys.modules, and it makes the code much more readable.
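The caching claim can be checked with a small sketch. It uses the standard-library json module as a stand-in for numpy so that it runs without third-party packages; the variable names are only illustrative.

```python
import sys
import json            # first import: the module is loaded and cached
import json as again   # second import: only a lookup in sys.modules

# Both names refer to the single cached module object.
same_object = again is json is sys.modules["json"]
print(same_object)  # True
```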

At the end you use random.choice() but you never imported random.

In get_lines_standard() you first convert the complete DataFrame to a string and then split it on the line breaks. In your example you then slice off the first 5 lines to display. The way your code works, there is no way to show only the top 5 rows without rendering the complete DataFrame, and this applies to all 3 methods. Just to demonstrate the difference between slicing before and after rendering (using the random data generated at the end of your code, but with 10k rows instead of 1k):

# both calls have the same output:
%timeit df.to_string(index=False).split('\n')[:5]
1 loops, best of 3: 1.51 s per loop
%timeit df[:5].to_string(index=False).split('\n')
100 loops, best of 3: 3.38 ms per loop
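Outside of IPython, the slice-first idea might look like the sketch below. The data is made up for illustration (the column names and sizes are assumptions, not the asker's data). One caveat: pandas works out column widths from the rows it actually renders, so the sliced output can be narrower than the first lines of a full render.

```python
import pandas as pd

# Hypothetical data: 10,000 rows, but only the first 5 are needed.
df = pd.DataFrame({"A": range(10000), "B": ["x"] * 10000})

# Slice first, then render: only 5 rows are converted to text.
first_lines = df.head(5).to_string(index=False).split("\n")

for line in first_lines:
    print(line)
```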

PS: I don't want to pep8ify you, but please don't line up your equal signs.

/edit:

Ok, let's focus on get_lines_fast_struct(). You are doing things manually for which there actually exist tools:

Creating a copy of a DataFrame with the same values as strings can be accomplished by str_df = self.astype(str)
The maximum cell length per column of such a dataframe can be determined by self.space = [str_df[c].map(len).max() for c in str_df.columns]
For col_data you use a list comprehension that just calls a method for each element, which is basically just map()
In _format_line() you pad the strings with spaces on the left until they have length n+1, with n being the maximum column length, and you even mix 2 styles of string formatting (old and new) to do it. str.rjust() does the same thing and might be faster.
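To make the last point concrete, here is a minimal sketch comparing the mixed-style padding with str.rjust() on a single cell. The width and cell values are arbitrary examples.

```python
width = 8
cell = "PANDAS"

# Original approach: build a new-style format spec using old-style % interpolation.
mixed = ("{0: >%d}" % width).format(cell)

# The same right-justification with a single method call.
justified = cell.rjust(width)

# A pure new-style alternative with a nested width field.
nested = "{0:>{1}}".format(cell, width)

print(repr(mixed), repr(justified), repr(nested))
```

All three produce the same padded string, so the choice comes down to readability and speed.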

With all those things in mind, the code might look like this:

def get_lines_fast_struct2(self, col_space=1):
    # note: col_space is accepted but not yet applied to the widths
    str_df = self.astype(str)
    self.space = [str_df[c].map(len).max() for c in str_df.columns]
    col_titles = map(self._format_line2, [self.columns])
    col_data = map(self._format_line2, str_df.to_records(index=False))
    return col_titles + col_data  # Python 2: map() returns lists
def _format_line2(self, row_vals):
    return "".join(cell.rjust(width) for (cell, width) in zip(row_vals, self.space))

Let's compare this with the original in terms of speed and equality:

In [160]: %timeit df.get_lines_fast_struct()
100 loops, best of 3: 11.3 ms per loop
In [161]: %timeit df.get_lines_fast_struct2()
100 loops, best of 3: 9.78 ms per loop
In [162]: df.get_lines_fast_struct() == df.get_lines_fast_struct2()
Out[162]: True

Maybe there is even a better way with more pandas magic involved, but I am not that experienced with pandas yet.

Comment

Thanks for the reply, and for the tips! I don't quite understand why you focus on my slicing of the output with func()[:5]; that was merely to display the formatting results of each method without printing the full list. My main concern is as stated in the post: I am looking to improve upon the get_lines_fast_struct method. Do you have any suggestions there? Anyhow, thanks again for offering tips.

Comment

Hey there, I guess I got caught up and missed your question bit. I edited my answer and added a focus on get_lines_fast_struct().

Comment

I just noticed I forgot to use col_space in get_lines_fast_struct2(), but I guess you'll figure it out.
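One way col_space could be folded in is sketched below as a standalone function rather than a DataFrame2 method. The name format_lines and the sample data are hypothetical, and the code targets Python 3 (list comprehensions instead of map concatenation). As in the original, widths are taken from the data cells only, not from the column titles.

```python
import pandas as pd

def format_lines(df, col_space=1):
    """Return the header and rows of df as right-justified strings.

    Hypothetical helper: a sketch of get_lines_fast_struct2 with
    col_space applied, not the answerer's final code.
    """
    str_df = df.astype(str)
    # Width per column: longest data cell plus col_space padding.
    widths = [int(str_df[c].map(len).max()) + col_space
              for c in str_df.columns]

    def fmt(cells):
        return "".join(str(cell).rjust(w) for cell, w in zip(cells, widths))

    header = [fmt(str_df.columns)]
    rows = [fmt(row) for row in str_df.itertuples(index=False, name=None)]
    return header + rows

sample = pd.DataFrame({"A": [1, 22], "B": ["x", "yyy"]})
for line in format_lines(sample):
    print(line)
```

A column title longer than every cell in its column would not be truncated by rjust(), so it would push that line out of alignment; taking the title length into account when computing widths would avoid that.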

moritzbracht · Accepted Answer · 2015-08-09 01:34:23Z

