I have a data set with close to 6 million rows of user input. Specifically, users were supposed to type in their email addresses, but because there was no pattern validation in place, we have a few months' worth of interesting input.
I've come up with a script that counts the characters in every row, then combines the counts so I can see the distribution of all characters. This lets me do further analysis and get a sense of the most common mistakes so I can begin to clean the data. My question is: how would you optimize the following for speed?
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from collections import Counter
df = pd.DataFrame({
    'input': [
        'Captain Jean-Luc Picard <[email protected]>',
        '[email protected]',
        'geordi @starfleet.com',
        '[email protected]',
        'rik#[email protected]',
    ],
    'metric1': np.random.randn(5).cumsum(),
    'metric2': np.random.randn(5),
})
l = []
for i in range(len(df.index.values)):
    # .loc replaces the .ix indexer, which was removed in pandas 1.0
    l.append(dict(Counter(df.loc[i, 'input'])))
dist = pd.DataFrame(l).fillna(0)
dist = dist.sum(axis=0)
print(dist)
I've run this over about a third of my dataset, and it takes a while. It's still tolerable; I'm just curious whether anyone could make it faster.
2 Answers
Since you are already using Counter, it should be faster to do the whole job with it:
c = Counter()
for i in range(len(df.index.values)):
    # .loc replaces the .ix indexer, which was removed in pandas 1.0
    c.update(df.loc[i, 'input'])
for k, v in c.items():
    print(k, v)
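One thing that may bite on the real data: missing rows typically show up in pandas as NaN (a float), and Counter.update raises a TypeError on anything that is not iterable. A minimal sketch of a guard, assuming you simply want to skip such rows; the helper name count_chars and the sample list are made up for illustration:

```python
from collections import Counter

def count_chars(values):
    """Count characters across an iterable, skipping anything that is not a str."""
    counts = Counter()
    for value in values:
        if isinstance(value, str):  # NaN/None from missing rows are skipped
            counts.update(value)
    return counts

# Made-up sample; with the DataFrame above you would pass df['input'] instead.
sample = ['geordi @starfleet.com', float('nan'), None, 'ab#a']
char_counts = count_chars(sample)
print(char_counts.most_common(3))
```

Iterating over the column directly also avoids the per-row label lookup, though I have not timed the difference on a data set of your size.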
This is the shortest possibility:
from collections import Counter
dist = Counter(''.join(df.input.tolist()))
which results in
Counter({'a': 14, 'e': 14, 't': 13, 'r': 11, 'c': 8, 'o': 7, '.': 6, 'i': 6, '@': 5, 'd': 5, 'f': 5, 'm': 5, 'l': 5, 's': 5, ' ': 4, 'n': 4, 'p': 2, '#': 1, '-': 1, '<': 1, '>': 1, 'C': 1, 'J': 1, 'L': 1, 'P': 1, 'g': 1, 'k': 1, 'u': 1})
What ''.join(df.input.tolist()) does:
>>> ''.join(df.input.tolist())
'Captain Jean-Luc Picard <[email protected]>[email protected] @[email protected]#[email protected]'
It joins all the strings in the list into one long string. This single string can then be handed over to Counter.
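With around 6 million rows, the joined string is a second full copy of the column in memory. If that becomes a problem, a possible alternative (not part of the approach above, and not benchmarked here) is itertools.chain.from_iterable, which feeds the characters to Counter one string at a time. The sample list below is made up for illustration:

```python
from collections import Counter
from itertools import chain

# Made-up sample; with the DataFrame above you would pass df['input'] instead.
strings = ['geordi @starfleet.com', 'ab#a']

# chain.from_iterable yields the characters of each string in turn,
# so no concatenated copy of the whole column is created.
chained = Counter(chain.from_iterable(strings))

# The join-based version gives the same counts.
joined = Counter(''.join(strings))
print(chained == joined)
```

Which of the two is faster likely depends on the data, so it is worth timing both on a slice of the real input.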
dist is now a Counter object, which can be used just like a regular dictionary. If you need a plain dict, you can convert it with dict(dist).
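A short sketch of what working with the result can look like, using one made-up input string rather than the full column:

```python
from collections import Counter

dist = Counter('geordi @starfleet.com')  # made-up single input for illustration

# Dictionary-style access; a missing key gives 0 instead of raising KeyError.
print(dist['e'], dist['#'])

# Characters sorted by descending count, limited to the top two.
top = dist.most_common(2)
print(top)

# Plain dict, in case some other API insists on one.
as_dict = dict(dist)
```

For spotting common typing mistakes, most_common() without an argument returns every character ordered by frequency, which is probably the view you want to look at first.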