I have a list X containing the data produced by N different users, so users are indexed by i = 0, 1, ..., N-1. Each entry X[i] has a different length. I want to normalize the values of each user X[i] over the global dataset X.
This is what I am doing. First of all, I create a 1D list containing all the data:
tmp = list()
for i in range(0, len(X)):
    tmp.extend(X[i])
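(As an aside, a sketch of a shorter equivalent: the flattening loop can be replaced by a single concatenation.)

import numpy as np

# Concatenate every user's sequence into a single 1D float array.
A = np.concatenate([np.asarray(xi, dtype=float) for xi in X])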
then I convert it to an array and remove outliers and NaNs:
import numpy as np

A = np.array(tmp)
A = A[~np.isnan(A)]        # remove NaN
tr = np.percentile(A, 95)  # 95th percentile as the outlier threshold
A = A[A < tr]              # remove outliers
and then I create the histogram of this dataset:

p, x = np.histogram(A, bins=10)  # p: the counts, x: the 11 bin edges
finally, I normalize the values of each user over the histogram I created:
Xn = list()
for i in range(0, len(X)):
    tmp = np.array(X[i])
    tmp = tmp[tmp < tr]
    tmp = np.histogram(tmp, x)
    Xn.append(tmp[0] / sum(tmp[0]))
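One edge case the loop above does not handle: if every value of some user falls above tr, the histogram is all zeros and the division produces NaNs with a runtime warning. A sketch of the same loop with that guard (variable names follow the code above):

Xn = list()
for xi in X:
    vals = np.asarray(xi, dtype=float)
    vals = vals[~np.isnan(vals)]            # drop NaNs per user, as for the global array
    counts, _ = np.histogram(vals, bins=x)  # values beyond the last edge are ignored anyway
    total = counts.sum()
    # Guard against users whose data were entirely filtered out as outliers.
    Xn.append(counts / total if total > 0 else np.zeros_like(counts, dtype=float))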
My data set is very large and this process could take a while. I am wondering if there is a better way to do this, or a package that already does it.
1 Answer
For large datasets, avoid converting between native Python list and NumPy array objects as much as possible. Look at the np.loadtxt and np.genfromtxt functions. They may help you go from saved files of your data to NumPy arrays without having to make any Python lists at all. But suppose you do have a Python list. You shouldn't have to convert all of the data to a NumPy array, and then later convert each user's data to an array separately. I would try something like this, assuming that np.loadtxt doesn't work for you:
doesn't work for you:data_lengths = [len(Xi) for Xi in X] num_users = len(X) max_length = max(data_lengths) all_data = np.zeros(shape = (num_users, max_length), dtype = 'int') for row, Xi in enumerate(X): row_length = len(Xi) all_data[row, 0:row_length] = Xi
From then on, every operation on your data should be an operation on a NumPy array instead of on a Python list. The way I wrote it, it assumes your data are integers, and also that 0 can never occur as a real data point. You can modify the dtype and offset of the call to np.zeros accordingly to meet the requirements of your particular data. This approach will only be good if each user has a number of data points that is not too different from the number for the other users; otherwise, representing your data as a full matrix will be memory-inefficient.
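For instance, if 0 can legitimately occur in your data, one alternative (my suggestion, not part of the answer above) is to pad with NaN and a float dtype, so the padding can never collide with real values:

# NaN-padded float matrix instead of a zero-padded integer one.
all_data = np.full((num_users, max_length), np.nan)
for row, Xi in enumerate(X):
    all_data[row, :len(Xi)] = Xi

valid = ~np.isnan(all_data)  # mask of the real (non-padding) entries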
Use appropriate dtypes. If your data are non-negative integers, for example, then np.bincount() will be much faster than np.histogram. Actually, if your data are integers, then you could probably just use collections.Counter() to make your histograms in native Python, which could also save time.
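A minimal sketch of both ideas, assuming small non-negative integer data (the sample array is made up for illustration):

import collections
import numpy as np

data = np.array([0, 1, 1, 3, 3, 3, 7])

counts = np.bincount(data)                 # counts[v] == number of occurrences of value v
hist = collections.Counter(data.tolist())  # the same tally in native Python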
Most of the time is likely spent in your for i in range... loops, so that's where I would focus on optimizing. Have you considered using pandas? If you can provide a sample of the input data (your list X), it will be easier to make a recommendation that gives you the desired output in less time. Are you trying to get the histogram so that you can graph it?
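To make the pandas suggestion concrete, here is one hedged sketch (the column names and long-format layout are my own choices, not from the answer), reusing the threshold tr and the bin edges x computed earlier:

import pandas as pd

# Long format: one row per observation, tagged with its user index.
df = pd.DataFrame(
    [(i, v) for i, xi in enumerate(X) for v in xi],
    columns=['user', 'value'],
).dropna()

df = df[df['value'] < tr]                # remove outliers, as before
df['bin'] = pd.cut(df['value'], bins=x)  # assign each value to a global bin

# Per-user normalized histogram: the fraction of each user's points per bin.
Xn = (df.groupby('user')['bin']
        .value_counts(normalize=True)
        .unstack(fill_value=0))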