I have a list X containing the data produced by N different users, so users are indexed by i = 0, 1, ..., N-1. Each entry X[i] has a different length. I want to normalize the values of each user X[i] over the global dataset X.
This is what I am doing. First of all, I create a 1D list containing all the data:
tmp = list()
for i in range(0, len(X)):
    tmp.extend(X[i])
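(As an aside, a sketch of a shorter equivalent: the flattening loop can be replaced by a single concatenation.)

import numpy as np

# Concatenate every user's sequence into a single 1D float array.
A = np.concatenate([np.asarray(xi, dtype=float) for xi in X])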
then I convert it to an array and remove outliers and NaNs:
import numpy as np

A = np.array(tmp)
A = A[~np.isnan(A)]        # remove NaN
tr = np.percentile(A, 95)  # 95th percentile as the outlier threshold
A = A[A < tr]              # remove outliers
and then I create the histogram of this dataset:

p, x = np.histogram(A, bins=10)  # p: the counts, x: the 11 bin edges
finally, I normalize the values of each user over the histogram I created:
Xn = list()
for i in range(0, len(X)):
    tmp = np.array(X[i])
    tmp = tmp[tmp < tr]
    tmp = np.histogram(tmp, x)
    Xn.append(tmp[0] / sum(tmp[0]))
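One edge case the loop above does not handle: if every value of some user falls above tr, the histogram is all zeros and the division produces NaNs with a runtime warning. A sketch of the same loop with that guard (variable names follow the code above):

Xn = list()
for xi in X:
    vals = np.asarray(xi, dtype=float)
    vals = vals[~np.isnan(vals)]            # drop NaNs per user, as for the global array
    counts, _ = np.histogram(vals, bins=x)  # values beyond the last edge are ignored anyway
    total = counts.sum()
    # Guard against users whose data were entirely filtered out as outliers.
    Xn.append(counts / total if total > 0 else np.zeros_like(counts, dtype=float))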
My data set is very large and this process could take a while. I am wondering if there is a better way to do this, or a package that already does it.
1 Answer
For large datasets, avoid converting between native Python list and NumPy array objects as much as possible. Look at the np.loadtxt and np.genfromtxt functions. They may help you go from saved files of your data to NumPy arrays without having to make any Python lists at all. But suppose you do have a Python list. You shouldn't have to convert all of the data to a NumPy array, and then later convert each user's data to an array separately. I would try something like this, assuming that np.loadtxt doesn't work for you:
doesn't work for you:data_lengths = [len(Xi) for Xi in X] num_users = len(X) max_length = max(data_lengths) all_data = np.zeros(shape = (num_users, max_length), dtype = 'int') for row, Xi in enumerate(X): row_length = len(Xi) all_data[row, 0:row_length] = Xi
From then on, every operation on your data should be an operation on a NumPy array instead of on a Python list. The way I wrote it, it assumes your data are integers, and also that 0 can never occur as a real data point. You can modify the dtype and offset of the call to np.zeros accordingly to meet the requirements of your particular data. This approach will only be good if each user has a number of data points that is not too different from the number for the other users; otherwise, representing your data as a full matrix will be memory-inefficient.
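For instance, if 0 can legitimately occur in your data, one alternative (my suggestion, not part of the answer above) is to pad with NaN and a float dtype, so the padding can never collide with real values:

# NaN-padded float matrix instead of a zero-padded integer one.
all_data = np.full((num_users, max_length), np.nan)
for row, Xi in enumerate(X):
    all_data[row, :len(Xi)] = Xi

valid = ~np.isnan(all_data)  # mask of the real (non-padding) entries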
Use appropriate dtypes. If your data are non-negative integers, for example, then np.bincount() will be much faster than np.histogram. Actually, if your data are integers, then you could probably just use collections.Counter() to make your histograms in native Python, which could also save time.
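A minimal sketch of both ideas, assuming small non-negative integer data (the sample array is made up for illustration):

import collections
import numpy as np

data = np.array([0, 1, 1, 3, 3, 3, 7])

counts = np.bincount(data)                 # counts[v] == number of occurrences of value v
hist = collections.Counter(data.tolist())  # the same tally in native Python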
Most of the time is likely spent in your for i in range... loops, so that's where I would focus on optimizing. Have you considered using pandas? If you can provide a sample of the input data (your list X), it will be easier to make a recommendation that gives you the desired output in less time. Are you trying to get the histogram so that you can graph it?
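To make the pandas suggestion concrete, here is one hedged sketch (the column names and long-format layout are my own choices, not from the answer), reusing the threshold tr and the bin edges x computed earlier:

import pandas as pd

# Long format: one row per observation, tagged with its user index.
df = pd.DataFrame(
    [(i, v) for i, xi in enumerate(X) for v in xi],
    columns=['user', 'value'],
).dropna()

df = df[df['value'] < tr]                # remove outliers, as before
df['bin'] = pd.cut(df['value'], bins=x)  # assign each value to a global bin

# Per-user normalized histogram: the fraction of each user's points per bin.
Xn = (df.groupby('user')['bin']
        .value_counts(normalize=True)
        .unstack(fill_value=0))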