The third column in my numpy array is Age. In this column about 75% of the entries are valid and 25% are blank. Column 2 is Gender and using some manipulation I have calculated the average age of the men in my dataset to be 30. The average age of women in my dataset is 28.
I want to replace all blank Age values for men to be 30 and all blank age values for women to be 28.
However I can't seem to do this. Anyone have a suggestion or know what I am doing wrong?
Here is my code:
# my entire data set is stored in a numpy array defined as x
ismale = x[::,1]=='male'
maleAgeBlank = x[ismale][::,2]==''
x[ismale][maleAgeBlank][::,2] = 30
For whatever reason when I'm done with the above code, I type x to display the data set and the blanks still exist even though I set them to 30. Note that I cannot do x[maleAgeBlank] because that list will include some female data points since the female data points are not yet excluded.
Is there any way to get what I want? For some reason, if I do x[ismale][::,1] = 1 (setting the column with 'male' equal to 1), that works, but x[ismale][maleAgeBlank][::,2] = 30 does not work.
sample of array:
#output from typing x
array([['3', '1', '22', ..., '0', '7.25', '2'],
['1', '0', '38', ..., '0', '71.2833', '0'],
['3', '0', '26', ..., '0', '7.925', '2'],
...,
['3', '0', '', ..., '2', '23.45', '2'],
['1', '1', '26', ..., '0', '30', '0'],
['3', '1', '32', ..., '0', '7.75', '1']],
dtype='<U82')
#output from typing x[0]
array(['3', '1', '22', '1', '0', '7.25', '2'],
dtype='<U82')
Note that I have changed column 2 to be 0 for female and 1 for male already in the above output
Can you post a sample of the array? (comment by user1301404, Nov 10, 2013 at 0:40)
3 Answers
How about this:
import numpy as np
my_data = np.array([['3', '1', '22', '0', '7.25', '2'],
['1', '0', '38', '0', '71.2833', '0'],
['3', '0', '26', '0', '7.925', '2'],
['3', '0', '', '2', '23.45', '2'],
['1', '1', '26', '0', '30', '0'],
['3', '1', '32', '0', '7.75', '1']],
dtype='<U82')
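# gender column: '1' = male, '0' = female (as noted in the question)
ismale = my_data[:, 1] == '1'
missing_age = my_data[:, 2] == ''
maleAgeBlank = missing_age & ismale
femaleAgeBlank = missing_age & ~ismale
# index my_data once with the combined mask so the write lands in the array itself
my_data[maleAgeBlank, 2] = '30'
my_data[femaleAgeBlank, 2] = '28'
Result (the only blank age in this sample belongs to a female row, so it becomes 28):
>>> my_data
array([[u'3', u'1', u'22', u'0', u'7.25', u'2'],
[u'1', u'0', u'38', u'0', u'71.2833', u'0'],
[u'3', u'0', u'26', u'0', u'7.925', u'2'],
[u'3', u'0', u'28', u'2', u'23.45', u'2'],
[u'1', u'1', u'26', u'0', u'30', u'0'],
[u'3', u'1', u'32', u'0', u'7.75', u'1']],
dtype='<U82')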
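As for why the code in the question had no effect: indexing with a boolean mask returns a copy, so x[ismale][...] = 30 writes into a temporary array that is thrown away. The sketch below illustrates this on a small made-up array (the data and the names is_male, male_age_blank and chained_result are assumptions for illustration, not taken from the question's data set).

```python
import numpy as np

# Hypothetical toy data: columns are class, gender ('1' = male), age.
x = np.array([['3', '1', ''],
              ['1', '0', ''],
              ['3', '1', '22']], dtype='<U82')

is_male = x[:, 1] == '1'

# Chained indexing: x[is_male] is a copy, so this assignment is lost.
male_age_blank = x[is_male][:, 2] == ''
x[is_male][male_age_blank, 2] = '30'
chained_result = x.copy()  # snapshot: the blank ages are still blank

# One indexing step with a combined mask: the write goes into x itself.
x[is_male & (x[:, 2] == ''), 2] = '30'
```

After the last line only the male row with a blank age has changed; the female row's blank is left for a separate assignment.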
You can use the where function:
import numpy as np
arr = np.array([['3', '1', '22', '1', '0', '7.25', '2'],
['3', '', '22', '1', '0', '7.25', '2']],
dtype='<U82')
blank = np.where(arr=='')
arr[blank] = 20
array([[u'3', u'1', u'22', u'1', u'0', u'7.25', u'2'],
[u'3', u'20', u'22', u'1', u'0', u'7.25', u'2']],
dtype='<U82')
If you want to change a specific column, you can do the following:
blank_rows = np.where(arr[:, 1] == '')[0] # rows where column 1 is blank
arr[blank_rows, 1] = '30' # write only column 1, not the whole row
blank_rows = np.where(arr[:, 2] == '')[0] # rows where column 2 is blank
arr[blank_rows, 2] = '28'
2 Comments
where is efficient, but the current solution doesn't check the row's gender value and changes all blanks, not just those in the age column.
You could try iterating through the array in a simpler way. It's not the most efficient solution, but it should get the job done.
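# iterating over a 2-D array yields views of its rows, so assignments modify x in place
for row in x:
    if row[2] == '':
        if row[1] == '1':  # values are strings; '1' = male
            row[2] = '30'
        else:
            row[2] = '28'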
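The same branching can also be written without an explicit loop. The following is a minimal sketch using the three-argument form of np.where, assuming the gender coding from the question ('1' = male, '0' = female) and a small made-up array; the names gender, age and fill are illustrative only.

```python
import numpy as np

# Hypothetical toy data: columns are class, gender ('1' = male), age, fare.
arr = np.array([['3', '1', '', '7.25'],
                ['1', '0', '', '71.2833'],
                ['3', '0', '26', '7.925']], dtype='<U82')

gender = arr[:, 1]
age = arr[:, 2]

# Per-row fill value: '30' for male rows, '28' otherwise.
fill = np.where(gender == '1', '30', '28')

# Keep existing ages and substitute the fill value only where the age is blank.
arr[:, 2] = np.where(age == '', fill, age)
```

This touches only the age column and leaves non-blank ages unchanged, which is the behavior the loop above aims for.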