Categorization algorithm for discrete variables

Question 1

I am trying to categorize some data. To do that, I check the distribution of the data and then split it based on the number of appearances of each value. The algorithm I have works so far, but it is really slow, and I am looking to improve its speed. Speed matters here because I process a lot of different data using the same structure, and the data is fairly large (140k rows).

def RamsesIdCategory(data):
    # handling Ramses Id:
    print('Starting Ramses Id')
    valueRamses = data['Ramses Trade Id'].unique()
    countRamses = data['Ramses Trade Id'].value_counts()
    for value in valueRamses:
        if countRamses.get(value) < 2:
            data['Ramses Trade Id'].loc[data['Ramses Trade Id'] == value] = 1
        elif 2 <= countRamses.get(value) < 5:
            data['Ramses Trade Id'].loc[data['Ramses Trade Id'] == value] = 2
        elif 5 <= countRamses.get(value) < 10:
            data['Ramses Trade Id'].loc[data['Ramses Trade Id'] == value] = 3
        elif 10 <= countRamses.get(value) < 20:
            data['Ramses Trade Id'].loc[data['Ramses Trade Id'] == value] = 4
        elif 20 <= countRamses.get(value) < 32:
            data['Ramses Trade Id'].loc[data['Ramses Trade Id'] == value] = 5
        else:
            data['Ramses Trade Id'].loc[data['Ramses Trade Id'] == value] = 6
    print('finished Ramses Id')
    return data

EDIT: I reworked my code, as I knew the problem was the loop doing too many iterations over my rows. Here is the new version:

def RamsesIdCategory(data):
    # handling Ramses Id:
    print('Starting Ramses Id')
    valueRamses = data['Ramses Trade Id'].value_counts()
    for i in data.index:
        if valueRamses.get(data.get_value(i, 'Ramses Trade Id')) < 2:
            data.set_value(i, 'Ramses Trade Id', 1)
        elif 2 <= valueRamses.get(data.get_value(i, 'Ramses Trade Id')) < 5:
            data.set_value(i, 'Ramses Trade Id', 2)
        elif 5 <= valueRamses.get(data.get_value(i, 'Ramses Trade Id')) < 10:
            data.set_value(i, 'Ramses Trade Id', 3)
        elif 10 <= valueRamses.get(data.get_value(i, 'Ramses Trade Id')) < 20:
            data.set_value(i, 'Ramses Trade Id', 4)
        elif 20 <= valueRamses.get(data.get_value(i, 'Ramses Trade Id')) < 32:
            data.set_value(i, 'Ramses Trade Id', 5)
        else:
            data.set_value(i, 'Ramses Trade Id', 6)
    return(data)

I iterate over my whole dataset once and do a single select and modification per row, instead of doing a multi-row modification on the entire dataframe for each distinct value. It is 100 times faster: it ran in a few seconds versus 50 minutes.

Answer

The code can be largely simplified using apply. But first, you need a better way to test your values and assign them an id:

def convert_count_to_id(count, limits=(2, 5, 10, 20, 32)):
    for id, limit in enumerate(limits, 1):
        if count < limit:
            return id
    return id + 1

This is equivalent to your elif chain but harder to get wrong.
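As a quick check of that equivalence, here is a small sketch that compares the helper with a hand-written copy of the chain from the question over a range of counts. The name elif_chain is introduced only for this comparison and is not part of the original code.

```python
def convert_count_to_id(count, limits=(2, 5, 10, 20, 32)):
    # Return the 1-based position of the first limit the count falls under.
    for id, limit in enumerate(limits, 1):
        if count < limit:
            return id
    # The count is at or above every limit: use the next id.
    return id + 1


def elif_chain(count):
    # Hypothetical reference: the chain from the question, written out by hand.
    if count < 2:
        return 1
    elif 2 <= count < 5:
        return 2
    elif 5 <= count < 10:
        return 3
    elif 10 <= count < 20:
        return 4
    elif 20 <= count < 32:
        return 5
    else:
        return 6


# Occurrence counts are at least 1, so check every count from 1 to 99,
# which covers both sides of each boundary.
mismatches = [c for c in range(1, 100) if convert_count_to_id(c) != elif_chain(c)]
print(mismatches)  # prints []
```

One caveat of the helper as written: with an empty limits tuple the loop never binds id, so the final return would fail. That does not matter for the default limits.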

Now your function can become:

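def ramses_id_category(data):
    serie_name = 'Ramses Trade Id'
    # Number of occurrences of each distinct trade id
    value_ramses = data[serie_name].value_counts()
    # Category id for each distinct trade id
    id_ramses = value_ramses.apply(convert_count_to_id)
    # Replace each row's trade id with its category id
    data[serie_name] = data[serie_name].apply(id_ramses.get)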

Note that I removed the final return data statement. Since you are mutating the parameter in place, there is no need to return it: the caller already sees the changes through the reference they passed in.
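For readers mainly interested in speed, the same categorization can also be written without any per-row Python call. The following is only a sketch under assumptions: the function name ramses_id_category_vectorized, the use of NumPy's searchsorted, and the sample frame are mine, not part of the answer, and the column is assumed to contain no missing values. Whether it is actually faster on a given dataset should be measured rather than assumed.

```python
import numpy as np
import pandas as pd


def ramses_id_category_vectorized(data, limits=(2, 5, 10, 20, 32)):
    # Hypothetical variant of ramses_id_category, not the answer's own code.
    serie_name = 'Ramses Trade Id'
    # Per-row occurrence count of that row's trade id.
    counts = data[serie_name].map(data[serie_name].value_counts())
    # side='right' gives the number of limits <= count, so adding 1
    # reproduces the "count < limit" test of convert_count_to_id.
    data[serie_name] = np.searchsorted(limits, counts.to_numpy(), side='right') + 1


# Made-up sample: id 101 appears twice, 202 once, 303 five times.
frame = pd.DataFrame({'Ramses Trade Id': [101, 101, 202, 303, 303, 303, 303, 303]})
ramses_id_category_vectorized(frame)  # nothing is returned
print(frame['Ramses Trade Id'].tolist())  # prints [2, 2, 1, 3, 3, 3, 3, 3]
```

The call also illustrates the point about in-place mutation: the function returns nothing, yet the caller's frame holds the new category ids afterwards.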

score 2 · Accepted Answer · 2017-03-15 15:04:30Z

answered Mar 15, 2017 at 15:04 by 301_Moved_Permanently (edited Mar 15, 2017 at 20:20)

Comments

How does the switch to apply affect the speed? codereview.stackexchange.com/questions/157797/… uses apply but worries about the speed.
– hpaulj, Mar 15, 2017 at 18:10

@hpaulj I said nothing about speed, that was not my point.
– 301_Moved_Permanently, Mar 15, 2017 at 19:13

Maybe I'm asking for the other questioner. I don't use Pandas much, and don't have a feel for the speed of various methods.
– hpaulj, Mar 15, 2017 at 19:36

alright so here you assigned default values to the limit parameter in the convert_count_to_id but I can change it and apply it to all the values I got that are split under different limits right?
– Mayeul sgc, Mar 16, 2017 at 1:57

@Mayeulsgc Yes, and you can even change the number of limits. You can change them at the function definition or at the calling point. You can even use functools.partial to be able to change them in an apply call.
– 301_Moved_Permanently, Mar 16, 2017 at 7:22
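The functools.partial suggestion from the last comment could look roughly like this. It is a minimal sketch: the limits (3, 50) and the name coarse_id are made up for illustration, and a plain list stands in for the pandas Series of counts so that the example runs without pandas.

```python
from functools import partial


def convert_count_to_id(count, limits=(2, 5, 10, 20, 32)):
    for id, limit in enumerate(limits, 1):
        if count < limit:
            return id
    return id + 1


# Hypothetical limits for some other column: three categories instead of six.
coarse_id = partial(convert_count_to_id, limits=(3, 50))

counts = [1, 3, 49, 50, 1000]
print([coarse_id(c) for c in counts])  # prints [1, 2, 2, 3, 3]

# With a pandas Series of counts, the same callable can be passed to apply:
# id_ramses = value_ramses.apply(coarse_id)
```

Binding the limits once with partial keeps the call site a single-argument function, which is the shape apply expects for each element.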
