Random Forest and missing values in numeric features

Question 1

I'd like to use a random forest for predicting how long a person will stay a customer of our company. One feature I'd like to use is the average age of the customer's kids.

The problem is that some customers don't have kids, so I can't compute an average. Moreover, I can't put 0, because that value means something else: the customer has just had their first kid.

How could I handle these missing values in a random forest?

Does it make sense to substitute an impossible value like -1? If so, is it even better to use a large negative value like -500?

Comment 1

Put 0 and add a second feature: a binary indicator for whether or not the customer has kids. Feature scaling is irrelevant for random forests, since they only look for cutpoints and ignore the range of values.

Comment 2

Since feature scaling is indeed irrelevant, couldn't you code the missing values as -1 and leave the zeros as they are?

mkt · Answer 1 · 2024-09-01 17:06:27Z

The traditional solution for regression is to use two variables: (i) an indicator variable (0/1) that denotes whether or not the individual has children, and (ii) a variable holding the average age of the children, set to 0 when there are none. The advantage is that the data are completely unambiguous: a 0 in the second variable is read together with the indicator. I would recommend this approach.
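A minimal sketch of the two-variable encoding, using only the Python standard library. The function name `encode_kids` and the sample data are hypothetical, and filling the average with 0.0 for childless customers follows the suggestion made in the comments.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def encode_kids(ages: Optional[Sequence[float]]) -> Tuple[int, float]:
    """Return (has_kids, avg_kid_age) for one customer.

    `ages` is None or empty when the customer has no children. The average
    is then filled with 0.0, and the indicator is what tells the model to
    read that 0 as "no kids" rather than "newborn".
    """
    if not ages:
        return 0, 0.0
    return 1, float(mean(ages))


# Hypothetical customers: no kids, a newborn, two kids aged 3 and 7, no kids.
customers = [None, [0.0], [3.0, 7.0], []]
features = [encode_kids(a) for a in customers]
print(features)
```

The first and second customers both get an average of 0.0, but the indicator column keeps them apart.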
But as noted in the comments, this approach is not strictly necessary for random forests, which only look for breaks (cutpoints) in a feature's range. Retaining the single variable and using -1 would therefore work fine. Any value outside the range of the data would work, but an impossible value for a mean age, such as a negative number, is less ambiguous and therefore less likely to cause confusion when someone else looks at the data sheet.
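To illustrate why the size of the sentinel does not matter, here is a rough sketch of a single regression split, chosen by minimising the summed squared error. This is a toy stand-in for one node of one tree, not the code of any particular random forest library; `best_split`, `encode`, and the data are made up for the example.

```python
from typing import List, Optional, Sequence, Tuple


def best_split(x: Sequence[float], y: Sequence[float]) -> Tuple[float, List[int]]:
    """Find the threshold on x that minimises the summed squared error of y.

    Returns (threshold, sorted indices of the samples sent to the left branch).
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_sse, best_thr, best_left = float("inf"), 0.0, []
    for k in range(1, len(order)):
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # no threshold can separate equal values
        left, right = order[:k], order[k:]
        sse = 0.0
        for side in (left, right):
            m = sum(y[i] for i in side) / len(side)
            sse += sum((y[i] - m) ** 2 for i in side)
        if sse < best_sse:
            best_sse, best_thr, best_left = sse, (lo + hi) / 2, sorted(left)
    return best_thr, best_left


def encode(ages: Sequence[Optional[float]], sentinel: float) -> List[float]:
    """Replace missing averages (None) with the chosen sentinel value."""
    return [sentinel if a is None else a for a in ages]


# Hypothetical data: average kid age (None = no kids) and tenure in years.
avg_age = [None, None, 0.0, 2.0, 6.0, 11.0]
tenure = [1.0, 1.5, 6.0, 6.5, 7.0, 7.5]

thr_small, left_small = best_split(encode(avg_age, -1.0), tenure)
thr_big, left_big = best_split(encode(avg_age, -500.0), tenure)
print(thr_small, left_small)
print(thr_big, left_big)
```

The threshold moves (it is the midpoint between the sentinel and the smallest real value), but the same customers end up on each side, so the split and its predictions are unchanged.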
For completeness, there is a third solution. Some random forest implementations can also split on NAs: at a node, a branch is created specifically for missing values. The node may therefore have three branches, e.g. $X_1 > 10$, $X_1 \le 10$, and $X_1$ = NA. So strictly speaking, you could replace the -1 in the second solution with an NA. But it's better to avoid this unless you also add the indicator variable. There may be cases where the customer has children but their ages are unknown, which gives NA two possible meanings (no children, or children of unknown age). That situation should be avoided because it would likely lead to a poorer model.

phiver · Answer 2 · 2015-07-25 12:00:41Z

In your case I would start with some feature engineering. As mentioned by @user777, start by creating a new variable that indicates whether or not a customer has kids. Then I would go into more detail, for example by adding a feature for the number of kids. If you have the ages of the individual kids, look into creating age-range buckets, or check whether a polynomial of age has any relevance.
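A sketch of what such engineered features might look like, using only the standard library. The bucket edges, the feature names, and the helper `kid_features` are all assumptions chosen for illustration; sensible buckets depend on the business context.

```python
from typing import Dict, Optional, Sequence

# Hypothetical age buckets in years: (lower bound inclusive, upper bound exclusive, label).
AGE_BUCKETS = [(0, 3, "0-2"), (3, 6, "3-5"), (6, 13, "6-12"), (13, 18, "13-17")]


def kid_features(ages: Optional[Sequence[float]]) -> Dict[str, float]:
    """Build indicator, count, bucket-count and polynomial features for one customer."""
    ages = list(ages or [])
    feats: Dict[str, float] = {
        "has_kids": int(bool(ages)),  # binary indicator
        "n_kids": len(ages),          # number of kids
    }
    # Count of kids in each bucket; ages of 18 and above fall in no bucket here.
    for lo, hi, label in AGE_BUCKETS:
        feats[f"n_kids_{label}"] = sum(lo <= a < hi for a in ages)
    avg = sum(ages) / len(ages) if ages else 0.0
    feats["avg_age"] = avg
    feats["avg_age_sq"] = avg ** 2    # simple polynomial term to test for relevance
    return feats


example = kid_features([1.0, 4.0, 4.0])
print(example)
```

Whether any of these extra columns helps is an empirical question; comparing validation error with and without them is the usual check.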
