5
$\begingroup$

I'd like to use a random forest for predicting how long a person will stay a customer of our company. One feature I'd like to use is the average age of the customer's kids.

The problem is that some customers don't have kids, so I can't compute an average. I also can't use 0, because that value already means something else: the customer has just had their first kid.

How can I handle these missing values in a random forest?

Does it make sense to substitute an impossible value like -1? If so, is it even better to use a large negative value like -500?

gung - Reinstate Monica
asked Jul 24, 2015 at 16:52
$\endgroup$
2
  • 5
    $\begingroup$ Put 0 and make a second feature that is a binary indicator for whether or not the customer has kids. Feature scaling is irrelevant in RF since it just looks for cutpoints, not at the range of values. $\endgroup$ Commented Jul 24, 2015 at 17:18
  • 4
    $\begingroup$ As feature scaling is indeed irrelevant, could you not code the missing values with -1 and leave the zeros as is? $\endgroup$ Commented Jul 24, 2015 at 17:47

2 Answers

1
$\begingroup$
  1. The traditional solution for regression is to use two variables: (i) an indicator variable (0/1) that denotes whether the individual has children or not, and (ii) a variable with the average age of the children. This has the advantage of the data being completely unambiguous. I would recommend this approach.

  2. But as noted in the comments, this approach is not strictly necessary for random forests, which just look for breaks/cutpoints in a data range. So retaining the single variable and using -1 would work fine. Any value more extreme than the data would work, but an impossible value (for mean age) such as a negative number is less ambiguous and therefore less likely to cause confusion when someone else looks at the data sheet.

  3. For completeness, there is a third solution. Some random forest implementations can also split on NAs: at a node, a split is created that specifically indicates whether the value is missing. The node may therefore have three branches, e.g. $X_1 > 10$, $X_1 \le 10$, and $X_1$ = NA. So, strictly speaking, you could replace the -1 in the second solution with an NA. But it's better to avoid this unless you also add the indicator variable, because there may be customers who have children but whose children's ages are unknown. NA would then have two possible meanings, a situation to be avoided because it would lead to a poorer model.
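
To make the first two options concrete, here is a minimal sketch in plain Python. The customer data, the `encode` helper, and the `split_left` helper are all made up for illustration; `split_left` only mimics the kind of rule a tree node applies (`value <= threshold`) and is not any library's API. It builds both the indicator variable and the sentinel-coded average, and checks that a cutpoint between the sentinel and 0 isolates the same childless customers whether the sentinel is -1 or -500.

```python
from statistics import mean

# Hypothetical customers: each entry lists the kids' ages (empty list = no kids).
kids_ages = [[], [0.0], [3.0, 7.0], [], [12.0]]

def encode(ages, sentinel=-1.0):
    """Return (has_kids, avg_kid_age), using a sentinel when there are no kids."""
    if not ages:
        return 0, sentinel
    return 1, mean(ages)

def split_left(values, threshold):
    """Indices that a tree-style rule `value <= threshold` would send left."""
    return {i for i, v in enumerate(values) if v <= threshold}

# Option 1 and option 2 together: indicator plus sentinel-coded average.
features = [encode(a) for a in kids_ages]
avg_small = [f[1] for f in features]
avg_large = [encode(a, sentinel=-500.0)[1] for a in kids_ages]

# Any cutpoint between the sentinel and 0 separates "no kids" from "newborn".
no_kids_small = split_left(avg_small, -0.5)
no_kids_large = split_left(avg_large, -250.0)
```

Both sentinels induce the same partition of the customers, which is why the magnitude of the impossible value should not matter to a tree that only looks for cutpoints. The customer whose only child has age 0 stays on the other side of the split in both cases.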

answered Sep 1, 2024 at 17:06
$\endgroup$
-1
$\begingroup$

In your case I would start with some feature engineering. As mentioned by @user777, start by creating a new variable that indicates whether or not a customer has kids. Then I would look into more detail, for example a variable with the number of kids. If you have the ages of the kids, look into creating age-range buckets, or check whether a polynomial of age has any relevance.
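
As a rough sketch of what such features could look like, the snippet below counts kids per age bucket. The bucket edges, the labels, and the `kid_features` helper are assumptions chosen for illustration, not something taken from the question. A side effect of this encoding is that a customer without kids simply gets zeros everywhere, so no missing value has to be filled in.

```python
from bisect import bisect_right

# Hypothetical bucket edges in years; a kid aged exactly 3 falls into "3-5".
EDGES = [3, 6, 13, 18]
LABELS = ["0-2", "3-5", "6-12", "13-17", "18+"]

def kid_features(ages):
    """Return the number of kids plus a count of kids in each age bucket."""
    counts = {label: 0 for label in LABELS}
    for age in ages:
        counts[LABELS[bisect_right(EDGES, age)]] += 1
    features = {"n_kids": len(ages)}
    for label, count in counts.items():
        features[f"kids_{label}"] = count
    return features

with_kids = kid_features([0.5, 4, 13])
no_kids = kid_features([])
```

Whether bucket counts carry more signal than the average age is an empirical question; comparing both encodings on held-out data would be the way to decide.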

answered Jul 25, 2015 at 12:00
$\endgroup$
