Congratulations! You have mastered all the techniques to iterate through a pandas DataFrame and apply functions on its values! In case you wonder, it's expected to get 5 identical values. The dataset contains all the possible combinations of 5 cards from a standard deck of cards: the columns all contain the same cards, although in a different order, so the variance is the same for all columns.
This chapter describes the groupby() function and how we can use it to transform values in place, replace missing values and apply complex functions group-wise.
A very common operation is the min-max normalization. It consists of rescaling the value of interest by subtracting the minimum value and dividing the result by the difference between the maximum and the minimum value. For example, to rescale students' weight data spanning from 160 pounds to 200 pounds, you subtract 160 from each student's weight and divide the result by 40 (200 - 160).
You're going to define and apply the min-max normalization to all the numerical variables in the restaurant data. You will first group the entries by the time the meal took place (Lunch or Dinner) and then apply the normalization to each group separately.
Remember you can always explore the dataset and see how it changes in the IPython Shell, and refer to the slides in the Slides tab.
Instructions
100 XP
Define the min-max normalization using a lambda function.
Group the data according to the time the meal took place.
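A minimal sketch of those two steps, assuming the restaurant data is loaded as a DataFrame named restaurant_data with a 'time' column and numeric 'total_bill' and 'tip' columns (column names assumed from the exercises that follow). Selecting the numeric columns explicitly keeps the lambda off non-numeric columns.

import pandas as pd

# Define the min-max normalization with a lambda
min_max_tr = lambda x: (x - x.min()) / (x.max() - x.min())

# Group by the time the meal took place, then rescale each numeric column within its group
restaurant_grouped = restaurant_data.groupby('time')
restaurant_min_max_group = restaurant_grouped[['total_bill', 'tip']].transform(min_max_tr)
print(restaurant_min_max_group.head())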
In this exercise, we will apply a probability distribution function to a pandas DataFrame with group-specific parameters by transforming the tip variable to probabilities.
The transformation will be an exponential transformation. The exponential distribution is defined as
e^(-λ * x) * λ
where λ (lambda) is the mean of the group that the observation x belongs to.
You're going to apply the exponential distribution transformation to the size of each table in the dataset, after grouping the data according to the time of the day the meal took place. Remember to use each group's mean for the value of λ.
In Python, you can compute the exponential with np.exp() from the NumPy library and the mean value with the .mean() method.
Instructions
100 XP
Define the exponential distribution transformation exp_tr.
Group the data according to the time the meal took place.
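A sketch under the same assumptions (restaurant_data grouped by 'time'). The exercise text mentions both the tip variable and the table size, so substitute whichever column you are transforming; 'tip' is used here.

import numpy as np

# Exponential transformation: e^(-lambda * x) * lambda, with lambda = the group mean
exp_tr = lambda x: np.exp(-x * x.mean()) * x.mean()

# Group by meal time and transform the column of interest
restaurant_grouped = restaurant_data.groupby('time')
restaurant_exp_group = restaurant_grouped['tip'].transform(exp_tr)
print(restaurant_exp_group.head())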
For this exercise, we will perform a z-score normalization and verify that it was performed correctly.
A distinct characteristic of normalized values is that they have a mean equal to zero and standard deviation equal to one.
After you apply the normalization transformation, you can group again on the same variable, and then check the mean and the standard deviation of each group.
You will apply the normalization transformation to every numeric variable in the poker_grouped dataset, which is the poker_hands dataset grouped by Class.
Instructions 1/2
50 XP
Apply the normalization transformation to the grouped object poker_grouped.
# Define the z-score normalization
zscore = lambda x: (x - x.mean()) / x.std()
# Apply the transformation
poker_trans = poker_grouped.transform(zscore)
print(poker_trans.head())
Instructions 2/2
50 XP
Group poker_trans by Class and print the mean and standard deviation to validate that the normalization was done correctly.
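A sketch of that check, assuming poker_hands is the ungrouped DataFrame that still carries the Class column and that NumPy is imported as np:

# Re-group the transformed values by the original Class labels
poker_regrouped = poker_trans.groupby(poker_hands['Class'])

# Each group's mean should be (approximately) 0 and its standard deviation 1
print(np.round(poker_regrouped.mean(), 3))
print(poker_regrouped.std())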
The first step before missing value imputation is to identify whether there are missing values in our data and, if so, which group they arise from.
For the same restaurant_data you encountered in the lesson, an employee mistakenly erased the tips left at 65 tables. The question at stake is how many missing entries came from tables where smokers were present versus tables where non-smokers were present.
Your task is to group both datasets according to the smoker variable, count the number of present values and then calculate the difference.
We're imputing tips to get you to practice the concepts taught in the lesson. From an ethical standpoint, you should not impute financial data in real life, as it could be considered fraud.
Instructions
100 XP
Group the data according to smoking status.
Calculate the number of non-missing values in each group.
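A sketch of those steps, assuming the complete data is restaurant_data and the version with the erased tips is available as restaurant_nan (the latter name assumed):

# Group both datasets according to smoking status
restaurant_grouped = restaurant_data.groupby('smoker')
restaurant_nan_grouped = restaurant_nan.groupby('smoker')

# Count the non-missing tip entries in each group
prior_nval = restaurant_grouped['tip'].count()
after_nval = restaurant_nan_grouped['tip'].count()

# The difference is the number of missing entries per group
print(prior_nval - after_nval)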
As the majority of real-world data contains missing entries, replacing these entries with sensible values can increase the insight you can get from your data.
In the restaurant dataset, the "total_bill" column has some missing entries, meaning that you have not recorded how much some tables have paid. Your task in this exercise is to replace the missing entries with the median value of the amount paid, according to whether the entry was recorded on lunch or dinner (time variable).
Instructions 1/2
50 XP
Define the lambda function that fills missing values with the median.
# Define the lambda function
missing_trans = lambda x: x.fillna(x.median())
Instructions 2/2
50 XP
Group the data according to the time of each entry.
Apply and print the pre-defined transformation to impute the missing values in the restaurant_data dataset.
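Putting the second step together, a sketch assuming the missing entries sit in the numeric 'total_bill' and 'tip' columns (selecting them explicitly keeps the median off non-numeric columns):

# Group the data according to the time of each entry
restaurant_grouped = restaurant_data.groupby('time')

# Apply the pre-defined transformation and print the imputed values
restaurant_impute = restaurant_grouped[['total_bill', 'tip']].transform(missing_trans)
print(restaurant_impute.head())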