+100 XP
Congratulations! You have mastered all the techniques to iterate through a pandas DataFrame and apply functions on its values! In case you're wondering, it's expected to get 5 identical values. The dataset contains all the possible combinations of 5 cards from a standard deck of cards: the columns all contain the same cards, although in a different order, so the variance is the same for all columns.

====================================================================================================================================
Chapter 4: Data manipulation using .groupby()

This chapter describes the groupby() function and how we can use it to transform values in place, replace missing values and apply complex functions group-wise.
______________________________________________________________________________________________________________________________
The min-max normalization using .transform()
A very common operation is the min-max normalization. It consists of rescaling our value of interest by subtracting the minimum value and dividing the result by the difference between the maximum and the minimum value. For example, to rescale students' weight data spanning from 160 pounds to 200 pounds, you subtract 160 from each student's weight and divide the result by 40 (200 - 160).
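
As a quick sanity check of that arithmetic (a minimal sketch; the weights below are made up for illustration):

# Hypothetical student weights in pounds, spanning 160 to 200
import pandas as pd
weights = pd.Series([160, 170, 180, 200])

# Min-max normalization: (x - min) / (max - min)
print(((weights - weights.min()) / (weights.max() - weights.min())).tolist())
# [0.0, 0.25, 0.5, 1.0]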

You're going to define and apply the min-max normalization to all the numerical variables in the restaurant data. You will first group the entries by the time the meal took place (Lunch or Dinner) and then apply the normalization to each group separately.

Remember you can always explore the dataset and see how it changes in the IPython Shell, and refer to the slides in the Slides tab.

Instructions
100 XP
Define the min-max normalization using a lambda function.
Group the data according to the time the meal took place.
Apply the transformation to the grouped data.

# Define the min-max transformation
min_max_tr = lambda x: (x - x.min()) / (x.max() - x.min())

# Group the data according to the time
restaurant_grouped = restaurant_data.groupby('time')

# Apply the transformation
restaurant_min_max_group = restaurant_grouped.transform(min_max_tr)
print(restaurant_min_max_group.head())

+100 XP
Well done! You can now use the transform function in any transformation you can define!
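
As a quick check (a sketch, assuming restaurant_min_max_group and restaurant_data are still defined as above), each group's rescaled values should now span 0 to 1:

# Re-group the rescaled values by the original time column
regrouped = restaurant_min_max_group.groupby(restaurant_data['time'])
print(regrouped.min())  # expected: 0.0 everywhere
print(regrouped.max())  # expected: 1.0 everywhere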

______________________________________________________________________________________________________________________________

Transforming values to probabilities
In this exercise, we will apply a probability distribution function to a pandas DataFrame with group-related parameters by transforming the tip variable to probabilities.

The transformation will be an exponential transformation. The exponential distribution is defined as

f(x) = λ * e^(-λ * x)

where λ (lambda) is the mean of the group that the observation x belongs to.

You're going to apply the exponential distribution transformation to the tip of each table in the dataset, after grouping the data according to the time of the day the meal took place. Remember to use each group's mean for the value of λ.

In Python, you can compute the exponential with np.exp() from the NumPy library and the mean value with .mean().
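
To connect the formula to code (a minimal sketch; the numbers are made-up stand-ins for a group mean and a single tip):

import numpy as np

lam = 3.0   # hypothetical group mean, used as lambda
x = 1.5     # hypothetical tip value
print(np.exp(-lam * x) * lam)  # value of the exponential density at x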

Instructions
100 XP
Define the exponential distribution transformation exp_tr.
Group the data according to the time the meal took place.
Apply the transformation to the grouped data.

# NumPy is needed for np.exp()
import numpy as np

# Define the exponential transformation
exp_tr = lambda x: np.exp(-x.mean()*x) * x.mean()

# Group the data according to the time
restaurant_grouped = restaurant_data.groupby('time')

# Apply the transformation
restaurant_exp_group = restaurant_grouped['tip'].transform(exp_tr)
print(restaurant_exp_group.head())

+100 XP
Well done! You can now use the transform function to transform frequencies to probabilities with group-related parameters!

<script.py> output:
    0    0.135141
    1    0.017986
    2    0.000060
    3    0.000108
    4    0.000042
    Name: tip, dtype: float64

______________________________________________________________________________________________________________________________

Validation of normalization
For this exercise, we will perform a z-score normalization and verify that it was performed correctly.

A distinct characteristic of normalized values is that they have a mean equal to zero and a standard deviation equal to one.

After you apply the normalization transformation, you can group again on the same variable, and then check the mean and the standard deviation of each group.

You will apply the normalization transformation to every numeric variable in the poker_grouped dataset, which is the poker_hands dataset grouped by Class.

Instructions 1/2
50 XP
Apply the normalization transformation to the grouped object poker_grouped.

# poker_grouped is the poker_hands dataset grouped by Class
poker_grouped = poker_hands.groupby('Class')

zscore = lambda x: (x - x.mean()) / x.std()

# Apply the transformation
poker_trans = poker_grouped.transform(zscore)
print(poker_trans.head())

Instructions 2/2
50 XP
Group poker_trans by Class and print the mean and standard deviation to validate that the normalization was done correctly.

import numpy as np

zscore = lambda x: (x - x.mean()) / x.std()

# Apply the transformation
poker_trans = poker_grouped.transform(zscore)

# Re-group the transformed DataFrame by the original Class column
poker_regrouped = poker_trans.groupby(poker_hands['Class'])

# Print each group's mean and standard deviation
print(np.round(poker_regrouped.mean(), 3))
print(poker_regrouped.std())

+50 XP
Well done! Now you know that the normalization was performed correctly, as the mean in every normalized group is 0 and the standard deviation is 1!
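
The same check can be written in one call (a sketch, reusing poker_regrouped from above): .agg() accepts a list of statistics and computes them side by side.

# Aggregate both statistics at once
print(poker_regrouped.agg(['mean', 'std']).round(3))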

______________________________________________________________________________________________________________________________

Identifying missing values
The first step before missing value imputation is to identify whether there are missing values in our data, and if so, from which group they arise.

For the same restaurant_data you encountered in the lesson, an employee mistakenly erased the tips left at 65 tables. The question at stake is how many missing entries came from tables where smokers were present versus tables where they were not.

Your task is to group the data according to the smoker variable, count the number of present values and then calculate the difference.

We're imputing tips to get you to practice the concepts taught in the lesson. From an ethical standpoint, you should not impute financial data in real life, as it could be considered fraud.

Instructions
100 XP
Group the data according to smoking status.
Calculate the number of non-missing values in each group.
Print the number of missing values in each group.

# Group the dataset according to smoker status
restaurant_nan_grouped = restaurant_nan.groupby('smoker')

# Store the number of present (non-missing) tip values per group
restaurant_nan_nval = restaurant_nan_grouped['tip'].count()

# Print the group-wise missing entries: total entries minus present tips
print(restaurant_nan_grouped['total_bill'].count() - restaurant_nan_nval)

<script.py> output:
    smoker
    No     41
    Yes    24
    dtype: int64

+100 XP
Well done! You know how to compare two grouped objects in terms of group-wise missing values! Let's see how we can fill these gaps.
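
An equivalent way to count the missing entries directly (a sketch, assuming the same restaurant_nan DataFrame) is to sum a boolean missingness mask group-wise:

# Sum the True values of the isna() mask within each smoker group
print(restaurant_nan['tip'].isna().groupby(restaurant_nan['smoker']).sum())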

______________________________________________________________________________________________________________________________

Missing value imputation
As the majority of real-world data contain missing entries, replacing these entries with sensible values can increase the insight you can get from the data.

In the restaurant dataset, the "total_bill" column has some missing entries, meaning that you have not recorded how much some tables have paid. Your task in this exercise is to replace the missing entries with the median value of the amount paid, according to whether the entry was recorded on lunch or dinner (time variable).

Instructions 1/2
50 XP
Define the lambda function that fills missing values with the median.

# Define the lambda function
missing_trans = lambda x: x.fillna(x.median())

Instructions 2/2
50 XP
Group the data according to the time of each entry.
Apply and print the pre-defined transformation to impute the missing values in the restaurant_data dataset.

# Define the lambda function
missing_trans = lambda x: x.fillna(x.median())

# Group the data according to time
restaurant_grouped = restaurant_data.groupby('time')

# Apply the transformation
restaurant_impute = restaurant_grouped.transform(missing_trans)
print(restaurant_impute.head())

<script.py> output:
       total_bill   tip  size
    0       16.99  1.01     2
    1       10.34  1.66     3
    2       18.69  3.50     3
    3       23.68  3.31     2
    4       24.59  3.61     4

+100 XP
Congratulations! You can now replace missing values in a dataset group-wise!
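
A quick way to confirm that the imputation worked (a sketch, reusing restaurant_impute from above):

# After imputation, no column should contain missing values
print(restaurant_impute.isna().sum())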

______________________________________________________________________________________________________________________________

Data filtration
As you noticed in the video lesson, you may need to filter your data for various reasons.

In this exercise, you will use filtering to select a specific part of our DataFrame:

by the number of entries recorded in each day of the week
by the mean amount of money the customers paid to the restaurant each day of the week

Instructions 1/3
35 XP
Create a new DataFrame containing only the days when the count of total_bill is greater than 40.

# Filter the days where the count of total_bill entries is greater than 40
total_bill_40 = restaurant_data.groupby('day').filter(lambda x: x['total_bill'].count() > 40)

# Print the number of tables recorded on those days
print('Number of tables recorded on days with more than 40 entries:', total_bill_40.shape[0])

Instructions 2/3
35 XP
From the total_bill_40 DataFrame, select only the entries that have a mean total_bill greater than 20,ドル grouped by day.

# Filter the days where the count of total_bill entries is greater than 40
total_bill_40 = restaurant_data.groupby('day').filter(lambda x: x['total_bill'].count() > 40)

# Select only the entries that have a mean total_bill greater than 20ドル
total_bill_20 = total_bill_40.groupby('day').filter(lambda x: x['total_bill'].mean() > 20)

# Print days of the week that have a mean total_bill greater than 20ドル
print('Days of the week that have a mean total_bill greater than 20ドル:', total_bill_20.day.unique())
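
To see which days survive each filtering step (a sketch, assuming restaurant_data is still loaded), compare the unique day values before and after:

print(restaurant_data['day'].unique())  # all recorded days
print(total_bill_40['day'].unique())    # days with more than 40 entries
print(total_bill_20['day'].unique())    # of those, days with a mean bill above 20ドル

Note that .filter() returns the original rows of every group that passes the predicate, so these DataFrames keep the same columns as restaurant_data.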