+100 XP
Congratulations! You have mastered all the techniques to iterate through a pandas DataFrame and apply functions on its values! In case you're wondering, it's expected to get 5 identical values. The dataset contains all the possible combinations of 5 cards from a standard deck of cards: the columns all contain the same cards, although in a different order, so the variance is the same for all columns.

====================================================================================================================================
Chapter 4: Data manipulation using .groupby()

This chapter describes the groupby() function and how we can use it to transform values in place, replace missing values and apply complex functions group-wise.
______________________________________________________________________________________________________________________________
The min-max normalization using .transform()
A very common operation is the min-max normalization. It consists of rescaling our value of interest by subtracting the minimum value and dividing the result by the difference between the maximum and the minimum value. For example, to rescale students' weight data spanning from 160 pounds to 200 pounds, you subtract 160 from each student's weight and divide the result by 40 (200 - 160).
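
As a quick sanity check of that arithmetic (a minimal sketch; the weights below are made up for illustration):

# Hypothetical student weights in pounds, spanning 160 to 200
import pandas as pd
weights = pd.Series([160, 170, 180, 200])

# Min-max normalization: (x - min) / (max - min)
print(((weights - weights.min()) / (weights.max() - weights.min())).tolist())
# [0.0, 0.25, 0.5, 1.0]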

You're going to define and apply the min-max normalization to all the numerical variables in the restaurant data. You will first group the entries by the time the meal took place (Lunch or Dinner) and then apply the normalization to each group separately.

Remember you can always explore the dataset and see how it changes in the IPython Shell, and refer to the slides in the Slides tab.

Instructions
100 XP
Define the min-max normalization using a lambda function.
Group the data according to the time the meal took place.
Apply the transformation to the grouped data.

# Define the min-max transformation
min_max_tr = lambda x: (x - x.min()) / (x.max() - x.min())

# Group the data according to the time
restaurant_grouped = restaurant_data.groupby('time')

# Apply the transformation
restaurant_min_max_group = restaurant_grouped.transform(min_max_tr)
print(restaurant_min_max_group.head())

+100 XP
Well done! You can now use the transform function in any transformation you can define!
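
As a quick check (a sketch, assuming restaurant_min_max_group and restaurant_data are still defined as above), each group's rescaled values should now span 0 to 1:

# Re-group the rescaled values by the original time column
regrouped = restaurant_min_max_group.groupby(restaurant_data['time'])
print(regrouped.min())  # expected: 0.0 everywhere
print(regrouped.max())  # expected: 1.0 everywhere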

______________________________________________________________________________________________________________________________

Transforming values to probabilities
In this exercise, we will apply a probability distribution function to a pandas DataFrame with group-related parameters by transforming the tip variable to probabilities.

The transformation will be an exponential transformation. The exponential distribution is defined as

f(x) = λ * e^(-λ * x)

where λ (lambda) is the mean of the group that the observation x belongs to.

You're going to apply the exponential distribution transformation to the tip of each table in the dataset, after grouping the data according to the time of the day the meal took place. Remember to use each group's mean for the value of λ.

In Python, you can compute the exponential with np.exp() from the NumPy library and the mean value with .mean().
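
To connect the formula to code (a minimal sketch; the numbers are made-up stand-ins for a group mean and a single tip):

import numpy as np

lam = 3.0   # hypothetical group mean, used as lambda
x = 1.5     # hypothetical tip value
print(np.exp(-lam * x) * lam)  # value of the exponential density at x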

Instructions
100 XP
Define the exponential distribution transformation exp_tr.
Group the data according to the time the meal took place.
Apply the transformation to the grouped data.

# NumPy is needed for np.exp()
import numpy as np

# Define the exponential transformation
exp_tr = lambda x: np.exp(-x.mean()*x) * x.mean()

# Group the data according to the time
restaurant_grouped = restaurant_data.groupby('time')

# Apply the transformation
restaurant_exp_group = restaurant_grouped['tip'].transform(exp_tr)
print(restaurant_exp_group.head())

+100 XP
Well done! You can now use the transform function to transform frequencies to probabilities with group-related parameters!

<script.py> output:
    0    0.135141
    1    0.017986
    2    0.000060
    3    0.000108
    4    0.000042
    Name: tip, dtype: float64

______________________________________________________________________________________________________________________________

Validation of normalization
For this exercise, we will perform a z-score normalization and verify that it was performed correctly.

A distinct characteristic of normalized values is that they have a mean equal to zero and a standard deviation equal to one.

After you apply the normalization transformation, you can group again on the same variable, and then check the mean and the standard deviation of each group.

You will apply the normalization transformation to every numeric variable in the poker_grouped dataset, which is the poker_hands dataset grouped by Class.

Instructions 1/2
50 XP
Apply the normalization transformation to the grouped object poker_grouped.

# poker_grouped is the poker_hands dataset grouped by Class
poker_grouped = poker_hands.groupby('Class')

zscore = lambda x: (x - x.mean()) / x.std()

# Apply the transformation
poker_trans = poker_grouped.transform(zscore)
print(poker_trans.head())

Instructions 2/2
50 XP
Group poker_trans by Class and print the mean and standard deviation to validate that the normalization was done correctly.

import numpy as np

zscore = lambda x: (x - x.mean()) / x.std()

# Apply the transformation
poker_trans = poker_grouped.transform(zscore)

# Re-group the transformed DataFrame by the original Class column
poker_regrouped = poker_trans.groupby(poker_hands['Class'])

# Print each group's mean and standard deviation
print(np.round(poker_regrouped.mean(), 3))
print(poker_regrouped.std())

+50 XP
Well done! Now you know that the normalization was performed correctly, as the mean in every normalized group is 0 and the standard deviation is 1!
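
The same check can be written in one call (a sketch, reusing poker_regrouped from above): .agg() accepts a list of statistics and computes them side by side.

# Aggregate both statistics at once
print(poker_regrouped.agg(['mean', 'std']).round(3))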

______________________________________________________________________________________________________________________________

Identifying missing values
The first step before missing value imputation is to identify whether there are missing values in our data, and if so, from which group they arise.

For the same restaurant_data you encountered in the lesson, an employee mistakenly erased the tips left at 65 tables. The question at stake is how many missing entries came from tables where smokers were present versus tables where they were not.

Your task is to group the data according to the smoker variable, count the number of present values and then calculate the difference.

We're imputing tips to get you to practice the concepts taught in the lesson. From an ethical standpoint, you should not impute financial data in real life, as it could be considered fraud.

Instructions
100 XP
Group the data according to smoking status.
Calculate the number of non-missing values in each group.
Print the number of missing values in each group.

# Group the dataset according to smoker status
restaurant_nan_grouped = restaurant_nan.groupby('smoker')

# Store the number of present (non-missing) tip values per group
restaurant_nan_nval = restaurant_nan_grouped['tip'].count()

# Print the group-wise missing entries: total entries minus present tips
print(restaurant_nan_grouped['total_bill'].count() - restaurant_nan_nval)

<script.py> output:
    smoker
    No     41
    Yes    24
    dtype: int64

+100 XP
Well done! You know how to compare two grouped objects in terms of group-wise missing values! Let's see how we can fill these gaps.
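
An equivalent way to count the missing entries directly (a sketch, assuming the same restaurant_nan DataFrame) is to sum a boolean missingness mask group-wise:

# Sum the True values of the isna() mask within each smoker group
print(restaurant_nan['tip'].isna().groupby(restaurant_nan['smoker']).sum())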

______________________________________________________________________________________________________________________________

Missing value imputation
As the majority of real-world data contain missing entries, replacing these entries with sensible values can increase the insight you can get from the data.

In the restaurant dataset, the "total_bill" column has some missing entries, meaning that you have not recorded how much some tables have paid. Your task in this exercise is to replace the missing entries with the median value of the amount paid, according to whether the entry was recorded on lunch or dinner (time variable).

Instructions 1/2
50 XP
Define the lambda function that fills missing values with the median.

# Define the lambda function
missing_trans = lambda x: x.fillna(x.median())

Instructions 2/2
50 XP
Group the data according to the time of each entry.
Apply and print the pre-defined transformation to impute the missing values in the restaurant_data dataset.

# Define the lambda function
missing_trans = lambda x: x.fillna(x.median())

# Group the data according to time
restaurant_grouped = restaurant_data.groupby('time')

# Apply the transformation
restaurant_impute = restaurant_grouped.transform(missing_trans)
print(restaurant_impute.head())

<script.py> output:
       total_bill   tip  size
    0       16.99  1.01     2
    1       10.34  1.66     3
    2       18.69  3.50     3
    3       23.68  3.31     2
    4       24.59  3.61     4

+100 XP
Congratulations! You can now replace missing values in a dataset group-wise!
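
A quick way to confirm that the imputation worked (a sketch, reusing restaurant_impute from above):

# After imputation, no column should contain missing values
print(restaurant_impute.isna().sum())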

______________________________________________________________________________________________________________________________

Data filtration
As you noticed in the video lesson, you may need to filter your data for various reasons.

In this exercise, you will use filtering to select a specific part of our DataFrame:

by the number of entries recorded in each day of the week
by the mean amount of money the customers paid to the restaurant each day of the week

Instructions 1/3
35 XP
Create a new DataFrame containing only the days when the count of total_bill is greater than 40.

# Filter the days where the count of total_bill entries is greater than 40
total_bill_40 = restaurant_data.groupby('day').filter(lambda x: x['total_bill'].count() > 40)

# Print the number of tables recorded on those days
print('Number of tables recorded on days with more than 40 entries:', total_bill_40.shape[0])

Instructions 2/3
35 XP
From the total_bill_40 DataFrame, select only the entries that have a mean total_bill greater than 20,ドル grouped by day.

# Filter the days where the count of total_bill entries is greater than 40
total_bill_40 = restaurant_data.groupby('day').filter(lambda x: x['total_bill'].count() > 40)

# Select only the entries that have a mean total_bill greater than 20ドル
total_bill_20 = total_bill_40.groupby('day').filter(lambda x: x['total_bill'].mean() > 20)

# Print days of the week that have a mean total_bill greater than 20ドル
print('Days of the week that have a mean total_bill greater than 20ドル:', total_bill_20.day.unique())
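
To see which days survive each filtering step (a sketch, assuming restaurant_data is still loaded), compare the unique day values before and after:

print(restaurant_data['day'].unique())  # all recorded days
print(total_bill_40['day'].unique())    # days with more than 40 entries
print(total_bill_20['day'].unique())    # of those, days with a mean bill above 20ドル

Note that .filter() returns the original rows of every group that passes the predicate, so these DataFrames keep the same columns as restaurant_data.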