Commit a60f70d

Add starting code for Case Study - Gapminder Foundation dataset
1 parent 1b0ecef commit a60f70d

1 file changed: +143 -1 lines changed

Cleaning_Data_in_Python/Cleaning_Data_for_Analysis.ipynb

Lines changed: 143 additions & 1 deletion
@@ -928,6 +928,143 @@
 {
 "cell_type": "markdown",
 "metadata": {},
+"source": [
+"## Case Study: Putting it all together\n",
+"\n",
+"* Apply all the techniques previously discussed to the Gapminder Foundation dataset.\n",
+"* Clean and tidy data saved to a file\n",
+"    * Ready to be loaded for analysis!\n",
+"* Dataset consists of life expectancy by country and year\n",
+"* Data will come in multiple parts\n",
+"    * Load\n",
+"    * Preliminary quality diagnosis\n",
+"    * Combine into single dataset\n",
+"\n",
+"### Useful methods\n",
+"* df = pd.read_csv('file_name.csv')\n",
+"* df.head()\n",
+"* df.info()\n",
+"* df.columns\n",
+"* df.describe()\n",
+"* df.column.value_counts()\n",
+"* df.column.plot(kind='hist')\n",
+"\n",
+"### Data Quality\n",
+"```python\n",
+"def cleaning_function(row_data):\n",
+"    # data cleaning logic\n",
+"    return ...\n",
+"# Default: axis=0 applies the function column-wise\n",
+"# axis=1 applies the function row-wise\n",
+"df.apply(cleaning_function, axis=1)\n",
+"assert (df.column_data > 0).all()\n",
+"```\n",
+"\n",
+"### Combining Data\n",
+"* pd.merge(df1, df2, ...)\n",
+"* pd.concat([df1, df2, ...])"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 18,
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"<class 'pandas.core.frame.DataFrame'>\n",
+"RangeIndex: 780 entries, 0 to 779\n",
+"Columns: 219 entries, Unnamed: 0 to Life expectancy\n",
+"dtypes: float64(217), int64(1), object(1)\n",
+"memory usage: 1.3+ MB\n"
+]
+}
+],
+"source": [
+"file = 'https://assets.datacamp.com/production/repositories/666/datasets/8e869c545c913547d94b61534b2f8d336a2c8c87/gapminder.csv'\n",
+"df = pd.read_csv(file)\n",
+"df.info()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 19,
+"metadata": {},
+"outputs": [
+{
+"data": {
+"text/plain": [
+"(20, 55)"
+]
+},
+"execution_count": 19,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"import matplotlib.pyplot as plt\n",
+"\n",
+"# Create the scatter plot\n",
+"df.plot(kind='scatter', x='1800', y='1899')\n",
+"\n",
+"# Specify axis labels\n",
+"plt.xlabel('Life Expectancy by Country in 1800')\n",
+"plt.ylabel('Life Expectancy by Country in 1899')\n",
+"\n",
+"# Specify axis limits\n",
+"plt.xlim(20, 55)\n",
+"plt.ylim(20, 55)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"* The graph shows whether the points form a diagonal line, and which points fall above or below that line.\n",
+"* Points above the diagonal are countries whose life expectancy was higher in 1899 than in 1800; points below it are countries where it was lower.\n",
+"* Note: Points that fall on the diagonal line mean that life expectancy remained the same."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 20,
+"metadata": {},
+"outputs": [
+{
+"ename": "TypeError",
+"evalue": "apply() missing 1 required positional argument: 'func'",
+"output_type": "error",
+"traceback": [
+"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
+"\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",
+"\u001b[1;32m<ipython-input-20-d4281472ee1b>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m 10\u001b[0m \u001b[1;31m# Note: loc gets rows (or columns) with particular 'labels'\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 11\u001b[0m \u001b[1;31m# iloc gets row (or columns) at a particular 'index position'\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 12\u001b[1;33m \u001b[1;32massert\u001b[0m \u001b[0mdf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0miloc\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m:\u001b[0m \u001b[1;33m,\u001b[0m \u001b[1;33m:\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
+"\u001b[1;31mTypeError\u001b[0m: apply() missing 1 required positional argument: 'func'"
+]
+}
+],
+"source": [
+"def check_null_or_valid(row):\n",
+"    \"\"\"Function that takes a row of data, drops all missing values, and checks if all remaining values are >= 0\"\"\"\n",
+"    no_na = row.dropna()\n",
+"    numeric_type = pd.to_numeric(no_na)\n",
+"\n",
+"# Check whether the last column is 'Life expectancy'\n",
+"assert df.columns[-1] == 'Life expectancy'\n",
+"\n",
+"# Check whether the values in the row are valid\n",
+"# Note: loc gets rows (or columns) with particular 'labels'\n",
+"# iloc gets row (or columns) at a particular 'index position'\n",
+"assert df.iloc[: , :].apply()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
 "source": []
 }
 ],
@@ -958,7 +1095,12 @@
 "title_cell": "Table of Contents",
 "title_sidebar": "Contents",
 "toc_cell": false,
-"toc_position": {},
+"toc_position": {
+"height": "calc(100% - 180px)",
+"left": "10px",
+"top": "150px",
+"width": "288px"
+},
 "toc_section_display": true,
 "toc_window_display": true
 }
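
The last populated code cell in this commit is unfinished: `check_null_or_valid` returns nothing, and the final assertion calls `apply()` with no function, which is what raises the recorded `TypeError`. One way the check might be completed is sketched below. It is not the author's final version. It runs on a small made-up frame rather than the Gapminder file, and the slice `iloc[:, :-1]` assumes (as the `df.info()` output suggests) that every column except the last, `Life expectancy`, is numeric.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Gapminder frame (assumption): year columns first,
# country names in the last column, 'Life expectancy'.
df = pd.DataFrame({
    "1800": [28.2, 35.4, np.nan],
    "1899": [27.3, 35.4, 40.1],
    "Life expectancy": ["Afghanistan", "Albania", "Algeria"],
})


def check_null_or_valid(row):
    """Drop missing values from a row and report whether the rest are >= 0."""
    no_na = row.dropna()
    numeric = pd.to_numeric(no_na)  # raises ValueError on non-numeric entries
    return bool((numeric >= 0).all())


# Check whether the last column is 'Life expectancy'
assert df.columns[-1] == "Life expectancy"

# Apply the check row-wise (axis=1) to every column except the last one
row_ok = df.iloc[:, :-1].apply(check_null_or_valid, axis=1)
assert row_ok.all()
```

Passing the function by name and setting `axis=1` supplies the `func` argument the traceback complains about. Excluding the country-name column keeps `pd.to_numeric` from failing on strings.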

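The markdown cell that follows the scatter plot reads the change in life expectancy off the diagonal. The same comparison can be made numerically, as in this minimal sketch. The column names `'1800'` and `'1899'` come from the plotting cell; the three rows of values and the `position` labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values (assumption); the real frame holds one row per country.
df = pd.DataFrame({
    "Life expectancy": ["A", "B", "C"],
    "1800": [30.0, 35.0, 40.0],
    "1899": [33.5, 35.0, 38.0],
})

# Positive difference: the point lies above the diagonal (life expectancy rose).
# Zero: on the diagonal (unchanged). Negative: below it (life expectancy fell).
change = df["1899"] - df["1800"]
position = np.sign(change).map({1.0: "above", 0.0: "on", -1.0: "below"})

summary = df[["Life expectancy"]].assign(change=change, position=position)
print(summary)
```

Because the plot puts 1800 on the x-axis and 1899 on the y-axis, a point above the line y = x corresponds to a positive `change`.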