Commit a60f70d

committed

Add starting code for Case Study - Gapminder Foundation dataset

1 parent 1b0ecef commit a60f70d

File tree

1 file changed

+143

-1

lines changed

Cleaning_Data_in_Python
- Cleaning_Data_for_Analysis.ipynb

`‎Cleaning_Data_in_Python/Cleaning_Data_for_Analysis.ipynb`

Lines changed: 143 additions & 1 deletion

Original file line number	Diff line number	Diff line change
`@@ -928,6 +928,143 @@`
`928`	`928`	`{`
`929`	`929`	`"cell_type": "markdown",`
`930`	`930`	`"metadata": {},`
	`931`	`+ "source": [`
	`932`	`+ "## Case Study - Putting it all together\n",`
	`933`	`+ "\n",`
	`934`	`+ "* Using all techniques previously discussed to work on the Gapminder Foundation dataset.\n",`
	`935`	`+ "* Clean and tidy data saved to a file\n",`
	`936`	`+ " * Ready to be loaded for analysis!\n",`
	`937`	`+ "* Dataset consists of life expectancy by country and year\n",`
	`938`	`+ "* Data will come in multiple parts\n",`
	`939`	`+ " * Load\n",`
	`940`	`+ " * Preliminary quality diagnosis\n",`
	`941`	`+ " * Combine into single dataset\n",`
	`942`	`+ "\n",`
	`943`	`+ "### Useful methods\n",`
	`944`	`+ "* df = pd.read_csv('file_name.csv')\n",`
	`945`	`+ "* df.head()\n",`
	`946`	`+ "* df.info()\n",`
	`947`	`+ "* df.columns\n",`
	`948`	`+ "* df.describe()\n",`
	`949`	`+ "* df.column.value_counts()\n",`
	`950`	`+ "* df.column.plot('hist')\n",`
	`951`	`+ "\n",`
	`952`	`+ "### Data Quality\n",`
	`953`	+ "```python\n",
	`954`	`+ "def cleaning_function(row_data):\n",`
	`955`	`+ " # data cleaning logic\n",`
	`956`	`+ " return ...\n",`
	`957`	`+ "# Default: Axis = 0 will apply the function column-wise\n",`
	`958`	`+ "# Axis = 1 will apply the function row-wise\n",`
	`959`	`+ "df.apply(cleaning_function, axis=1)\n",`
	`960`	`+ "assert (df.column_data > 0).all()\n",`
	`961`	+ "```\n",
	`962`	`+ "\n",`
	`963`	`+ "### Combining Data\n",`
	`964`	`+ "* pd.merge(df1, df2, ...)\n",`
	`965`	`+ "* pd.concat([df1, df2, ...])"`
	`966`	`+ ]`
	`967`	`+ },`
	`968`	`+ {`
	`969`	`+ "cell_type": "code",`
	`970`	`+ "execution_count": 18,`
	`971`	`+ "metadata": {},`
	`972`	`+ "outputs": [`
	`973`	`+ {`
	`974`	`+ "name": "stdout",`
	`975`	`+ "output_type": "stream",`
	`976`	`+ "text": [`
	`977`	`+ "<class 'pandas.core.frame.DataFrame'>\n",`
	`978`	`+ "RangeIndex: 780 entries, 0 to 779\n",`
	`979`	`+ "Columns: 219 entries, Unnamed: 0 to Life expectancy\n",`
	`980`	`+ "dtypes: float64(217), int64(1), object(1)\n",`
	`981`	`+ "memory usage: 1.3+ MB\n"`
	`982`	`+ ]`
	`983`	`+ }`
	`984`	`+ ],`
	`985`	`+ "source": [`
	`986`	`+ "file ='https://assets.datacamp.com/production/repositories/666/datasets/8e869c545c913547d94b61534b2f8d336a2c8c87/gapminder.csv'\n",`
	`987`	`+ "df = pd.read_csv(file)\n",`
	`988`	`+ "df.info()"`
	`989`	`+ ]`
	`990`	`+ },`
	`991`	`+ {`
	`992`	`+ "cell_type": "code",`
	`993`	`+ "execution_count": 19,`
	`994`	`+ "metadata": {},`
	`995`	`+ "outputs": [`
	`996`	`+ {`
	`997`	`+ "data": {`
	`998`	`+ "text/plain": [`
	`999`	`+ "(20, 55)"`
	`1000`	`+ ]`
	`1001`	`+ },`
	`1002`	`+ "execution_count": 19,`
	`1003`	`+ "metadata": {},`
	`1004`	`+ "output_type": "execute_result"`
	`1005`	`+ }`
	`1006`	`+ ],`
	`1007`	`+ "source": [`
	`1008`	`+ "import matplotlib.pyplot as plt\n",`
	`1009`	`+ "\n",`
	`1010`	`+ "# Create the scatter plot\n",`
	`1011`	`+ "df.plot(kind='scatter', x='1800', y='1899')\n",`
	`1012`	`+ "\n",`
	`1013`	`+ "# Specify axis labels\n",`
	`1014`	`+ "plt.xlabel('Life Expectancy by Country in 1800')\n",`
	`1015`	`+ "plt.ylabel('Life Expectancy by Country in 1899')\n",`
	`1016`	`+ "\n",`
	`1017`	`+ "# Specify axis limits\n",`
	`1018`	`+ "plt.xlim(20, 55)\n",`
	`1019`	`+ "plt.ylim(20, 55)"`
	`1020`	`+ ]`
	`1021`	`+ },`
	`1022`	`+ {`
	`1023`	`+ "cell_type": "markdown",`
	`1024`	`+ "metadata": {},`
	`1025`	`+ "source": [`
	`1026`	`+ "* Looking at the graph, we can see whether the scatter plot takes the form of a diagonal line, and which points fall above or below that line.\n",`
	`1027`	`+ "* Points that fall above or below the diagonal line show how life expectancy in 1899 changed compared to 1800 for different countries.\n",`
	`1028`	`+ "* Note: When a point falls on the diagonal line, life expectancy remained the same for that country."`
	`1029`	`+ ]`
	`1030`	`+ },`
	`1031`	`+ {`
	`1032`	`+ "cell_type": "code",`
	`1033`	`+ "execution_count": 20,`
	`1034`	`+ "metadata": {},`
	`1035`	`+ "outputs": [`
	`1036`	`+ {`
	`1037`	`+ "ename": "TypeError",`
	`1038`	`+ "evalue": "apply() missing 1 required positional argument: 'func'",`
	`1039`	`+ "output_type": "error",`
	`1040`	`+ "traceback": [`
	`1041`	`+ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",`
	`1042`	`+ "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)",`
	`1043`	+ "\u001b[1;32m<ipython-input-20-d4281472ee1b>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m 10\u001b[0m \u001b[1;31m# Note: loc gets rows (or columns) with particular 'labels'\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 11\u001b[0m \u001b[1;31m# iloc gets row (or columns) at a particular 'index position'\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 12\u001b[1;33m \u001b[1;32massert\u001b[0m \u001b[0mdf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0miloc\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m:\u001b[0m \u001b[1;33m,\u001b[0m \u001b[1;33m:\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
	`1044`	`+ "\u001b[1;31mTypeError\u001b[0m: apply() missing 1 required positional argument: 'func'"`
	`1045`	`+ ]`
	`1046`	`+ }`
	`1047`	`+ ],`
	`1048`	`+ "source": [`
	`1049`	`+ "def check_null_or_valid(row):\n",`
	`1050`	`+ " \"\"\"Function that takes a row of data, drops all missing values, and checks if all remaining values are >= 0\"\"\"\n",`
	`1051`	`+ " no_na = row.dropna()\n",`
	`1052`	`+ " numeric_type = pd.to_numeric(no_na)\n",`
	`1053`	`+ "\n",`
	`1054`	`+ "# Check whether the last column is 'Life expectancy'\n",`
	`1055`	`+ "assert df.columns[-1] == 'Life expectancy'\n",`
	`1056`	`+ "\n",`
	`1057`	`+ "# Check whether the values in the row are valid\n",`
	`1058`	`+ "# Note: loc gets rows (or columns) with particular 'labels'\n",`
	`1059`	`+ "# iloc gets row (or columns) at a particular 'index position'\n",`
	`1060`	`+ "assert df.iloc[: , :].apply()"`
	`1061`	`+ ]`
	`1062`	`+ },`
	`1063`	`+ {`
	`1064`	`+ "cell_type": "code",`
	`1065`	`+ "execution_count": null,`
	`1066`	`+ "metadata": {},`
	`1067`	`+ "outputs": [],`
`931`	`1068`	`"source": []`
`932`	`1069`	`}`
`933`	`1070`	`],`
`@@ -958,7 +1095,12 @@`
`958`	`1095`	`"title_cell": "Table of Contents",`
`959`	`1096`	`"title_sidebar": "Contents",`
`960`	`1097`	`"toc_cell": false,`
`961`		`- "toc_position": {},`
	`1098`	`+ "toc_position": {`
	`1099`	`+ "height": "calc(100% - 180px)",`
	`1100`	`+ "left": "10px",`
	`1101`	`+ "top": "150px",`
	`1102`	`+ "width": "288px"`
	`1103`	`+ },`
`962`	`1104`	`"toc_section_display": true,`
`963`	`1105`	`"toc_window_display": true`
`964`	`1106`	`}`
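
The markdown cell added after the scatter plot reads the plot relative to the diagonal: points above it gained life expectancy between 1800 and 1899, points below it lost, and points on it stayed the same. The same reading can be checked numerically. The following is a minimal sketch on a made-up frame; the values and the name `toy` are hypothetical and are not taken from the Gapminder file.

```python
import pandas as pd

# Hypothetical values for illustration only (not from gapminder.csv).
toy = pd.DataFrame({
    "1800": [28.0, 35.0, 40.0, 32.0],
    "1899": [30.0, 35.0, 38.5, None],
})

# Keep only rows where both years are present, as the scatter plot would.
pair = toy[["1800", "1899"]].dropna()
change = pair["1899"] - pair["1800"]

above = int((change > 0).sum())     # points above the diagonal: life expectancy rose
on_line = int((change == 0).sum())  # points on the diagonal: unchanged
below = int((change < 0).sum())     # points below the diagonal: life expectancy fell

print(above, on_line, below)
```

On the real data the same three counts could be computed from `df['1800']` and `df['1899']`, assuming those columns are numeric as `df.info()` reports.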

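The last code cell in this commit stops at a `TypeError` because `apply()` is called without a function, and `check_null_or_valid` does not return anything yet. One possible way to finish it is sketched below. This is an assumption about the intended logic based on the docstring, not the author's final version: the function returns a single boolean per row, and the slice `iloc[:, 1:-1]` assumes the first column is the unnamed index column and the last is `'Life expectancy'`, as the `df.info()` output suggests. The frame `toy` is a hypothetical stand-in for `df`.

```python
import numpy as np
import pandas as pd


def check_null_or_valid(row):
    """Drop missing values from a row and check that all remaining values are >= 0."""
    no_na = row.dropna()
    numeric = pd.to_numeric(no_na)
    return bool((numeric >= 0).all())


# Hypothetical stand-in for the Gapminder frame: an index-like first column,
# year columns in the middle, and 'Life expectancy' (country names) last.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "1800": [28.2, np.nan, 35.4],
    "1801": [28.2, 31.5, np.nan],
    "Life expectancy": ["Afghanistan", "Albania", "Algeria"],
})

# Check whether the last column is 'Life expectancy'
assert toy.columns[-1] == "Life expectancy"

# Apply the check row-wise (axis=1) to the year columns only.
row_is_valid = toy.iloc[:, 1:-1].apply(check_null_or_valid, axis=1)
all_valid = bool(row_is_valid.all())
assert all_valid
```

Returning one boolean per row keeps the result a plain Series, so a single `.all()` is enough for the final assertion.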