Commit 6d83888

Add Cleaning_Data_for_Analysis notes

1 parent f047503 commit 6d83888

File tree

1 file changed

+242

-0

lines changed

Cleaning_Data_in_Python
- Cleaning_Data_for_Analysis.ipynb

`‎Cleaning_Data_in_Python/Cleaning_Data_for_Analysis.ipynb`

Lines changed: 242 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,242 @@`
	`1`	`+{`
	`2`	`+ "cells": [`
	`3`	`+ {`
	`4`	`+ "cell_type": "markdown",`
	`5`	`+ "metadata": {},`
	`6`	`+ "source": [`
	`7`	`+ "# Cleaning Data for Analysis\n",`
	`8`	`+ "\n",`
	`9`	`+ "## Data Types\n",`
	`10`	`+ "\n",`
	`11`	`+ "There may be times we want to convert from one data type to another\n",`
	`12`	`+ "\n",`
	`13`	`+ "Categorical Data\n",`
	`14`	`+ "\n",`
	`15`	`+ "Columns that contain categorical data, such as Male / Female, can be converted to the 'category' dtype. Doing so:\n",`
	`16`	`+ "* Can make the DataFrame smaller in memory\n",`
	`17`	`+ "* Lets other Python libraries make use of them"`
	`18`	`+ ]`
	`19`	`+ },`
	`20`	`+ {`
	`21`	`+ "cell_type": "code",`
	`22`	`+ "execution_count": 1,`
	`23`	`+ "metadata": {},`
	`24`	`+ "outputs": [`
	`25`	`+ {`
	`26`	`+ "name": "stdout",`
	`27`	`+ "output_type": "stream",`
	`28`	`+ "text": [`
	`29`	`+ "<class 'pandas.core.frame.DataFrame'>\n",`
	`30`	`+ "RangeIndex: 244 entries, 0 to 243\n",`
	`31`	`+ "Data columns (total 7 columns):\n",`
	`32`	`+ "total_bill 244 non-null float64\n",`
	`33`	`+ "tip 244 non-null float64\n",`
	`34`	`+ "sex 244 non-null object\n",`
	`35`	`+ "smoker 244 non-null object\n",`
	`36`	`+ "day 244 non-null object\n",`
	`37`	`+ "time 244 non-null object\n",`
	`38`	`+ "size 244 non-null int64\n",`
	`39`	`+ "dtypes: float64(2), int64(1), object(4)\n",`
	`40`	`+ "memory usage: 13.4+ KB\n"`
	`41`	`+ ]`
	`42`	`+ }`
	`43`	`+ ],`
	`44`	`+ "source": [`
	`45`	`+ "import pandas as pd\n",`
	`46`	`+ "df = pd.read_csv('https://assets.datacamp.com/production/repositories/666/datasets/b064fa9e0684a38ac15b0a19845367c29fde978d/tips.csv')\n",`
	`47`	`+ "df.info()"`
	`48`	`+ ]`
	`49`	`+ },`
	`50`	`+ {`
	`51`	`+ "cell_type": "code",`
	`52`	`+ "execution_count": 2,`
	`53`	`+ "metadata": {},`
	`54`	`+ "outputs": [`
	`55`	`+ {`
	`56`	`+ "name": "stdout",`
	`57`	`+ "output_type": "stream",`
	`58`	`+ "text": [`
	`59`	`+ "<class 'pandas.core.frame.DataFrame'>\n",`
	`60`	`+ "RangeIndex: 244 entries, 0 to 243\n",`
	`61`	`+ "Data columns (total 7 columns):\n",`
	`62`	`+ "total_bill 244 non-null float64\n",`
	`63`	`+ "tip 244 non-null float64\n",`
	`64`	`+ "sex 244 non-null category\n",`
	`65`	`+ "smoker 244 non-null bool\n",`
	`66`	`+ "day 244 non-null object\n",`
	`67`	`+ "time 244 non-null object\n",`
	`68`	`+ "size 244 non-null int64\n",`
	`69`	`+ "dtypes: bool(1), category(1), float64(2), int64(1), object(2)\n",`
	`70`	`+ "memory usage: 10.2+ KB\n"`
	`71`	`+ ]`
	`72`	`+ }`
	`73`	`+ ],`
	`74`	`+ "source": [`
	`75`	`+ "# Converting Data Types\n",`
	`76`	`+ "df['smoker'] = df['smoker'] == 'Yes'  # assumes 'Yes'/'No' values; astype('bool') would make every non-empty string True\n",`
	`77`	`+ "df['sex'] = df['sex'].astype('category')\n",`
	`78`	`+ "df.info()"`
	`79`	`+ ]`
	`80`	`+ },`
	`81`	`+ {`
	`82`	`+ "cell_type": "markdown",`
	`83`	`+ "metadata": {},`
	`84`	`+ "source": [`
	`85`	`+ "### Converting Data Types\n",`
	`86`	`+ "* Numeric data loaded as a string is usually a sign of bad data that needs to be cleaned"`
	`87`	`+ ]`
	`88`	`+ },`
	`89`	`+ {`
	`90`	`+ "cell_type": "code",`
	`91`	`+ "execution_count": 3,`
	`92`	`+ "metadata": {},`
	`93`	`+ "outputs": [`
	`94`	`+ {`
	`95`	`+ "data": {`
	`96`	`+ "text/plain": [`
	`97`	`+ "total_bill float64\n",`
	`98`	`+ "tip float64\n",`
	`99`	`+ "sex category\n",`
	`100`	`+ "smoker bool\n",`
	`101`	`+ "day object\n",`
	`102`	`+ "time object\n",`
	`103`	`+ "size int64\n",`
	`104`	`+ "dtype: object"`
	`105`	`+ ]`
	`106`	`+ },`
	`107`	`+ "execution_count": 3,`
	`108`	`+ "metadata": {},`
	`109`	`+ "output_type": "execute_result"`
	`110`	`+ }`
	`111`	`+ ],`
	`112`	`+ "source": [`
	`113`	`+ "# Converting total_bill and tip into numeric dtypes\n",`
	`114`	`+ "# errors='coerce' sets values that cannot be parsed to NaN\n",`
	`115`	`+ "df['total_bill'] = pd.to_numeric(df['total_bill'], errors='coerce')\n",`
	`116`	`+ "df['tip'] = pd.to_numeric(df['tip'], errors='coerce')\n",`
	`117`	`+ "df.dtypes"`
	`118`	`+ ]`
	`119`	`+ },`
	`120`	`+ {`
	`121`	`+ "cell_type": "markdown",`
	`122`	`+ "metadata": {},`
	`123`	`+ "source": [`
	`124`	`+ "## String Manipulation\n",`
	`125`	`+ "\n",`
	`126`	`+ "* Much of data cleaning involves string manipulation\n",`
	`127`	`+ "* Most of the world's data is unstructured text\n",`
	`128`	`+ "* Python has many built-in and external libraries\n",`
	`129`	`+ "* 're' library for regular expressions\n",`
	`130`	`+ "\n",`
	`131`	`+ "### Regular Expression Match Example\n",`
	`132`	`+ "\n",`
	`133`	`+ "\\* - Matches the preceding element zero or more times\n",`
	`134`	`+ "\n",`
	`135`	`+ "{2} - Matches the preceding element exactly 2 times\n",`
	`136`	`+ "\n",`
	`137`	`+ "^ - Anchors the match at the beginning of the value\n",`
	`138`	`+ "\n",`
	`139`	`+ "$ - Anchors the match at the end of the value\n",`
	`140`	`+ "\n",`
	`141`	`+ "\|Value \|Pattern Matched \|Regular Expression\|\n",`
	`142`	`+ "\|-----------\|-------------------\|------------------\|\n",`
	`143`	`+ "\|17 \|12345678901 \|\\d* \|\n",`
	`144`	`+ "\|\\$17 \|\\$12345678901 \|\\\\$\\d* \|\n",`
	`145`	`+ "\|\\$17.00 \|\\$12345678901.24 \|\\\\$\\d*\\\\.\\d* \|\n",`
	`146`	`+ "\|\\$17.89 \|\\$12345678901.24 \|\\\\$\\d*\\\\.\\d{2} \|\n",`
	`147`	`+ "\|\\$17.895 \|\\$12345678901.999 \|^\\\\$\\d*\\\\.\\d{2}\\$ \|\n",`
	`148`	`+ "\n",`
	`149`	`+ "#### Using Regular Expressions\n",`
	`150`	`+ "\n",`
	`151`	`+ "* Compile the pattern\n",`
	`152`	`+ "* Use the compiled pattern to match values\n",`
	`153`	`+ "* This lets us reuse the pattern over and over again\n",`
	`154`	`+ "* Useful since we want to match values down a column of values"`
	`155`	`+ ]`
	`156`	`+ },`
	`157`	`+ {`
	`158`	`+ "cell_type": "code",`
	`159`	`+ "execution_count": 4,`
	`160`	`+ "metadata": {},`
	`161`	`+ "outputs": [`
	`162`	`+ {`
	`163`	`+ "name": "stdout",`
	`164`	`+ "output_type": "stream",`
	`165`	`+ "text": [`
	`166`	`+ "True\n",`
	`167`	`+ "False\n"`
	`168`	`+ ]`
	`169`	`+ }`
	`170`	`+ ],`
	`171`	`+ "source": [`
	`172`	`+ "import re\n",`
	`173`	`+ "\n",`
	`174`	`+ "# RegEx Pattern - Match a Phone Number in the format of xxx-xxx-xxxx\n",`
	`175`	`+ "pattern = re.compile(r'\\d{3}-\\d{3}-\\d{4}')\n",`
	`176`	`+ "\n",`
	`177`	`+ "# See if the pattern matches\n",`
	`178`	`+ "result = pattern.match('123-456-7890')\n",`
	`179`	`+ "result2 = pattern.match('1123-456-7890')\n",`
	`180`	`+ "\n",`
	`181`	`+ "print(f'{bool(result)}')\n",`
	`182`	`+ "print(f'{bool(result2)}')"`
	`183`	`+ ]`
	`184`	`+ },`
	`185`	`+ {`
	`186`	`+ "cell_type": "code",`
	`187`	`+ "execution_count": 5,`
	`188`	`+ "metadata": {},`
	`189`	`+ "outputs": [`
	`190`	`+ {`
	`191`	`+ "name": "stdout",`
	`192`	`+ "output_type": "stream",`
	`193`	`+ "text": [`
	`194`	`+ "['10', '1']\n"`
	`195`	`+ ]`
	`196`	`+ }`
	`197`	`+ ],`
	`198`	`+ "source": [`
	`199`	`+ "# Find the numeric values in a string\n",`
	`200`	`+ "# re.findall needs the pattern and the string to search;\n",`
	`201`	`+ "# the sample sentence below is only an illustration\n",`
	`202`	`+ "text = 'the recipe calls for 10 strawberries and 1 banana'\n",`
	`203`	`+ "matches = re.findall(r'\\d+', text)\n",`
	`204`	`+ "print(matches)"`
	`205`	`+ ]`
	`206`	`+ }`
	`207`	`+ ],`
	`208`	`+ "metadata": {`
	`209`	`+ "kernelspec": {`
	`210`	`+ "display_name": "Python 3",`
	`211`	`+ "language": "python",`
	`212`	`+ "name": "python3"`
	`213`	`+ },`
	`214`	`+ "language_info": {`
	`215`	`+ "codemirror_mode": {`
	`216`	`+ "name": "ipython",`
	`217`	`+ "version": 3`
	`218`	`+ },`
	`219`	`+ "file_extension": ".py",`
	`220`	`+ "mimetype": "text/x-python",`
	`221`	`+ "name": "python",`
	`222`	`+ "nbconvert_exporter": "python",`
	`223`	`+ "pygments_lexer": "ipython3",`
	`224`	`+ "version": "3.7.3"`
	`225`	`+ },`
	`226`	`+ "toc": {`
	`227`	`+ "base_numbering": 1,`
	`228`	`+ "nav_menu": {},`
	`229`	`+ "number_sections": true,`
	`230`	`+ "sideBar": true,`
	`231`	`+ "skip_h1_title": false,`
	`232`	`+ "title_cell": "Table of Contents",`
	`233`	`+ "title_sidebar": "Contents",`
	`234`	`+ "toc_cell": false,`
	`235`	`+ "toc_position": {},`
	`236`	`+ "toc_section_display": true,`
	`237`	`+ "toc_window_display": false`
	`238`	`+ }`
	`239`	`+ },`
	`240`	`+ "nbformat": 4,`
	`241`	`+ "nbformat_minor": 2`
	`242`	`+}`
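
The dtype conversions in the notebook can be sketched on a small in-memory frame, which avoids the download. The column names mirror tips.csv, but the rows are made-up sample values, and the 'Yes'/'No' encoding of `smoker` is an assumption about the file rather than something the diff confirms.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from tips.csv.
df = pd.DataFrame({
    "total_bill": ["16.99", "10.34", "missing", "23.68"],
    "sex": ["Female", "Male", "Male", "Female"],
    "smoker": ["No", "No", "Yes", "No"],
})

# A categorical column stores each distinct label once plus a code per row.
df["sex"] = df["sex"].astype("category")

# astype('bool') would mark every non-empty string as True, so compare instead.
# Assumes the column only holds 'Yes' and 'No'.
df["smoker"] = df["smoker"] == "Yes"

# errors='coerce' turns values that cannot be parsed into NaN.
df["total_bill"] = pd.to_numeric(df["total_bill"], errors="coerce")

print(df.dtypes)
print(df["total_bill"].isna().sum())
```

Counting the NaN values after coercion is a quick way to see how many entries were not valid numbers before deciding how to handle them.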

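Reusing a compiled pattern over many values, as the "Using Regular Expressions" notes describe, might look like the sketch below. The phone numbers and the list names are invented for illustration. `fullmatch` is used for the phone numbers so that trailing characters are rejected too; that is a choice of this sketch, not something the notebook does.

```python
import re

# Compile once, then reuse for every value in a column.
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

# Hypothetical values to validate.
phone_numbers = ["123-456-7890", "1123-456-7890", "123-456-78901"]

# fullmatch succeeds only when the whole string fits the pattern.
is_valid_phone = [bool(phone_pattern.fullmatch(v)) for v in phone_numbers]
print(is_valid_phone)  # [True, False, False]

# The anchored dollar pattern from the last row of the table.
dollar_pattern = re.compile(r"^\$\d*\.\d{2}$")
amounts = ["$17.89", "$12345678901.24", "$17.895"]

# ^ and $ force exactly two decimal places, so "$17.895" is rejected.
is_valid_amount = [bool(dollar_pattern.match(v)) for v in amounts]
print(is_valid_amount)  # [True, True, False]
```

Without the `^` and `$` anchors, `match` would accept "$17.895" because the pattern only has to fit the start of the string.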
0 commit comments
