Commit 6d83888

Add Cleaning_Data_for_Analysis notes
1 parent f047503 commit 6d83888

1 file changed

+242
-0
lines changed

@@ -0,0 +1,242 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Cleaning Data for Analysis\n",
8+
"\n",
9+
"## Data Types\n",
10+
"\n",
11+
"There may be times when we want to convert a column from one data type to another.\n",
12+
"\n",
13+
"**Categorical Data**\n",
14+
"\n",
15+
"Columns that contain categorical data, such as Male / Female, can be converted to the 'category' dtype. Doing so:\n",
16+
"* Can make the DataFrame smaller in memory\n",
17+
"* Lets other Python libraries recognize and use the column as categorical data"
18+
]
19+
},
20+
{
21+
"cell_type": "code",
22+
"execution_count": 1,
23+
"metadata": {},
24+
"outputs": [
25+
{
26+
"name": "stdout",
27+
"output_type": "stream",
28+
"text": [
29+
"<class 'pandas.core.frame.DataFrame'>\n",
30+
"RangeIndex: 244 entries, 0 to 243\n",
31+
"Data columns (total 7 columns):\n",
32+
"total_bill 244 non-null float64\n",
33+
"tip 244 non-null float64\n",
34+
"sex 244 non-null object\n",
35+
"smoker 244 non-null object\n",
36+
"day 244 non-null object\n",
37+
"time 244 non-null object\n",
38+
"size 244 non-null int64\n",
39+
"dtypes: float64(2), int64(1), object(4)\n",
40+
"memory usage: 13.4+ KB\n"
41+
]
42+
}
43+
],
44+
"source": [
45+
"import pandas as pd\n",
46+
"df = pd.read_csv('https://assets.datacamp.com/production/repositories/666/datasets/b064fa9e0684a38ac15b0a19845367c29fde978d/tips.csv')\n",
47+
"df.info()"
48+
]
49+
},
50+
{
51+
"cell_type": "code",
52+
"execution_count": 2,
53+
"metadata": {},
54+
"outputs": [
55+
{
56+
"name": "stdout",
57+
"output_type": "stream",
58+
"text": [
59+
"<class 'pandas.core.frame.DataFrame'>\n",
60+
"RangeIndex: 244 entries, 0 to 243\n",
61+
"Data columns (total 7 columns):\n",
62+
"total_bill 244 non-null float64\n",
63+
"tip 244 non-null float64\n",
64+
"sex 244 non-null category\n",
65+
"smoker 244 non-null bool\n",
66+
"day 244 non-null object\n",
67+
"time 244 non-null object\n",
68+
"size 244 non-null int64\n",
69+
"dtypes: bool(1), category(1), float64(2), int64(1), object(2)\n",
70+
"memory usage: 10.2+ KB\n"
71+
]
72+
}
73+
],
74+
"source": [
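75+
"# Converting data types ('Yes'/'No' labels are assumed; astype('bool') would turn every non-empty string into True)\n",
76+
"df['smoker'] = df['smoker'].map({'Yes': True, 'No': False})\n",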
77+
"df['sex'] = df['sex'].astype('category')\n",
78+
"df.info()"
79+
]
80+
},
81+
{
82+
"cell_type": "markdown",
83+
"metadata": {},
84+
"source": [
85+
"### Converting Data Types\n",
86+
"* Numeric data loaded as a string (object dtype) is usually a sign of bad data that needs to be cleaned"
87+
]
88+
},
89+
{
90+
"cell_type": "code",
91+
"execution_count": 3,
92+
"metadata": {},
93+
"outputs": [
94+
{
95+
"data": {
96+
"text/plain": [
97+
"total_bill float64\n",
98+
"tip float64\n",
99+
"sex category\n",
100+
"smoker bool\n",
101+
"day object\n",
102+
"time object\n",
103+
"size int64\n",
104+
"dtype: object"
105+
]
106+
},
107+
"execution_count": 3,
108+
"metadata": {},
109+
"output_type": "execute_result"
110+
}
111+
],
112+
"source": [
113+
"# Converting total_bill and tip into a numeric dtype\n",
114+
"# errors='coerce' sets values that cannot be parsed to NaN\n",
115+
"df['total_bill'] = pd.to_numeric(df['total_bill'], errors='coerce')\n",
116+
"df['tip'] = pd.to_numeric(df['tip'], errors='coerce')\n",
117+
"df.dtypes"
118+
]
119+
},
120+
{
121+
"cell_type": "markdown",
122+
"metadata": {},
123+
"source": [
124+
"## String Manipulation\n",
125+
"\n",
126+
"* Much of data cleaning involves string manipulation\n",
127+
"* Most of the world's data is unstructured text\n",
128+
"* Python has many built-in and external libraries for working with strings\n",
129+
"* The built-in 're' library handles regular expressions\n",
130+
"\n",
131+
"### Regular Expression Match Example\n",
132+
"\n",
133+
"**\\*** - Matches the preceding element zero or more times\n",
134+
"\n",
135+
"**{2}** - Matches the preceding element exactly 2 times\n",
136+
"\n",
137+
"**^** - Anchors the match at the beginning of the value\n",
138+
"\n",
139+
"**$** - Anchors the match at the end of the value\n",
140+
"\n",
141+
"|Value |Pattern Matched |Regular Expression|\n",
142+
"|-----------|-------------------|------------------|\n",
143+
"|17 |12345678901 |\\d* |\n",
144+
"|\\$17 |\\$12345678901 |\\\\$\\d* |\n",
145+
"|\\$17.00 |\\$12345678901.24 |\\\\$\\d*\\\\.\\d* |\n",
146+
"|\\$17.89 |\\$12345678901.24 |\\\\$\\d*\\\\.\\d{2} |\n",
147+
"|\\$17.895 |\\$12345678901.999 |^\\\\$\\d*\\\\.\\d{2}\\$ |\n",
148+
"\n",
149+
"#### Using Regular Expressions\n",
150+
"\n",
151+
"* Compile the pattern\n",
152+
"* Use the compiled pattern to match values\n",
153+
"* This lets us reuse the pattern over and over again\n",
154+
"* Useful since we want to match values down a column of values"
155+
]
156+
},
157+
{
158+
"cell_type": "code",
159+
"execution_count": 4,
160+
"metadata": {},
161+
"outputs": [
162+
{
163+
"name": "stdout",
164+
"output_type": "stream",
165+
"text": [
166+
"True\n",
167+
"False\n"
168+
]
169+
}
170+
],
171+
"source": [
172+
"import re\n",
173+
"\n",
174+
"# RegEx Pattern - Match a Phone Number in the format of xxx-xxx-xxxx\n",
175+
"pattern = re.compile(r'\\d{3}-\\d{3}-\\d{4}')\n",
176+
"\n",
177+
"# See if the pattern matches\n",
178+
"result = pattern.match('123-456-7890')\n",
179+
"result2 = pattern.match('1123-456-7890')\n",
180+
"\n",
181+
"print(bool(result))\n",
182+
"print(bool(result2))"
183+
]
184+
},
185+
{
186+
"cell_type": "code",
187+
"execution_count": 5,
188+
"metadata": {},
189+
"outputs": [
190+
{
191+
"data": {
192+
"text/plain": [
193+
"['10', '1']"
194+
]
195+
},
196+
"execution_count": 5,
197+
"metadata": {},
198+
"output_type": "execute_result"
199+
}
200+
],
201+
"source": [
202+
"# Find the numeric values in a string (sample text is an assumed example; \\d+ skips the empty matches \\d* would return)\n",
203+
"matches = re.findall(r'\\d+', 'the recipe calls for 10 strawberries and 1 banana')\n",
204+
"matches"
205+
]
206+
}
207+
],
208+
"metadata": {
209+
"kernelspec": {
210+
"display_name": "Python 3",
211+
"language": "python",
212+
"name": "python3"
213+
},
214+
"language_info": {
215+
"codemirror_mode": {
216+
"name": "ipython",
217+
"version": 3
218+
},
219+
"file_extension": ".py",
220+
"mimetype": "text/x-python",
221+
"name": "python",
222+
"nbconvert_exporter": "python",
223+
"pygments_lexer": "ipython3",
224+
"version": "3.7.3"
225+
},
226+
"toc": {
227+
"base_numbering": 1,
228+
"nav_menu": {},
229+
"number_sections": true,
230+
"sideBar": true,
231+
"skip_h1_title": false,
232+
"title_cell": "Table of Contents",
233+
"title_sidebar": "Contents",
234+
"toc_cell": false,
235+
"toc_position": {},
236+
"toc_section_display": true,
237+
"toc_window_display": false
238+
}
239+
},
240+
"nbformat": 4,
241+
"nbformat_minor": 2
242+
}
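
As a quick check of the regular-expression table in the notebook, the sketch below runs each pattern against sample values with `re.match`. The dictionary keys and the `matches` helper are hypothetical names chosen for illustration; the patterns and values are taken from the table.

```python
import re

# Patterns from the notebook's table, written as raw strings so the
# backslashes reach the regex engine unchanged.
# The dictionary keys are hypothetical labels, not part of the notebook.
patterns = {
    "digits": r"\d*",
    "dollars": r"\$\d*",
    "dollars_any_cents": r"\$\d*\.\d*",
    "dollars_two_cents": r"\$\d*\.\d{2}",
    "anchored": r"^\$\d*\.\d{2}$",
}


def matches(name, value):
    """Return True if the named pattern matches starting at the beginning of value."""
    return re.match(patterns[name], value) is not None


# Pair each pattern with the sample values from the same table row.
rows = [
    ("digits", ["17", "12345678901"]),
    ("dollars", ["$17", "$12345678901"]),
    ("dollars_any_cents", ["$17.00", "$12345678901.24"]),
    ("dollars_two_cents", ["$17.89", "$12345678901.24"]),
    ("anchored", ["$17.895", "$12345678901.999"]),
]

for name, values in rows:
    for value in values:
        print(f"{patterns[name]!r:24} {value!r:22} {matches(name, value)}")
```

Because `re.match` only anchors at the start of the string, the unanchored two-decimal pattern still accepts `$17.895` (it matches the prefix `$17.89`), while the version wrapped in `^` and `$` rejects it. The values in the table's last row therefore illustrate a non-match.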

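The notebook's point about compiling a pattern once and reusing it down a column of values can be sketched with a plain Python list standing in for the column. The names `phone_pattern` and `phone_column` and the sample values are assumptions made for this illustration; only the xxx-xxx-xxxx pattern comes from the notebook.

```python
import re

# Compile once, then reuse the compiled pattern for every value.
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

# Hypothetical column of values; a pandas Series could be iterated the same way.
phone_column = ["123-456-7890", "1123-456-7890", "123-456-78901", "555-000-1111"]

# match() only requires the pattern to fit at the start of the value.
prefix_ok = [bool(phone_pattern.match(value)) for value in phone_column]

# fullmatch() requires the whole value to fit the pattern.
whole_ok = [bool(phone_pattern.fullmatch(value)) for value in phone_column]

print(prefix_ok)
print(whole_ok)
```

The difference shows up on `123-456-78901`: `match` accepts it because the first twelve characters fit, while `fullmatch` rejects the trailing digit. When validating a column, `fullmatch` (or explicit `^` and `$` anchors) is usually the stricter and safer choice.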