Commit 8abf811
Update Introduction to Machine Learning
1 parent a78f324 commit 8abf811

1 file changed: Introduction to Machine Learning (85 additions, 2 deletions)

An orchestrated and repeatable pattern which systematically transforms and processes ...

4. Training the model;
5. Testing the model;

------------------------------------------------------------------------------------------------------------------------

1. Asking the Right Question
-----------------------------
a. Define scope (including data sources);
d. Define how solution is created;

- Use the Machine Learning Workflow to process and transform Pima Indian data to create a prediction model. This model
must predict which people are likely to develop diabetes with 70% or greater accuracy.

---------------------------------------------------------------------------------------------------------------------------

2. Preparing data
---------------------
a. Tidy Data
- Tidy datasets are easy to manipulate, model and visualize, and have a specific structure:
Data Rule #2:
* Columns to eliminate - Not used, no values, duplicates;
* Correlated columns - Same information in a different format, add little value, and cause the algorithm to get confused;
* Modeling Data - Adjusting data types, creating columns, if required;
* Dealing with missing data -
  - Ignore it - Algorithms may fail;
  - Impute it - update to "reasonable" values - Most frequent, Mean, Median, Expert reasonable value;

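The imputation options above can be sketched with scikit-learn's SimpleImputer (a minimal sketch; assumes scikit-learn is installed and uses a made-up glucose column, not the real Pima data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up glucose column; np.nan marks the missing readings.
glucose = np.array([[95.0], [110.0], [np.nan], [120.0], [np.nan]])

# strategy can be "mean", "median", or "most_frequent",
# matching the imputation choices listed above.
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(glucose)
print(filled.ravel())  # missing entries replaced by the column mean
```

Dropping the incomplete rows instead (the "ignore it" option) is a one-liner with pandas' dropna(), at the cost of losing those rows' other column values.
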
Data Rule #3:
----------------
Data Rule #4:
--------------
Track how to manipulate data;

-------------------------------------------------------------------------------------------------------------------------

3. Selecting the algorithm:
------------------------------
Role of the Algorithm
- fit the training set and predict on the real data;
- (fit()) training data -> Algorithm -> model;
- (predict()) real data -> Model -> result;
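
The fit()/predict() flow above, sketched with scikit-learn (an illustration with made-up arrays standing in for the Pima data, not the course's code):

```python
from sklearn.naive_bayes import GaussianNB

# Made-up training data: two features per person, binary label.
X_train = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
y_train = [0, 0, 1, 1]

model = GaussianNB()
model.fit(X_train, y_train)      # training data -> Algorithm -> model

X_real = [[1.5, 1.5], [8.5, 8.5]]
result = model.predict(X_real)   # real data -> Model -> result
print(list(result))              # predicted labels, one per input row
```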

Over 50 algorithms
- algorithm selection
*. Compare factors;
*. Differences of opinion about which factors are important;
*. Develop your own factors;

Algorithm Decision Factors
--------------------------
i. Learning Type
ii. Result
iii. Complexity
iv. Basic vs Enhanced

i. Learning Type:
"Use the Machine Learning Workflow to process and transform Pima Indian data to create a "prediction model". This model must
predict which people are likely to develop diabetes with 70% or greater accuracy."

-> Prediction Model => Supervised machine learning;
Over 28 algorithms

ii. Result
a. Regression - continuous values;
b. Classification - discrete values;

"Use the Machine Learning Workflow to process and transform Pima Indian data to create a prediction model. This model must
"predict which people are likely to develop diabetes" with 70% or greater accuracy."

- Diabetes
- Binary (True/False)
- Algorithm must support classification - Binary classification;
** Over 20 algorithms;

iii. Complexity
- Keep it simple;
- Eliminate ensemble algorithms - Container algorithms; multiple child algorithms, boost performance, can be difficult to debug;
** Over 14 algorithms;

iv. Enhanced vs. Basic
- Enhanced - variation of basic, performance improvements, additional functionality, more complex;
- Basic - simpler, easier to understand;

Candidate Algorithms
--------------------
a. Naive Bayes;
b. Logistic Regression;
c. Decision Tree;

a. Naive Bayes - Based on likelihood and probability; every feature has the same weight; requires a smaller amount of data;
b. Logistic Regression - Binary result, relations between features are weighted;
c. Decision Tree - Binary tree, each node contains a decision, requires enough data to determine nodes and splits;

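One way to compare the three candidates is to fit each on the same train/test split and check accuracy; a sketch on synthetic data (make_classification stands in for the Pima set, which is not loaded here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 8 numeric features, binary outcome (like Pima).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

candidates = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                              # train
    scores[name] = accuracy_score(y_test, model.predict(X_test))  # test
    print(f"{name}: {scores[name]:.2f}")
```

On the real Pima data the numbers would differ; the point is the shared fit/score loop that makes the candidates directly comparable.
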
Selected algorithm - Naive Bayes
--------------------------------
Simple - easy to understand;
Fast - up to 100X faster;
Stable to data changes;

Overview
----------
Lots of algorithms available

Selected based on
- Learning = Supervised
- Result = Binary classification
- Non-ensemble
- Basic