@@ -30,6 +30,8 @@ An orchestrated and repeatable pattern which systematically transforms and proce
30
30
4. Training the model;
31
31
5. Testing the model;
32
32
33
+ ------------------------------------------------------------------------------------------------------------------------
34
+
33
35
1. Asking the Right Question
34
36
-----------------------------
35
37
a. Define scope (including data sources);
@@ -45,7 +47,9 @@ d. Define how solution is created;
45
47
- Use the Machine Learning Workflow to process and transform Pima Indian data to create a prediction model. This model
46
48
must predict which people are likely to develop diabetes with 70% or greater accuracy.
47
49
48
- 2. Preparing data
50
+ ---------------------------------------------------------------------------------------------------------------------------
51
+
52
+ 2. Preparing data
49
53
---------------------
50
54
a. Tidy Data
51
55
- Tidy datasets are easy to manipulate, model and visualize, and have a specific structure:
@@ -64,6 +68,9 @@ Data Rule #2:
64
68
* Columns to eliminate - Not used, no values, duplicates;
65
69
* Correlated columns - Same information in different format, add little value, and cause algorithm to get confused;
66
70
* Molding Data - Adjusting data types, creating columns, if required;
71
+ * Dealing with missing data -
72
+ - Ignore it - Algorithms may fail;
73
+ - Impute it - update to "reasonable" values - Most frequent, Mean, Median, Expert reasonable value;
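The "impute it" option above can be sketched in a few lines. This is only an illustration using the Python standard library: the function name `impute`, the `missing` marker, and the sample readings are assumptions made for the example, not part of the notes or of the Pima dataset.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean", missing=None):
    """Replace entries equal to `missing` with a statistic of the observed values."""
    observed = [v for v in values if v != missing]
    if not observed:
        raise ValueError("no observed values to impute from")
    # Map each strategy named in the notes to a standard-library function.
    strategies = {"mean": mean, "median": median, "most_frequent": mode}
    fill = strategies[strategy](observed)
    return [fill if v == missing else v for v in values]

# Hypothetical column in which 0 stands for "value not recorded".
readings = [72, 0, 66, 64, 0, 74]
imputed = impute(readings, strategy="median", missing=0)
print(imputed)
```

An "expert reasonable value" would simply be a constant chosen by someone who knows the domain, so it needs no computation.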
67
74
68
75
Data Rule #3:
69
76
----------------
@@ -73,7 +80,83 @@ Data Rule #4:
73
80
--------------
74
81
Track how to manipulate data;
75
82
76
- 3.
83
+ -------------------------------------------------------------------------------------------------------------------------
84
+
85
+ 3. Selecting the algorithm:
86
+ ------------------------------
87
+ Role of the Algorithm
88
+ - fit the training set and predict on the real data;
89
+ - (fit()) training data -> Algorithm -> model;
90
+ - (predict()) real data -> Model -> result;
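The fit()/predict() flow above can be illustrated with a toy algorithm. This is a sketch under assumptions: `MajorityClassifier` is a hypothetical class written for this example, and the feature and label values are made up. It only mirrors the fit-then-predict pattern the notes describe.

```python
from collections import Counter

class MajorityClassifier:
    """Toy algorithm: always predicts the most common training label."""

    def fit(self, features, labels):
        # training data -> algorithm -> model
        # (here the "model" is just the single most frequent label)
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        # real data -> model -> result
        return [self.label_ for _ in features]

# Made-up training data: one feature per row, boolean labels.
train_features = [[1], [2], [3], [4]]
train_labels = [False, False, True, False]

model = MajorityClassifier().fit(train_features, train_labels)
result = model.predict([[5], [6]])
print(result)
```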
91
+
92
+ Over 50 algorithms
93
+ - algorithm selection
94
+ *. Compare factors;
95
+ *. Difference of opinions about which factors are important;
96
+ *. Develop your own factors;
97
+
98
+ Algorithm Decision Factors
99
+ --------------------------
100
+ i. Learning Type
101
+ ii. Result
102
+ iii. Complexity
103
+ iv. Basic vs Enhanced
104
+
105
+ i. Learning Type:
106
+ "Use the Machine Learning Workflow to process and transform Pima Indian data to create a "prediction model". This model must
107
+ predict which people are likely to develop diabetes with 70% or greater accuracy."
108
+
109
+ -> Prediction Model => Supervised machine learning;
110
+ Over 28 algorithms
111
+
112
+ ii. Result
113
+ a. Regression - continuous values;
114
+ b. Classification - discrete values;
115
+
116
+ "Use the Machine Learning Workflow to process and transform Pima Indian data to create a prediction model. This model must
117
+ "predict which people are likely to develop diabetes" with 70% or greater accuracy."
118
+
119
+ - Diabetes
120
+ - Binary (True/False)
121
+ - Algorithm must support classification - Binary classification;
122
+ ** Over 20 algorithms;
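For a binary (True/False) result, the "70% or greater accuracy" target in the problem statement can be checked as the fraction of correct predictions. The sketch below assumes that plain definition of accuracy; the labels and predictions are invented for illustration.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Made-up labels: True = develops diabetes, False = does not.
actual    = [True, False, False, True, False, True, False, False, True, False]
predicted = [True, False, True,  True, False, False, False, False, True, False]

score = accuracy(actual, predicted)
meets_target = score >= 0.70
print(f"accuracy = {score:.0%}, meets 70% target: {meets_target}")
```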
123
+
124
+ iii. Complexity
125
+ - Keep it simple;
126
+ - Eliminate ensemble algorithms - container algorithms with multiple child algorithms; they boost performance but can be difficult to debug;
127
+ ** Over 14 algorithms;
128
+
129
+ iv. Enhanced vs. Basic
130
+ - Enhanced - variation of basic, performance improvements, additional functionality, more complex;
131
+ - Basic - simpler, easier to understand;
132
+
133
+ Candidate Algorithms
134
+ --------------------
135
+ a. Naive Bayes;
136
+ b. Logistic Regression;
137
+ c. Decision Tree;
138
+
139
+ a. Naive Bayes - Based on likelihood and probability; every feature has same weight; requires smaller amount of data;
140
+ b. Logistic Regression - Binary result, relationships between features are weighted;
141
+ c. Decision Tree - Binary tree, node contains decision, requires enough data to determine nodes and splits;
142
+
143
+ Selected algorithm - Naive Bayes
144
+ -------------------------------
145
+ Simple - easy to understand;
146
+ Fast - up to 100X faster;
147
+ Stable to data changes;
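To make "based on likelihood and probability; every feature has same weight" concrete, here is a minimal Gaussian Naive Bayes written from scratch. It is a sketch, not the implementation the notes go on to use: `TinyGaussianNB`, the small variance floor, and the two-feature sample rows are all assumptions for the example.

```python
import math
from collections import defaultdict

class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes for numeric features (illustrative only)."""

    def fit(self, rows, labels):
        grouped = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        self.priors_ = {}
        self.stats_ = {}
        for label, members in grouped.items():
            # Prior probability of the class.
            self.priors_[label] = len(members) / len(rows)
            # Per-feature mean and variance within the class.
            self.stats_[label] = []
            for column in zip(*members):
                mu = sum(column) / len(column)
                var = sum((x - mu) ** 2 for x in column) / len(column)
                # Small floor (an assumption) avoids division by zero.
                self.stats_[label].append((mu, var + 1e-9))
        return self

    def _log_score(self, row, label):
        # log prior + sum of per-feature Gaussian log likelihoods;
        # every feature contributes independently and with equal weight.
        score = math.log(self.priors_[label])
        for x, (mu, var) in zip(row, self.stats_[label]):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return score

    def predict(self, rows):
        return [
            max(self.priors_, key=lambda label: self._log_score(row, label))
            for row in rows
        ]

# Made-up rows with two numeric features each; True = diabetic.
rows = [[85, 22.0], [90, 24.5], [95, 23.0], [150, 33.0], [160, 35.5], [170, 31.0]]
labels = [False, False, False, True, True, True]

nb = TinyGaussianNB().fit(rows, labels)
predictions = nb.predict([[88, 23.0], [165, 34.0]])
print(predictions)
```

Working in log space keeps the product of many small probabilities from underflowing, which is why the scores are summed rather than multiplied.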
148
+
149
+ Overview
150
+ ----------
151
+ Lots of algorithms available
152
+
153
+ Selected based on
154
+ - Learning = Supervised
155
+ - Result = Binary classification
156
+ - Non-ensemble
157
+ - Basic
158
+
159
+
77
160
78
161
79
162