| 1 | +Machine Learning |
| 2 | +------------------ |
| 3 | +Building a model from example inputs to make data-driven predictions, rather than following strictly static program instructions. |
| 4 | +Applications: |
| 5 | + |
| 6 | +1. Is this email spam? |
| 7 | +2. How can cars drive themselves? |
| 8 | +3. What will people buy? |
| 9 | + |
| 10 | +Machine Learning |
| 11 | +----------------- |
| 12 | +2 categories |
| 13 | +a. Supervised; |
| 14 | + - Value Prediction |
| 15 | + - Needs training data containing the value being predicted; the trained model then predicts that value for new data; |
| 16 | +b. Unsupervised; |
| 17 | + - Identify clusters of like data; |
| 18 | + - Data does not contain cluster membership, but model provides access to data by cluster; |
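
The two categories above can be contrasted with a minimal sketch in plain Python. It is an illustration only: the function names `predict_nearest` and `two_means` and the sample numbers are made up for this example, and the two techniques (nearest neighbour, k-means with k=2 on one-dimensional values) are stand-ins chosen for brevity, not methods named in these notes.

```python
def predict_nearest(train, x):
    """Supervised: `train` holds (feature, label) pairs, so the
    value being predicted is present in the training data."""
    _, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def two_means(values, iterations=10):
    """Unsupervised: the data carries no cluster membership; the
    model groups like values into two clusters (k-means, k=2)."""
    centers = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[idx].append(v)
        # Recompute each center; keep the old one if a cluster is empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


# Made-up sample data, for illustration only.
train = [(1.0, "low"), (1.2, "low"), (5.0, "high"), (5.3, "high")]
label = predict_nearest(train, 4.6)

clusters = two_means([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
```

The supervised function needs labels in `train`; the unsupervised one receives only raw values and returns them grouped by cluster.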
| 19 | + |
| 20 | +url -> https://www.continuum.io/downloads |
| 21 | + |
| 22 | + |
| 23 | +Machine Learning Workflow: |
| 24 | +-------------------------- |
| 25 | +An orchestrated and repeatable pattern which systematically transforms and processes information to create prediction solutions. |
| 26 | + |
| 27 | +1. Asking the right question; |
| 28 | +2. Preparing data; |
| 29 | +3. Selecting the algorithm; |
| 30 | +4. Training the model; |
| 31 | +5. Testing the model; |
| 32 | + |
| 33 | +1. Asking the Right Question |
| 34 | +----------------------------- |
| 35 | +a. Define scope (including data sources); |
| 36 | + - Using Pima Indian Diabetes data, predict which people will develop diabetes. |
| 37 | + |
| 38 | +b. Define target performance; |
| 39 | + - Using Pima Indian Diabetes data, predict with 70% or greater accuracy which people will develop diabetes. |
| 40 | + |
| 41 | +c. Define context for usage; |
| 42 | + - Using Pima Indian Diabetes data, predict with 70% or greater accuracy which people are likely to develop diabetes. |
| 43 | + |
| 44 | +d. Define how solution is created; |
| 45 | + - Use the Machine Learning Workflow to process and transform Pima Indian data to create a prediction model. This model |
| 46 | + must predict which people are likely to develop diabetes with 70% or greater accuracy. |
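
The 70% target in the problem statement can be checked mechanically once a model produces predictions. The sketch below assumes binary labels (1 = develops diabetes, 0 = does not); the helper names `accuracy` and `meets_target` and the sample label lists are hypothetical, not taken from the Pima data.

```python
TARGET_ACCURACY = 0.70  # target performance from the problem statement


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def meets_target(predicted, actual, target=TARGET_ACCURACY):
    """True when the model reaches the agreed target accuracy."""
    return accuracy(predicted, actual) >= target


# Made-up labels for ten people, for illustration only.
actual = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

score = accuracy(predicted, actual)
ok = meets_target(predicted, actual)
```

Fixing the threshold in the question up front gives step 5 (testing the model) a concrete pass or fail condition.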
| 47 | + |
| 48 | + 2. Preparing data |
| 49 | + --------------------- |
| 50 | + a. Tidy Data |
| 51 | + - Tidy datasets are easy to manipulate, model, and visualize, and have a specific structure: |
| 52 | + * each variable is a column; |
| 53 | + * each observation is a row; |
| 54 | + * each type of observational unit is a table; |
| 55 | + ** 50 - 80% of an ML project is spent getting, cleaning, and organizing data; |
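
As a rough illustration of the tidy structure described above, the sketch below reshapes a "wide" table, where one variable is spread over several columns, into one row per observation. The column names (`patient_id`, `glucose_visit1`, ...) and all values are invented for this example, and `to_tidy` is a hypothetical helper.

```python
# Wide layout: the glucose variable is spread across two columns.
# All names and numbers here are made up for illustration.
wide_rows = [
    {"patient_id": 1, "glucose_visit1": 148, "glucose_visit2": 151},
    {"patient_id": 2, "glucose_visit1": 85, "glucose_visit2": 89},
]


def to_tidy(rows, id_key, prefix, value_name):
    """Return one row per observation: each variable becomes a column."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key.startswith(prefix):
                tidy.append({
                    id_key: row[id_key],
                    "visit": int(key[len(prefix):]),  # visit number from the column name
                    value_name: value,
                })
    return tidy


tidy_rows = to_tidy(wide_rows, "patient_id", "glucose_visit", "glucose")
```

After the reshape, `visit` and `glucose` are each a single column, and every measurement is its own row.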
| 56 | + |
| 57 | +Data Rule #1: |
| 58 | +--------------- |
| 59 | +- The closer the data is to what you are predicting, the better; |
| 60 | + |
| 61 | +Data Rule #2: |
| 62 | +-------------- |
| 63 | +- Data will never be in the format you need; |
| 64 | +* Columns to eliminate - Not used, no values, duplicates; |
| 65 | +* Correlated columns - Same information in a different format; they add little value and can confuse the algorithm; |
| 66 | +* Molding Data - Adjusting data types and creating columns, if required; |
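
The column checks in Data Rule #2 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the table is a dict of column name to list of values, the column names and numbers are invented, `clean_columns` is a hypothetical helper, and the 0.999 correlation threshold is an arbitrary choice for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def is_numeric(values):
    return all(isinstance(v, (int, float)) for v in values)


def clean_columns(table, threshold=0.999):
    """Drop columns with no values, duplicates, and near-perfectly
    correlated columns, keeping the first column of each group."""
    kept = {}
    for name, values in table.items():
        if all(v is None for v in values):
            continue  # no values
        redundant = False
        for other in kept.values():
            if values == other:
                redundant = True  # exact duplicate
            elif (is_numeric(values) and is_numeric(other)
                  and abs(pearson(values, other)) >= threshold):
                redundant = True  # same information in a different format
        if not redundant:
            kept[name] = values
    return kept


# Made-up table, for illustration only.
table = {
    "thickness_mm": [10.0, 20.0, 30.0, 40.0],
    "thickness_in": [0.3937, 0.7874, 1.1811, 1.5748],  # same data in inches
    "age": [50, 31, 32, 21],
    "age_copy": [50, 31, 32, 21],
    "notes": [None, None, None, None],
}

cleaned = clean_columns(table)
```

Only `thickness_mm` and `age` survive: the inches column repeats the millimetre column in another unit, `age_copy` is a duplicate, and `notes` has no values.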
| 67 | + |
| 68 | +Data Rule #3: |
| 69 | +---------------- |
| 70 | +Accurately predicting rare events is difficult; |
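
One way to see why rare events are hard, sketched here with made-up labels: when only a small share of rows belongs to the rare class, a model that always predicts the common class already scores a high accuracy while never finding a single rare event. The helper names `class_shares` and `majority_baseline_accuracy` are hypothetical.

```python
from collections import Counter


def class_shares(labels):
    """Share of rows belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    return max(class_shares(labels).values())


# Made-up labels: 5 rare events (1) among 100 rows.
labels = [0] * 95 + [1] * 5

shares = class_shares(labels)
baseline = majority_baseline_accuracy(labels)
```

Checking the class shares before training shows whether there are enough examples of the event to learn from, and gives a baseline that any trained model should beat.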
| 71 | + |
| 72 | +Data Rule #4: |
| 73 | +-------------- |
| 74 | +Track how you manipulate data; |
| 75 | + |
| 76 | +3. |
| 77 | + |
| 78 | + |
| 79 | + |