|
2 | 2 |
|
3 | 3 | Submit source code and write-up (including program output) through Sakai.
|
4 | 4 |
|
5 | | -## Instructions |
| 5 | +## Background |
6 | 6 |
|
| 7 | +A bunch of your friends really like wine, specifically Portuguese wine. One |
| 8 | +night you all stay up debating which physicochemical aspects of wine |
| 9 | +(like pH or acidity) make good wine. Being the sleuth you are, you find out |
| 10 | +that there happens to be [a study and dataset][wine] looking at just this! |
7 | 11 |
|
| 12 | +You are conveniently learning Python and bash scripting, and figure this may |
| 13 | +be a good opportunity to provide some evidence for what may be contributing to |
| 14 | +good wine. |
8 | 15 |
|
9 | | -## Deliverables |
| 16 | +[wine]: http://archive.ics.uci.edu/ml/datasets/Wine+Quality |
| 17 | + |
| 18 | + |
| 19 | +## Problem |
| 20 | + |
| 21 | +The study you reference looked at both red and white wine and you want to find |
| 22 | +out what makes good red and white wine. You wish to conduct a very simple |
| 23 | +analysis. |
| 24 | + |
| 25 | + |
| 26 | +## Instructions |
| 27 | + |
| 28 | +Create a bash script that automates your entire data acquisition and |
| 29 | +analysis so that your results can be faithfully reproduced. Your analysis will |
| 30 | +contain Python scripts as well. |
| 31 | + |
| 32 | + |
| 33 | +**Download Data** |
| 34 | + |
| 35 | +Use wget or cURL to help [download the data][data]. |
| 36 | + |
| 37 | +| Wine Type | File Name | |
| 38 | +|-----------|-------------------------| |
| 39 | +| Red | `winequality-red.csv` | |
| 40 | +| White | `winequality-white.csv` | |
| 41 | + |
| 42 | +Download these data into a directory named `download`. |
| 43 | + |
| 44 | +**Hint**: Use `mkdir -p` to create a directory if it doesn't exist yet. |
| 45 | + |
| 46 | +[data]: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/ |
| 47 | + |
| 48 | + |
| 49 | +**Convert Data** |
| 50 | + |
| 51 | +You like to deal with comma-separated values (CSV) files. Unfortunately, you |
| 52 | +find out that the data come in semicolon-separated files. |
| 53 | + |
| 54 | +Use `sed` to convert these semicolon-separated files into comma-separated |
| 55 | +files. |
| 56 | + |
| 57 | +Save these converted data into the directory `data`. |
| 58 | + |
| 59 | + |
| 60 | +**Subset Data** |
| 61 | + |
| 62 | +For your analysis you only want to check a few physicochemical variables. |
| 63 | +There are a total of 12 variables, but you're only interested in: |
| 64 | + |
| 65 | +- Citric acid |
| 66 | +- Chlorides |
| 67 | +- pH |
| 68 | +- Alcohol |
| 69 | +- Quality (your outcome) |
| 70 | + |
| 71 | +In addition to keeping only these variables, you want to split the wine into |
| 72 | +good and poor quality. Create four datasets, using a quality score of 5 as the |
| 73 | +cutoff (above 5 is good; 5 or below is poor). |
| 74 | + |
| 75 | +| File Name | Quality Threshold | Description | |
| 76 | +|-----------------------|-------------------|-------------------------| |
| 77 | +| `red_wine_poor.csv` | <= 5 | Poor quality red wine | |
| 78 | +| `red_wine_good.csv` | > 5 | Good quality red wine | |
| 79 | +| `white_wine_poor.csv` | <= 5 | Poor quality white wine | |
| 80 | +| `white_wine_good.csv` | > 5 | Good quality white wine | |
| 81 | + |
| 82 | +Put these four files into the `data` directory. |
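
The subsetting step above could be sketched in Python as follows. This is only one possible approach, assuming the converted file has a header row with the dataset's column names (`citric acid`, `chlorides`, `pH`, `alcohol`, `quality`). The function name `subset_wine` and its `keep_row` parameter are hypothetical and not part of the assignment.

```python
import csv

# Columns to keep; names assumed to match the header row of the converted CSV.
COLUMNS = ["citric acid", "chlorides", "pH", "alcohol", "quality"]


def subset_wine(src_path, dest_path, keep_row):
    """Write only COLUMNS for the rows whose quality satisfies keep_row.

    keep_row is a function that takes the integer quality score and
    returns True if the row should be kept.
    """
    with open(src_path, newline="") as src, open(dest_path, "w", newline="") as dest:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dest, fieldnames=COLUMNS)
        writer.writeheader()
        for row in reader:
            if keep_row(int(row["quality"])):
                writer.writerow({name: row[name] for name in COLUMNS})
```

For example, `subset_wine(src, "data/red_wine_good.csv", lambda q: q > 5)` would produce the good red wine file, where `src` is the path of your converted red wine CSV, and `lambda q: q <= 5` would produce the poor one.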
| 83 | + |
| 84 | + |
| 85 | +**Compare Low and High Quality** |
10 | 86 |
|
| 87 | +Let's use Python to help us figure out what makes wine good or not. |
| 88 | + |
| 89 | +Create a Python function to read in data from a given path and calculate the |
| 90 | +average value of a given variable name. |
| 91 | + |
| 92 | +```python |
| 93 | +# Example use |
| 94 | +avg_chloride_results = calculate_avg_value(data, "chlorides") |
| 95 | +``` |
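
A minimal sketch of such a function, assuming the first argument is the path to a comma-separated file with a header row and that the requested column holds numeric values. Using the standard-library `csv` module is a choice made here for illustration; the assignment does not prescribe how the file is read.

```python
import csv


def calculate_avg_value(path, variable):
    """Return the mean of the column named `variable` in the CSV at `path`."""
    with open(path, newline="") as f:
        values = [float(row[variable]) for row in csv.DictReader(f)]
    if not values:
        # An empty file has no meaningful average, so fail loudly.
        raise ValueError(f"no rows found in {path}")
    return sum(values) / len(values)
```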
| 96 | + |
| 97 | +You want to be lazy and automate as much as possible. So let's create a Python |
| 98 | +function that takes in a list of the file names and returns a dictionary. |
| 99 | + |
| 100 | +The dictionary will have four keys, each being a file name without its |
| 101 | +extension, e.g. the key for `white_wine_good.csv` will be `white_wine_good`. The |
| 102 | +value for each key will be another dictionary that maps each of the four |
| 103 | +variables we're interested in to its average value: |
| 104 | + |
| 105 | +- Citric acid |
| 106 | +- Chlorides |
| 107 | +- pH |
| 108 | +- Alcohol |
| 109 | + |
| 110 | +```python |
| 111 | +wine_paths = ["white_wine_good.csv", ...] |
| 112 | +avg_values = find_average_wines(wine_paths) |
| 113 | +``` |
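
One way the nested dictionary could be built is sketched below. It assumes the inner dictionaries are keyed by the column names as they appear in the CSV header, and it repeats a small `calculate_avg_value` helper so the block runs on its own. The `VARIABLES` constant is an illustrative name, not something the assignment requires.

```python
import csv
from pathlib import Path

# Column names assumed to match the header row of the subset files.
VARIABLES = ["citric acid", "chlorides", "pH", "alcohol"]


def calculate_avg_value(path, variable):
    """Return the mean of one numeric column in a CSV file."""
    with open(path, newline="") as f:
        values = [float(row[variable]) for row in csv.DictReader(f)]
    return sum(values) / len(values)


def find_average_wines(wine_paths):
    """Map each file's base name (no extension) to its per-variable averages."""
    results = {}
    for path in wine_paths:
        key = Path(path).stem  # "data/white_wine_good.csv" -> "white_wine_good"
        # Re-reads the file once per variable; simple, and fine for small files.
        results[key] = {var: calculate_avg_value(path, var) for var in VARIABLES}
    return results
```

The result is shaped like `{"white_wine_good": {"citric acid": ..., "chlorides": ..., "pH": ..., "alcohol": ...}, ...}`, with one outer key per input file.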
| 114 | + |
| 115 | + |
| 116 | +**Save Results** |
| 117 | + |
| 118 | + |
| 119 | +Write a Python function to save your dictionary of results to four separate |
| 120 | +files. Save your dictionaries as JavaScript Object Notation (JSON) files. |
| 121 | + |
| 122 | +Use Python's built-in `json` module. Here's a hint on using it. |
| 123 | + |
| 124 | +```python |
| 125 | +import json |
| 126 | + |
| 127 | +your_dictionary = {"some_date": "date"} |
| 128 | +# The with block closes the file even if writing fails |
| 129 | +with open("destFile.json", "w") as f: |
| 130 | +    json.dump(your_dictionary, f) |
| 131 | +``` |
| 132 | + |
| 133 | +Save your four results into a directory `results`. |
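
Saving one JSON file per wine category might look like the sketch below. It assumes the results dictionary has one key per input file (for example `white_wine_good`) and that each output file is named after its key. The function name `save_results` and the `out_dir` parameter are hypothetical.

```python
import json
import os


def save_results(avg_values, out_dir="results"):
    """Write each inner dictionary of avg_values to <out_dir>/<key>.json."""
    os.makedirs(out_dir, exist_ok=True)  # create the directory if it is missing
    for name, averages in avg_values.items():
        out_path = os.path.join(out_dir, f"{name}.json")
        with open(out_path, "w") as f:
            json.dump(averages, f, indent=2)
```

With four keys in `avg_values`, this writes four files such as `results/white_wine_good.json`.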
| 134 | + |
| 135 | + |
| 136 | +**Challenge** |
| 137 | + |
| 138 | +You want to automate as much as possible, so create a Makefile that runs the |
| 139 | +entire analysis and cleans up after it. |
| 140 | + |
| 141 | + |
| 142 | +```shell |
| 143 | +# Run the entire analysis |
| 144 | +make all |
| 145 | + |
| 146 | +# Remove all downloaded and intermediate files from data/, download/, results/ |
| 147 | +make clean |
| 148 | +``` |
| 149 | + |
| 150 | + |
| 151 | +## Homework File Structure |
| 152 | + |
| 153 | +To keep things organized, please use the following structure for your data |
| 154 | +analysis. |
| 155 | + |
| 156 | +``` |
| 157 | +. |
| 158 | +|-- analyze_wine.py |
| 159 | +|-- analysis.sh |
| 160 | +|-- data/ |
| 161 | +|-- results/ |
| 162 | +`-- download/ |
| 163 | +``` |
| 164 | + |
| 165 | + |
| 166 | +## Deliverables |
11 | 167 |
|
| 168 | +- A single bash script to automate your analysis |
| 169 | +- A Python script to calculate the average citric acid, chlorides, pH, and |
| 170 | + alcohol values of good and poor quality red and white wine. |