| 1 | +--- |
| 2 | +title: Data Collection is Hard. You Should Try It. |
| 3 | +excerpt: "No, seriously." |
| 4 | +tags: |
| 5 | + - data |
| 6 | +header: |
| 7 | + overlay_image: /assets/images/cool-backgrounds/cool-background9.png |
| 8 | + overlay_filter: 0.1 |
| 9 | + caption: 'Photo credit: [coolbackgrounds.io](https://coolbackgrounds.io/)' |
| 10 | +last_modified_at: 2022-03-03 |
| 11 | +--- |
| 12 | + |
| 13 | +For people who make careers out of data, data scientists don't have *nearly* |
| 14 | +enough experience in data collection, and many data scientists don’t even seem |
| 15 | +to feel the need to develop experience collecting data. |
| 16 | + |
| 17 | +Puzzlingly, this trend doesn’t seem to be true of other forms of unglamorous |
| 18 | +data work like data cleaning (where people generally accept that [data cleaning |
| 19 | +is not grunt |
| 20 | +work](https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt)). |
| 21 | + |
| 22 | +With this blog post I want to give a defense of data collection — not as an |
| 23 | +activity that’s inherently worthwhile pursuing (I assume data scientists don’t |
| 24 | +need to be convinced of that!), but as something that is worth doing even for |
| 25 | +*selfish* reasons. Why should you spend time learning about that data |
| 26 | +collection system that's being maintained by that other team at work? Why |
| 27 | +should you consider collecting some data for your next side project? What's in |
| 28 | +it for _you_? |
| 29 | + |
| 30 | +Throughout this blog post, I’ll be making comparisons to a recent project of |
| 31 | +mine, [`cryptics.georgeho.org`](https://cryptics.georgeho.org/), a dataset of |
| 32 | +cryptic crossword clues. |
| 33 | + |
| 34 | +## Learn Data-Adjacent Technologies |
| 35 | + |
| 36 | +The most obvious reason is that data collection is a fantastic opportunity to |
| 37 | +familiarize yourself with many staple technologies in data, and there aren't |
| 38 | +that many side projects that run the entire data tech stack! |
| 39 | + |
| 40 | +To enumerate: |
| 41 | + |
| 42 | +- Compute services |
| 43 | + - Your data collection pipelines will obviously need to run somewhere. Will |
| 44 | + that be in the cloud, or on your local computer? How do you think about |
| 45 | + trading off cost, compute and convenience? |
| 46 | + - I ran most of my web scraping on DigitalOcean Droplets, but I could just |
| 47 | + as easily have taken the opportunity to learn more about cloud compute |
| 48 | + solutions or serverless functions like AWS EC2 or Lambda. These days, the |
| 49 | + project runs incremental scrapes entirely on my laptop. |
| 50 | +- Data storage |
| 51 | + - You’ll need to store your data somewhere, whether it be a relational or |
| 52 | + NoSQL database, or just flat files. Since your data will outlive any code |
| 53 | + you write, careful design of the data storage solution and schema will |
| 54 | + pay dividends in the long run. |
| 55 | + - I used SQLite for its simplicity and performance. However, as the scope |
| 56 | + of the project expanded, I had to redesign the schema multiple times, |
| 57 | + which was painful. |
| 58 | +- Labeling, annotation or other data transformations |
| 59 | + - After collecting your data, you may want to label, annotate or otherwise |
| 60 | + structure or transform your data. For example, perhaps you’ll want to |
| 61 | + pull structured tabular data out of unstructured PDFs or HTML tag soups; |
| 62 | + another example might be to have a human label the data. |
| 63 | + - This is the main "value-add" of your dataset — while the time and effort |
| 64 | + required to collect and store the data constitutes a moat, ultimately |
| 65 | + what will distinguish your dataset to *users* will be the transformations |
| 66 | + done here. |
| 67 | + - For me, this involved a lot of `BeautifulSoup` to parse structured data |
| 68 | + out of HTML pages. This required a [significant amount of development and |
| 69 | + engineering |
| 70 | + effort](https://cryptics.georgeho.org/datasheet#collection-process). |
| 71 | +- Data licensing and copyright |
| 72 | + - Once you have your dataset, what is the legality of licensing, sharing or |
| 73 | + even selling your data? The legality of data is a huge grey area |
| 74 | + (especially if there’s any web scraping involved), and while navigating |
| 75 | + these waters will be tricky, it’s instructive to learn about it. |
| 76 | + - I feel like the collection and structuring of cryptic crossword clues for |
| 77 | + academic/archival purposes was fair use, and so didn’t worry too much |
| 78 | + about the legality of my project — but it was an educational rabbit hole |
| 79 | + to fall down! |
| 80 | +- Sharing and publishing data |
| 81 | + - The legal nuances of data aside, the technical problem of sharing data is |
| 82 | + pretty tricky! |
| 83 | + - This problem sits at the intersection of MLOps and information design: |
| 84 | + you want to share the data in a standardized way, while having an |
| 85 | + interface that makes it easy for users to explore your data. Serving a |
| 86 | + tarball on a web server technically works, but leaves so much on the |
| 87 | + table. |
| 88 | + - `cryptics.georgeho.org` uses [Datasette](https://datasette.io/), which I |
| 89 | + can’t recommend highly enough. |
| 90 | +- Writing documentation |
| 91 | + - If you think it’s hard to write and maintain good documentation for |
| 92 | + software, imagine how difficult it must be to do the same for data, which |
| 93 | + outlives software and is much harder to both create and version control. |
| 94 | + - I’ve found [Gebru et al.’s Datasheets for |
| 95 | + Datasets](https://arxiv.org/abs/1803.09010) to be an excellent template |
| 96 | + for documenting data. |
| 97 | + |
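To make the storage and transformation bullets above a little more concrete, here is a minimal sketch of the "parse HTML, then store it in SQLite" step. It is not the code behind `cryptics.georgeho.org`: the markup, the clue text, the `ClueParser` class, the `store` helper and the `clues` table are all made up for illustration, and it uses the standard library's `html.parser` in place of `BeautifulSoup` so that it runs without any installs.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup and made-up clues: real scraped pages are far messier.
SAMPLE_HTML = """
<ul>
  <li class="clue"><span class="text">First made-up clue (5)</span><span class="answer">ALPHA</span></li>
  <li class="clue"><span class="text">Second made-up clue (4)</span><span class="answer">BETA</span></li>
</ul>
"""


class ClueParser(HTMLParser):
    """Collects (clue, answer) pairs from <li class="clue"> items."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None  # dict for the <li> being parsed, if any
        self._field = None    # which field incoming text belongs to

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "clue" in classes:
            self._current = {"clue": "", "answer": ""}
        elif tag == "span" and "text" in classes:
            self._field = "clue"
        elif tag == "span" and "answer" in classes:
            self._field = "answer"

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            row = (self._current["clue"].strip(), self._current["answer"].strip())
            self.rows.append(row)
            self._current = None

    def handle_data(self, data):
        # Only keep text that sits inside a recognised span of a clue item.
        if self._current is not None and self._field is not None:
            self._current[self._field] += data


def store(rows, conn):
    """Insert rows, skipping any (clue, answer) pair that is already stored."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS clues (
            id INTEGER PRIMARY KEY,
            clue TEXT NOT NULL,
            answer TEXT NOT NULL,
            UNIQUE (clue, answer)
        )
        """
    )
    conn.executemany(
        "INSERT OR IGNORE INTO clues (clue, answer) VALUES (?, ?)", rows
    )
    conn.commit()


parser = ClueParser()
parser.feed(SAMPLE_HTML)

conn = sqlite3.connect(":memory:")
store(parser.rows, conn)
store(parser.rows, conn)  # a repeated scrape should not create duplicates
count = conn.execute("SELECT COUNT(*) FROM clues").fetchone()[0]
```

The `UNIQUE` constraint combined with `INSERT OR IGNORE` is one way to keep repeated, incremental scrapes from piling up duplicate rows. Connecting to a file instead of `:memory:` would leave behind a SQLite database file, which is the kind of artifact Datasette is built to publish.
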
| 98 | +## Design a Data Collection System |
| 99 | + |
| 100 | +More importantly, starting a small data collection project is a great way to |
| 101 | +get experience designing an entire data pipeline from end to end. This kind of |
| 102 | +opportunity doesn't come easily (even in industry!), and while your data |
| 103 | +pipeline won't be as sophisticated as the kinds you'll find in data companies, |
| 104 | +you'll be able to take away some valuable lessons from it. |
| 105 | + |