Commit 2c158d2

Draft
1 parent 8fdfb6e commit 2c158d2

1 file changed

+105
-0
lines changed

@@ -0,0 +1,105 @@
---
title: Data Collection is Hard. You Should Try It.
excerpt: "No, seriously."
tags:
  - data
header:
  overlay_image: /assets/images/cool-backgrounds/cool-background9.png
  overlay_filter: 0.1
  caption: 'Photo credit: [coolbackgrounds.io](https://coolbackgrounds.io/)'
last_modified_at: 2022-03-03
---

For people who make careers out of data, data scientists don't have *nearly*
enough experience in data collection, and many data scientists don’t even seem
to feel the need to develop experience collecting data.

Puzzlingly, this attitude doesn’t seem to extend to other forms of unglamorous
data work like data cleaning (where people generally accept that [data cleaning
is not grunt
work](https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt)).

With this blog post I want to give a defense of data collection: not as an
activity that’s inherently worth pursuing (I assume data scientists don’t
need to be convinced of that!), but as something that is worth doing even for
*selfish* reasons. Why should you spend time learning about the data
collection system that's being maintained by that other team at work? Why
should you consider collecting some data for your next side project? What's in
it for _you_?

Throughout this blog post, I’ll be making comparisons to a recent project of
mine, [`cryptics.georgeho.org`](https://cryptics.georgeho.org/), a dataset of
cryptic crossword clues.

## Learn Data-Adjacent Technologies

The most obvious reason is that data collection is a fantastic opportunity to
familiarize yourself with many staple technologies in data, and there aren't
that many side projects that run the entire data tech stack!

To enumerate:

- Compute services
  - Your data collection pipelines will obviously need to run somewhere. Will
    that be in the cloud, or on your local computer? How do you think about
    trading off cost, compute and convenience?
  - I ran most of my web scraping on DigitalOcean Droplets, but I could just
    as easily have taken the opportunity to learn more about cloud compute
    or serverless offerings like AWS EC2 or Lambda. These days, the
    project runs incremental scrapes entirely on my laptop.
- Data storage
  - You’ll need to store your data somewhere, whether it be a relational or
    NoSQL database, or just flat files. Since your data will outlive any code
    you write, careful design of the data storage solution and schema will
    pay dividends in the long run.
  - I used SQLite for its simplicity and performance. However, as the scope
    of the project expanded, I had to redesign the schema multiple times,
    which was painful.
- Labeling, annotation or other data transformations
  - After collecting your data, you may want to label, annotate or otherwise
    structure or transform your data. For example, perhaps you’ll want to
    pull structured tabular data out of unstructured PDFs or HTML tag soups;
    another example might be to have a human label the data.
  - This is the main "value-add" of your dataset: while the time and effort
    required to collect and store the data constitutes a moat, ultimately
    what will distinguish your dataset to *users* will be the transformations
    done here.
  - For me, this involved a lot of `BeautifulSoup` to parse structured data
    out of HTML pages. This required a [significant amount of development and
    engineering
    effort](https://cryptics.georgeho.org/datasheet#collection-process).
- Data licensing and copyright
  - Once you have your dataset, what is the legality of licensing, sharing or
    even selling your data? The legality of data is a huge grey area
    (especially if there’s any web scraping involved), and while navigating
    these waters will be tricky, it’s instructive to learn about it.
  - I feel like the collection and structuring of cryptic crossword clues for
    academic/archival purposes was fair use, and so didn’t worry too much
    about the legality of my project, but it was an educational rabbit hole
    to fall down!
- Sharing and publishing data
  - The legal nuances of data aside, the technical problem of sharing data is
    pretty tricky!
  - This problem sits at the intersection of MLOps and information design:
    you want to share the data in a standardized way, while having an
    interface that makes it easy for users to explore your data. Serving a
    tarball on a web server technically works, but leaves so much on the
    table.
  - `cryptics.georgeho.org` uses [Datasette](https://datasette.io/), which I
    can’t recommend highly enough.
- Writing documentation
  - If you think it’s hard to write and maintain good documentation for
    software, imagine how difficult it must be to do the same for data, which
    outlives software and is much harder to both create and version control.
  - I’ve found [Gebru et al.’s Datasheets for
    Datasets](https://arxiv.org/abs/1803.09010) to be an excellent template
    for documenting data.
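
To make the storage and transformation bullets a little more concrete, here is
a minimal sketch of pulling structured rows out of HTML and storing them in
SQLite. It is *not* the code behind `cryptics.georgeho.org`: the sample HTML,
the `clues` table and the names `ClueParser`, `parse_clues` and `store` are all
hypothetical, and it uses the standard library's `html.parser` in place of
`BeautifulSoup` purely to stay dependency-free.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup: real-world pages are far messier than this.
SAMPLE_HTML = """
<table>
  <tr><td class="clue">Example clue one (5)</td><td class="answer">ALPHA</td></tr>
  <tr><td class="clue">Example clue two (4)</td><td class="answer">BETA</td></tr>
</table>
"""


class ClueParser(HTMLParser):
    """Collect (clue, answer) pairs from <td class="clue"/"answer"> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished (clue, answer) tuples
        self._field = None   # name of the cell we are currently inside, if any
        self._current = {}   # cells collected for the row in progress

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "td" and cls in ("clue", "answer"):
            self._field = cls
            self._current[cls] = ""

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._field is not None:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr":
            # Only keep rows where both cells were found.
            if "clue" in self._current and "answer" in self._current:
                self.rows.append(
                    (self._current["clue"].strip(), self._current["answer"].strip())
                )
            self._current = {}


def parse_clues(html):
    parser = ClueParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def store(rows, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clues ("
        " clue TEXT NOT NULL,"
        " answer TEXT NOT NULL,"
        " UNIQUE (clue, answer))"
    )
    # INSERT OR IGNORE skips rows that already exist, so re-running a scrape
    # does not create duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO clues (clue, answer) VALUES (?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
rows = parse_clues(SAMPLE_HTML)
store(rows, conn)
store(rows, conn)  # a second, "incremental" run adds nothing new
count = conn.execute("SELECT COUNT(*) FROM clues").fetchone()[0]
```

The `UNIQUE` constraint is one possible design choice, not a recommendation for
every dataset: it keeps repeated scrapes idempotent, but it also bakes an
assumption about what counts as a duplicate into the schema, which is exactly
the kind of decision that is painful to revisit later.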

## Design a Data Collection System

More importantly, starting a small data collection project is a great way to
get experience designing an entire data pipeline from end to end. This kind of
opportunity doesn't come easily (even in industry!), and while your data
pipeline won't be as sophisticated as the kinds you'll find in data companies,
you'll be able to take away some valuable lessons from it.
