Commit 2c158d2

committed

Draft

1 parent 8fdfb6e commit 2c158d2

File tree

1 file changed: +105 -0 lines changed

_drafts
- 2022-03-03-data-collection-is-hard.md

`‎_drafts/2022-03-03-data-collection-is-hard.md‎`

Lines changed: 105 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,105 @@`
	`1`	`+---`
	`2`	`+title: Data Collection is Hard. You Should Try It.`
	`3`	`+excerpt: "No, seriously."`
	`4`	`+tags:`
	`5`	`+ - data`
	`6`	`+header:`
	`7`	`+ overlay_image: /assets/images/cool-backgrounds/cool-background9.png`
	`8`	`+ overlay_filter: 0.1`
	`9`	`+ caption: 'Photo credit: [coolbackgrounds.io](https://coolbackgrounds.io/)'`
	`10`	`+last_modified_at: 2022-03-03`
	`11`	`+---`
	`12`	`+`
	`13`	`+For people who make careers out of data, data scientists don't have nearly`
	`14`	`+enough experience in data collection, and many data scientists don’t even seem`
	`15`	`+to feel the need to develop experience collecting data.`
	`16`	`+`
	`17`	`+Puzzlingly, this trend doesn’t seem to be true of other forms of unglamorous`
	`18`	`+data work like data cleaning (where people generally accept that [data cleaning`
	`19`	`+is not grunt`
	`20`	`+work](https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt)).`
	`21`	`+`
	`22`	`+With this blog post I want to give a defense of data collection — not as an`
	`23`	`+activity that’s inherently worthwhile pursuing (I assume data scientists don’t`
	`24`	`+need to be convinced of that!), but as something that is worth doing even for`
	`25`	`+selfish reasons. Why should you spend time learning about that data`
	`26`	`+collection system that's being maintained by that other team at work? Why`
	`27`	`+should you consider collecting some data for your next side project? What's in`
	`28`	`+it for _you_?`
	`29`	`+`
	`30`	`+Throughout this blog post, I’ll be making comparisons to a recent project of`
	`31`	+mine, [`cryptics.georgeho.org`](https://cryptics.georgeho.org/), a dataset of
	`32`	`+cryptic crossword clues.`
	`33`	`+`
	`34`	`+## Learn Data-Adjacent Technologies`
	`35`	`+`
	`36`	`+The most obvious reason is that data collection is a fantastic opportunity to`
	`37`	`+familiarize yourself with many staple technologies in data - and there aren't`
	`38`	`+that many side projects that run the entire data tech stack!`
	`39`	`+`
	`40`	`+To enumerate:`
	`41`	`+`
	`42`	`+- Compute services`
	`43`	`+ - Your data collection pipelines will obviously need to run somewhere. Will`
	`44`	`+ that be in the cloud, or on your local computer? How do you think about`
	`45`	`+ trading off cost, compute and convenience?`
	`46`	`+ - I ran most of my web scraping on DigitalOcean Droplets, but I could just`
	`47`	`+ as easily have taken the opportunity to learn more about cloud compute`
	`48`	`+ solutions or serverless functions like AWS EC2 or Lambda. These days, the`
	`49`	`+ project runs incremental scrapes entirely on my laptop.`
	`50`	`+- Data storage`
	`51`	`+ - You’ll need to store your data somewhere, whether it be a relational or`
	`52`	`+ NoSQL database, or just flat files. Since your data will outlive any code`
	`53`	`+ you write, careful design of the data storage solution and schema will`
	`54`	`+ pay dividends in the long run.`
	`55`	`+ - I used SQLite for its simplicity and performance. However, as the scope`
	`56`	`+ of the project expanded, I had to redesign the schema multiple times,`
	`57`	`+ which was painful.`
	`58`	`+- Labeling, annotation or other data transformations`
	`59`	`+ - After collecting your data, you may want to label, annotate or otherwise`
	`60`	`+ structure or transform your data. For example, perhaps you’ll want to`
	`61`	`+ pull structured tabular data out of unstructured PDFs or HTML tag soups;`
	`62`	`+ another example might be to have a human label the data.`
	`63`	`+ - This is the main "value-add" of your dataset — while the time and effort`
	`64`	`+ required to collect and store the data constitutes a moat, ultimately`
	`65`	`+ what will distinguish your dataset to users will be the transformations`
	`66`	`+ done here.`
	`67`	+ - For me, this involved a lot of `BeautifulSoup` to parse structured data
	`68`	`+ out of HTML pages. This required a [significant amount of development and`
	`69`	`+ engineering`
	`70`	`+ effort](https://cryptics.georgeho.org/datasheet#collection-process).`
	`71`	`+- Data licensing and copyright`
	`72`	`+ - Once you have your dataset, what is the legality of licensing, sharing or`
	`73`	`+ even selling your data? The legality of data is a huge grey area`
	`74`	`+ (especially if there’s any web scraping involved), and while navigating`
	`75`	`+ these waters will be tricky, it’s instructive to learn about it.`
	`76`	`+ - I feel like the collection and structuring of cryptic crossword clues for`
	`77`	`+ academic/archival purposes was fair use, and so didn’t worry too much`
	`78`	`+ about the legality of my project — but it was an educational rabbit hole`
	`79`	`+ to fall down!`
	`80`	`+- Sharing and publishing data`
	`81`	`+ - The legal nuances of data aside, the technical problem of sharing data is`
	`82`	`+ pretty tricky!`
	`83`	`+ - This problem sits at the intersection of MLOps and information design:`
	`84`	`+ you want to share the data in a standardized way, while having an`
	`85`	`+ interface that makes it easy for users to explore your data. Serving a`
	`86`	`+ tarball on a web server technically works, but leaves so much on the`
	`87`	`+ table.`
	`88`	+ - `cryptics.georgeho.org` uses [Datasette](https://datasette.io/), which I
	`89`	`+ can’t recommend highly enough.`
	`90`	`+- Writing documentation`
	`91`	`+ - If you think it’s hard to write and maintain good documentation for`
	`92`	`+ software, imagine how difficult it must be to do the same for data, which`
	`93`	`+ outlives software and is much harder to both create and version control.`
	`94`	`+ - I’ve found [Gebru et al.’s Datasheets for`
	`95`	`+ Datasets](https://arxiv.org/abs/1803.09010) to be an excellent template`
	`96`	`+ for documenting data.`
	`97`	`+`
	`98`	`+## Design a Data Collection System`
	`99`	`+`
	`100`	`+More importantly, starting a small data collection project is a great way to`
	`101`	`+get experience designing an entire data pipeline from end to end. This kind of`
	`102`	`+opportunity doesn't come easily (even in industry!), and while your data`
	`103`	`+pipeline won't be as sophisticated as the kinds you'll find in data companies,`
	`104`	`+you'll be able to take away some valuable lessons from it.`
	`105`	`+`
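
The storage and transformation bullets in the draft (SQLite for storage, parsing structured data out of HTML, incremental scrapes) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's actual code: the table names, columns, the `clue` CSS class, and the helper functions are all hypothetical, and it swaps in Python's standard-library `html.parser` for the `BeautifulSoup` the draft mentions so that it runs without third-party dependencies.

```python
import sqlite3
from html.parser import HTMLParser


class ClueParser(HTMLParser):
    """Collect the text of every <li class="clue"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.clues = []
        self._in_clue = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "li" and ("class", "clue") in attrs:
            self._in_clue = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_clue:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_clue:
            self.clues.append("".join(self._buffer).strip())
            self._in_clue = False


def init_db(conn):
    # Assumed design: raw pages and parsed clues live in separate tables,
    # so the parsing step can be rerun without scraping again.
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS raw_pages (
            url  TEXT PRIMARY KEY,
            html TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS clues (
            url  TEXT NOT NULL REFERENCES raw_pages (url),
            clue TEXT NOT NULL,
            UNIQUE (url, clue)
        );
        """
    )


def store_page(conn, url, html):
    """Store a scraped page unless the URL is already present.

    Returns True if the page was newly stored, which is what makes an
    incremental scrape cheap: already-seen URLs are skipped.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO raw_pages (url, html) VALUES (?, ?)", (url, html)
    )
    return cur.rowcount == 1


def parse_stored_pages(conn):
    """Parse every stored page into the clues table; safe to run repeatedly."""
    rows = conn.execute("SELECT url, html FROM raw_pages").fetchall()
    for url, html in rows:
        parser = ClueParser()
        parser.feed(html)
        parser.close()
        conn.executemany(
            "INSERT OR IGNORE INTO clues (url, clue) VALUES (?, ?)",
            [(url, clue) for clue in parser.clues],
        )
    conn.commit()
```

Keeping the raw HTML alongside the parsed rows is one way to soften the schema-redesign pain the draft describes: if the `clues` table needs to change, it can be dropped and rebuilt from `raw_pages` without scraping anything again.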

0 commit comments
