Labs was bristling with discussion and creation this week, with major improvements to two projects, interesting conversations around a few others, and an awesome new blog post.
Data Pipes is a Labs project that provides a web API for a set of simple data-transforming operations that can be chained together in the style of Unix pipes.
This past week, Andy Lulham has made a huge number of improvements to Data Pipes. Just a few of the new features and fixes:
strip (removes empty rows); tail (truncates a dataset to its last rows); a range function and a "complement" switch for cut; options for grep.
Have a look at the closed issues to see more of what Andy has been up to.
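To make the pipe-style chaining concrete, here is a minimal local sketch in Python of how operations such as strip, grep, cut and tail could compose. It is not Data Pipes' actual implementation or web API: the function signatures, the `pipe` helper and the exact behaviour of each operation (for example, `cut` keeping the listed columns as Unix `cut` does, with `complement=True` inverting the selection) are assumptions made for illustration.

```python
import re
from functools import reduce


def strip(rows):
    """Drop rows in which every cell is empty or whitespace."""
    return [row for row in rows if any(cell.strip() for cell in row)]


def tail(n):
    """Return an operation that keeps only the last n rows."""
    def op(rows):
        return rows[-n:] if n > 0 else []
    return op


def cut(columns, complement=False):
    """Return an operation that keeps the given column indices.

    Assumption: semantics follow Unix cut, so complement=True keeps
    every column EXCEPT the listed ones.
    """
    wanted = set(columns)

    def op(rows):
        return [
            [cell for i, cell in enumerate(row) if (i in wanted) != complement]
            for row in rows
        ]
    return op


def grep(pattern):
    """Return an operation that keeps rows with a cell matching the regex."""
    regex = re.compile(pattern)

    def op(rows):
        return [row for row in rows if any(regex.search(c) for c in row)]
    return op


def pipe(rows, *operations):
    """Feed the rows through each operation in order, like a Unix pipeline."""
    return reduce(lambda data, op: op(data), operations, rows)


rows = [
    ["name", "city", "year"],
    ["", "", ""],
    ["Ada", "London", "1815"],
    ["Alan", "London", "1912"],
    ["Grace", "New York", "1906"],
]

# Roughly: strip | grep London | cut --complement 1 | tail 1
result = pipe(rows, strip, grep("London"), cut([1], complement=True), tail(1))
print(result)
```

Each operation takes a list of rows and returns a new list of rows, which is what lets them be chained in any order.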
Last week we introduced you to Webshot, a web API for screenshots of web pages.
Back then, Webshot’s home page was just a screenshot of GitHub. Now Webshot has a proper home page with a form interface to the API.
Webshot has also added support for full page screenshots. Now you can capture the whole page rather than just its visible portion.
Labs member Tarek Amr has contributed an awesome post on Python natural language processing with the NLTK toolkit to the Labs blog.
"The beauty of NLP," Tarek says, "is that it enables computers to extract knowledge from unstructured data inside textual documents." Read his post to learn how to do text normalization, frequency analysis, and text classification with Python.
Wouldn’t it be nice to be able to initialize new Data Packages as easily as you can initialize a Node module with npm init?
Max Ogden started a discussion thread around this enticing idea, eventually leading to Rufus Pollock booting a new repo for dpm, the Data Package Manager. Check out dpm’s Issues to see what needs to happen next with this project.
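As a rough illustration of what an "init" step for a Data Package might do, the sketch below scans a directory for CSV files and writes a minimal `datapackage.json` descriptor with a `name` and a list of `resources`. This is not how dpm works: `init_datapackage` is a hypothetical helper, and the choice to list only CSV files by file name is an assumption.

```python
import json
import tempfile
from pathlib import Path


def init_datapackage(directory, name):
    """Write a minimal datapackage.json listing the CSV files in directory.

    Hypothetical helper for illustration; not part of dpm.
    """
    directory = Path(directory)
    descriptor = {
        "name": name,
        "resources": [
            {"path": csv_file.name}
            for csv_file in sorted(directory.glob("*.csv"))
        ],
    }
    target = directory / "datapackage.json"
    target.write_text(json.dumps(descriptor, indent=2))
    return descriptor


# Demonstrate in a throwaway directory so nothing is written to the cwd.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "gdp.csv").write_text("year,value\n2012,1\n")
    descriptor = init_datapackage(tmp, "gdp")
    written = json.loads(Path(tmp, "datapackage.json").read_text())

print(json.dumps(descriptor, indent=2))
```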
Nomenklatura is a Labs project that does data reconciliation, making it possible "to maintain a canonical list of entities such as persons, companies or even streets and to match messy input, such as their names, against that canonical list".
Friedrich Lindenberg has noted on the Labs mailing list that Nomenklatura has some serious problems, and he has proposed "a fairly radical re-framing of the service".
The conversation around what this re-framing should look like is still underway. Check out the discussion thread and jump in with your ideas.
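For readers new to reconciliation, the core idea of matching messy names against a canonical list can be sketched with Python's standard `difflib` module. This is only an illustration of the general technique, not Nomenklatura's algorithm or API: the `reconcile` function, the sample entity list and the 0.6 similarity cutoff are all assumptions.

```python
import difflib

# Hypothetical canonical list of entities.
CANONICAL = ["Open Knowledge Foundation", "Google", "GitHub"]


def reconcile(name, canonical, cutoff=0.6):
    """Return the canonical entry closest to a messy name, or None.

    Comparison is case-insensitive and ignores surrounding whitespace;
    cutoff is the minimum difflib similarity ratio (0 to 1) to accept.
    """
    lookup = {entry.lower(): entry for entry in canonical}
    matches = difflib.get_close_matches(
        name.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


print(reconcile("  open knowlege foundation ", CANONICAL))
print(reconcile("Acme Widgets", CANONICAL))
```

Returning `None` for names with no sufficiently close match leaves room for a human to decide whether the input is a new entity or a badly mangled existing one.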
Last week, the idea of Data Issues was floated: using GitHub Issues to track problems with public datasets. The idea has generated a few comments, and we’d love to hear more.
Discussion on the Labs list highlighted another benefit of using GitHub. Alioune Dia suggested that Data Issues should let users register to be notified when a particular issue is fixed. But Chris Mear pointed out that GitHub already makes this possible: "Any GitHub user can ‘follow’ a specific issue by using the notification button at the bottom of the issue page."
Anyone can join the Labs community and get involved! Read more about how you can join the community and participate by coding, wrangling data, or doing outreach and engagement. Also check out the Ideas Page to see what’s cooking in the Labs.