Wikimedia Meta-Wiki

Massively-Multiplayer Online Bibliography

From Meta, a Wikimedia project coordination wiki

Massively-Multiplayer Online Bibliography (MMOB) is the name for a series of projects aiming to perform significant feats of online bibliography in a fun, collaborative, and principled way, that would be useful to everyone and acceptable to professionals. It will rely on volunteer labor, free software, and open Web standards.

The Aboutness Project

The Need

There are hundreds of millions of essays and articles out there. Many of them are already available online, in one or another of the large free repositories, such as Project Gutenberg, the Internet Archive, the Hathi Trust, etc.

However, while their full text is available, searchable, and indexed by search engines -- hence discoverable when searching by words in the text -- there is no good way to distinguish between the tens of thousands of articles that mention Timbuktu and the significantly fewer that are about Timbuktu.

"Oh, but Google will send you to good relevant articles about Timbuktu", you might say. That is most probably true for most topics. However, given the way PageRank and similar algorithms work, it would only send you to those resources already identified as relevant or useful by people linking to them, which reinforces and recirculates largely the same group of resources for most queries. How would we ever discover additional resources already available to us in the large and growing open repositories?

Traditional library catalogues offer (human-curated) "aboutness" statements for catalogue items. However, catalogue items are typically book volumes rather than individual essays and articles. Thus, a library catalogue will tell us Francis Bacon's Essays (Dover edition) is about "1. English essays -- Early modern. 1500-1700." Not terribly useful, is it? That, by itself, does not tell us that some of these essays are about "truth", "envy", "sedition", "revenge", etc.

T. S. Eliot's The Sacred Wood is about "Criticism" and "Literature", according to the Library of Congress, but this does not help us discover the influential essays on "Hamlet and His Problems" or "Tradition and the Individual Talent" inside. Let us also remember that even tables of contents are not enough: another essay in Eliot's book is called "A Romantic Aristocrat"; a fine title, but it gives no clue as to who it is about. Lord Byron, perhaps?

Conversely, if we had an extensive data collection of "aboutness" statements (essay X is about topic Y), much of the currently-invisible cultural wealth already available online would become discoverable, and therefore found, read, used, discussed, and built upon, once again enriching our present and future culture and research. It would tell us, for example, that "A Romantic Aristocrat" is in fact about George Wyndham, and it would contribute to a large collection's ability to answer the question "What works do you have that are about (not just mention) George Wyndham?". Wouldn't that be tremendously helpful?

The Proposed Solution

Wouldn't it be nice to be able to browse a huge list of essays -- by language, period, author, title -- pick one that you'd like to read, read it, and then pick one or more topics it was about, from a standardized tree-like list of topics? Or to go over other people's previous classifications and endorse or question them with a single click, to help create a more robust result?

We can build such a collection, one essay and one "aboutness" statement at a time. And we can do so in a way that builds on and interoperates with other large-scale bibliographic efforts, so nothing is wasted.

Essentially, we would build a crowdsourced curation system that would attach multiple "aboutness" values to each individual work (article, essay):

Volunteers would read a work, pick a language to classify in (remember: different classification schemes break the universe down into different ontologies), pick a classification source to classify by, where more than one is available (e.g. Library of Congress Subject Headings, Wikipedia article titles, Library of Congress item titles), be presented with a convenient, browsable, navigable, searchable tree-like view of classifications, and select one or more classifications to attach to the work.
Volunteers would also be able to "upvote" or "downvote" other volunteers' classifications, to help gain confidence in some classifications over others. (This later allows a user searching for material to constrain the search to, for example, only works that have a particular classification at confidence level 3 or more, if an unconstrained search produced too many false positives.)
Users would be able to search for materials on the open Web according to one or more of these classifications.
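
The three roles above (classifying, voting, searching) can be sketched as a small data model. This is only an illustration: the field names, the example work URIs, and above all the idea that "confidence" equals upvotes minus downvotes are assumptions, since the project has not yet defined how confidence would be computed.

```python
from dataclasses import dataclass


@dataclass
class AboutnessStatement:
    """One claim that a work is about a topic, plus the community's votes on it."""
    work_uri: str      # identifier of the essay or article
    topic_uri: str     # identifier of the topic in some classification source
    source: str        # classification source, e.g. "LCSH" or "Wikidata"
    language: str      # language the volunteer classified in
    upvotes: int = 0
    downvotes: int = 0

    @property
    def confidence(self) -> int:
        # Assumption: confidence is simply net votes; the project has not defined it.
        return self.upvotes - self.downvotes


def works_about(statements, topic_uri, min_confidence=None):
    """Return sorted URIs of works classified under topic_uri.

    With min_confidence=None the search is unconstrained; otherwise only
    statements at or above that confidence level are considered.
    """
    return sorted({
        s.work_uri
        for s in statements
        if s.topic_uri == topic_uri
        and (min_confidence is None or s.confidence >= min_confidence)
    })


# Hypothetical work URIs; the topic URL is the Wikidata item for the play Hamlet.
HAMLET = "http://www.wikidata.org/wiki/Q41567"
statements = [
    AboutnessStatement("http://aboutness.org/work/example_1", HAMLET,
                       "Wikidata", "en", upvotes=4, downvotes=1),
    AboutnessStatement("http://aboutness.org/work/example_2", HAMLET,
                       "Wikidata", "en", upvotes=1),
]

print(works_about(statements, HAMLET))                    # unconstrained search
print(works_about(statements, HAMLET, min_confidence=3))  # well-endorsed only
```

Keeping the raw vote counts, rather than a precomputed score, leaves room to change the confidence formula later without losing data.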

Bibliographical Aspects

Stable URIs

To create an aboutness statement, we need stable identifiers for both the individual work and the topic. Most databases do not, today, catalog at the individual work level, so there's much work to be done.

We can begin with ad-hoc subdivisions (e.g. Project Gutenberg text number N, article number M, can be made into an ID of the form http://aboutness.org/work/pg_N_item_M)
We can also begin by working only on essays contained in databases that do catalog at the work level (e.g. Project Ben-Yehuda [disclosure: Ijon is its founding editor])
Gradually, the Table of Contents for Everything project will provide us with stable URIs for more and more essays and articles we can classify.
As always in linked open data, sameAs relationships can subsequently be established between whatever URIs we end up using and the URIs of major databases (e.g. Library of Congress), when they get around to cataloging at work level.
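
A minimal sketch of the ad-hoc ID scheme and the later sameAs links might look like the following. The URI pattern is the one given above; the function names, the rule that numbering starts at 1, the sample numbers, and the external URI are all made up for illustration.

```python
BASE = "http://aboutness.org/work/"  # the form proposed above; not a live service


def gutenberg_work_uri(text_number: int, item_number: int) -> str:
    """Build an ad-hoc work-level ID for item M inside Project Gutenberg text N."""
    # Assumption: both numbers are positive and counted from 1.
    if text_number < 1 or item_number < 1:
        raise ValueError("text and item numbers are expected to start at 1")
    return f"{BASE}pg_{text_number}_item_{item_number}"


# sameAs links, to be filled in once a major database catalogues the work itself.
same_as: dict[str, set[str]] = {}


def link_same_as(our_uri: str, external_uri: str) -> None:
    """Record that our ad-hoc URI and an external URI identify the same work."""
    same_as.setdefault(our_uri, set()).add(external_uri)


uri = gutenberg_work_uri(12345, 7)  # 12345 and 7 are made-up numbers
link_same_as(uri, "http://example.org/catalogue/work/abc")  # placeholder URI
print(uri)
print(same_as[uri])
```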

Subject authority data

There are already several authority files (i.e. sets of data including possible "subject headings" one might assign to a work for an "aboutness" statement) from libraries and related institutions published as (linked) open data on the web. For an overview see the datasets tagged with "authorities" on the Data Hub. Datasets from other institutions (e.g. Wikidata) might be relevant as well.

LCSH subject headings are already publicly available as linked data, and can be used for English-speaking classifiers. Similar "thesauruses" or classification trees need to be made accessible to allow volunteers to add classifications in other languages, for the benefit of searchers in other languages. This is a project for experienced volunteers who can "talk the talk" with bibliographers and national libraries.
Wikipedia/Wikidata entries can also be the topics of an essay.
A valuable source is the Virtual International Authority File (VIAF), which links many national authority files from all over the world into one virtual authority file. That means: if you have the VIAF ID for a subject, you get the corresponding subject headings from more than 30 other authority files, including Wikipedia. Website: http://viaf.org; full data dumps are available, see http://viaf.org/viaf/data/
The Integrated Authority File (GND) used for subject cataloging in German-speaking libraries and curated in a distributed fashion by many German libraries and library service centers is also available as Linked Open Data.
The Rameau subject headings used in France are also available as linked open data.
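
To illustrate why linked authority files matter for search, here is a rough sketch that expands one topic URI into all of its known equivalents, so that a classification made against GND can be found by someone searching with an LC or BnF identifier. The cluster below is typed by hand from the identifiers this page labels "Hamlet (work)"; that they really form one VIAF cluster is an assumption, and a real table would be derived from the VIAF data dumps.

```python
# Assumption: these identifiers, each labelled "Hamlet (work)" in the examples
# on this page, denote the same work. Not verified against VIAF itself.
CLUSTERS = {
    "http://viaf.org/viaf/176993890": {
        "http://id.loc.gov/authorities/names/n80008522",
        "http://d-nb.info/gnd/4099350-4",
        "http://data.bnf.fr/ark:/12148/cb11936813g",
    },
}

# Reverse index: any member URI (or the cluster ID itself) -> cluster ID.
MEMBER_TO_CLUSTER = {
    member: cluster_id
    for cluster_id, members in CLUSTERS.items()
    for member in members | {cluster_id}
}


def equivalent_topics(topic_uri: str) -> set[str]:
    """Return every URI known to denote the same topic, including topic_uri."""
    cluster_id = MEMBER_TO_CLUSTER.get(topic_uri)
    if cluster_id is None:
        return {topic_uri}  # unknown topic: no expansion possible
    return CLUSTERS[cluster_id] | {cluster_id}


for uri in sorted(equivalent_topics("http://d-nb.info/gnd/4099350-4")):
    print(uri)
```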

Simple examples

T.S. Eliot's "Hamlet and His Problems" -- could be classified as ABOUT (or dcterms:subject, etc.) --

http://id.loc.gov/authorities/names/n80008522 -- "Hamlet (work)" (from the Library of Congress Name Authority File)
http://id.loc.gov/authorities/subjects/sh85058566 -- "Hamlet (Legendary character)" (this is from the Library of Congress Subject Headings)
http://id.loc.gov/authorities/subjects/sh2008112835 -- "Theater--England--History--16th century" (likewise)
http://www.wikidata.org/wiki/Q2447542 -- "Prince Hamlet" (an item on Wikidata, about the fictional character Hamlet) -- sufficient to retrieve multi-lingual labels, link to Wikipedia articles, etc.
http://www.wikidata.org/wiki/Q41567 -- "Hamlet" (an item on Wikidata, about the play by Shakespeare) -- likewise
http://viaf.org/viaf/176993890 -- "Hamlet (work)" (from VIAF)
http://d-nb.info/gnd/4099350-4 -- "Hamlet (work)" (from GND)
http://d-nb.info/gnd/118545345 -- "Hamlet (fictive person/legendary figure)" (from GND)
http://data.bnf.fr/ark:/12148/cb11936813g -- "Hamlet (work)" (from Rameau)

9-19. other examples in English or other languages

All of these classifications are stored (either as Linked Data triples or in some conventional RDBMS [exposable as triples]) and can then be reviewed, revised, upvoted/downvoted, and of course searched.
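
As a rough sketch of what "stored as triples" could mean, the snippet below writes a few of the Hamlet classifications as N-Triples lines using the dcterms:subject predicate mentioned above. The work URI is hypothetical (no stable ID for the essay exists yet), and the function name is made up for illustration.

```python
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"

# Hypothetical work URI: no stable ID for this essay exists yet.
WORK = "http://aboutness.org/work/hamlet_and_his_problems"

# Topic URLs copied from the examples above.
topics = [
    "http://id.loc.gov/authorities/subjects/sh85058566",
    "http://www.wikidata.org/wiki/Q41567",
    "http://d-nb.info/gnd/4099350-4",
]


def to_ntriples(work_uri, topic_uris, predicate=DCTERMS_SUBJECT):
    """Serialise one work's aboutness statements as N-Triples lines."""
    return [f"<{work_uri}> <{predicate}> <{topic}> ." for topic in topic_uris]


for line in to_ntriples(WORK, topics):
    print(line)
```

The sketch keeps the Wikidata URL exactly as listed above; linked-data consumers usually expect Wikidata's entity/ form of the identifier instead of the wiki/ page address.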

Important note: The above is a mixture of library metadata and linked data geekery. If it makes no sense to you, please don't worry, you can still be involved in the project!

Relevant Data sets

Please help collect some information about available text repositories we might begin classifying.

Web site	Open texts?	Stable URIs?	Work-level URIs?	Notes
Wikisource	yes	yes	it's complicated :)^[1]	multi-lingual
Project Gutenberg	yes	yes	no	mostly in English, but other languages as well
Project Runeberg	yes	yes	no	mostly in Swedish
Project Ben-Yehuda	yes	yes	yes	all in Hebrew
Internet Archive	yes	yes	no	all languages
PubMed Central (Open Access Subset)	yes	yes	yes	English; biomedical research articles
...	...	...	...	...
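
One possible way to pick starting points from the table is to treat each row as a record and keep only the repositories that are open and already expose work-level URIs. The sketch below copies the rows as they stand; the type and function names are invented for illustration, and "it's complicated" is modelled as an unknown value.

```python
from typing import NamedTuple, Optional


class Repository(NamedTuple):
    name: str
    open_texts: bool
    stable_uris: bool
    work_level_uris: Optional[bool]  # None stands for "it's complicated"
    notes: str


# Rows copied from the table above (the open-ended "..." row is omitted).
REPOSITORIES = [
    Repository("Wikisource", True, True, None, "multi-lingual"),
    Repository("Project Gutenberg", True, True, False, "mostly in English"),
    Repository("Project Runeberg", True, True, False, "mostly in Swedish"),
    Repository("Project Ben-Yehuda", True, True, True, "all in Hebrew"),
    Repository("Internet Archive", True, True, False, "all languages"),
    Repository("PubMed Central (Open Access Subset)", True, True, True,
               "English; biomedical research articles"),
]


def ready_to_classify(repos):
    """Names of repositories that are open and already have work-level URIs."""
    return [
        r.name for r in repos
        if r.open_texts and r.stable_uris and r.work_level_uris is True
    ]


print(ready_to_classify(REPOSITORIES))
```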

There is a list of open collections curated by the OpenGLAM initiative that may be of interest in this context. Only a few of the collections contain textual material (mostly manuscripts); most are collections of digitized works of art, digitized photographs (sometimes containing manuscripts), digital sound, or digitized comics.

Technological Principles

All work will happen on the Web, via a modern browser. (i.e. no required downloads, no Flash, no IE6 :))
The Aboutness Project is humble: it seeks to create value in an underserved area (discoverability of non-academic non-fiction resources), in a non-exclusive and non-authoritative manner, and it makes no claim to being comprehensive (yet).
The Aboutness Project is a good netizen: we build on free software and open resources, and we aim to not duplicate efforts or reinvent wheels. We give back: our code and data will be placed in the public domain (and/or CC0).
The Aboutness Project starts with low-hanging fruit: We start with resources that are readily available with work-level URIs (e.g. some works on English Wikisource), and with authority data that's available and open. We'll learn as we go, and will gradually reach for higher fruit.
much more TBD

Technical Questions

Where do we have the conversation? -- on this wiki page? On a mailing list (which?)?
Shall we store the aboutness triples on Wikidata? We can share them or publish them in any number of ways, but what is to be our primary store? (storing them on Wikidata means a Wikidata item for every essay!)
Consider using CiTO^[2] or something like that?

How can I help?

Right now we're still hatching the idea. But down the road we'll need:

library metadata and linked open data geeks (MARC, Dublin Core, FRBR, SKOS, RDA, OAI-PMH, etc.)
Web hackers (Ruby, Python, JavaScript, PHP)
UI designers, usability experts, graphic artists
outreach volunteers (bloggers, social media gurus, Wikimedians, librarians)

I'm interested!

Great! Please sign your username below, and we'll get in touch when we set up a mailing list or something. Also, add this page to your watchlist and participate in the brainstorming! :)

Parallel project: The Table of Contents for Everything

A volunteer project to make detailed digital tables of contents freely available to all, with stable URIs for each work, to serve e.g. in the Aboutness Project above.

The Need

A huge number of books are now available as either scans/PDFs or text thanks to massive digitization projects such as the ones by the Internet Archive, Google, the Hathi Trust, etc.

Those projects focus on quantity over quality, leaving the meticulous improvement of metadata for later or, quite probably, never.

Among those books, the ones that are least well described by metadata are non-fiction collections -- essay collections, article anthologies, digests. That's because book-level metadata can never do justice to the multiple items inside.

The Aboutness Project (above) can help classify these individual works by their content, but it needs a way to refer to these individual works in the first place, and that's not available for individual essays in the book-level resources exposed by the aforementioned services.

The solution

The "proper" solution would, of course, be to change the way the content hosts (Internet Archive etc.) operate, and add work-level cataloging and content-management. Since that is a formidable task and beyond our control, what MMOB can do about it is this:

We can create an extrinsic catalogue for these works, all pointing at the one (book-level) resource, but featuring individual data entities for every work (essay, article) inside. Our data entities (themselves metadata for the actual content at the original host) can then be used in The Aboutness Project. The data would be created by volunteers, typing (or proofreading OCRed) tables of contents and identifying authors (with VIAF etc.).

No less importantly, the data entities we produce can serve as the basis for the original hosts' catalogue, if and when they begin supporting work-level cataloguing.
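
The extrinsic catalogue described above could be sketched as one small record per essay, each pointing back at the same book-level resource. Everything named here (the record fields, the ID format, the book key and URL) is a placeholder for illustration; the three titles are essays from Eliot's The Sacred Wood mentioned earlier on this page, listed in an arbitrary order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkEntry:
    """Metadata for one essay inside a book-level resource hosted elsewhere."""
    entry_id: str                # our own identifier for the essay
    book_url: str                # the single book-level resource at the host
    position: int                # order of the essay in the table of contents
    title: str
    author_viaf: Optional[str]   # author's VIAF URI, once a volunteer adds it


def entries_from_toc(book_key: str, book_url: str, toc_text: str) -> list[WorkEntry]:
    """Turn a typed or proofread table of contents (one title per line) into entries."""
    titles = [line.strip() for line in toc_text.splitlines() if line.strip()]
    return [
        WorkEntry(f"{book_key}_item_{n}", book_url, n, title, None)
        for n, title in enumerate(titles, start=1)
    ]


# Placeholder key, URL and ordering; not real catalogue data.
toc = """
Tradition and the Individual Talent
Hamlet and His Problems
A Romantic Aristocrat
"""
entries = entries_from_toc("sacred_wood", "https://example.org/books/sacred-wood", toc)
for entry in entries:
    print(entry.entry_id, "->", entry.title)
```

Because every entry carries the book-level URL, the original host could later import these records as a starting point for its own work-level catalogue.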

More description TBD.

See also

WikiCite/Shared Citations

References

Retrieved from "https://meta.wikimedia.org/w/index.php?title=Massively-Multiplayer_Online_Bibliography&oldid=26914041"