At Wikicite 2017,
discussions revolved around an ambitious goal to use Wikidata to create
a central citation database for all the references on Wikipedia.
Citations are the essential building blocks of verifiability, a core tenet of Wikipedia. This project aims to give citations the first-class treatment they deserve.
We saw three important questions emerge at the conference:
What does a good citation database look like?
How can we build this on Wikidata?
How can we integrate this with Wikipedia?
These are hard questions. To answer them, Wikicite brought together experts from many communities, including ontologists and librarians specializing in citation and reference modeling.
Wikicite may be young, but clear progress has been made already. Wikidata now
boasts some great collections of bibliographic data, like the Zika corpus and data from the Gene Wiki project. Some Wikipedias, like French and Russian, are experimenting with generating citations using Wikidata. Some citation databases are integrated with VisualEditor to make it easy to add rich citations on Wikipedia, citations that can hopefully one day be added to Wikidata for further reuse and tracking.
There are still a few features that Wikidata needs in order to be a first-class host for citation data. Even the best-structured data takes time to define in Wikidata's precise terms of items, properties, qualifiers, and references. Although there are some handy tools on Wikidata for bulk actions, using them often means reshaping your dataset to match a tool's specific format, or writing bespoke code for your dataset. It's still challenging to ensure data is high quality, well-sourced, and ready for long-term maintenance.
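To make those precise terms concrete, here is a rough sketch (not a real API payload) of how a single journal-article citation decomposes into Wikidata-style statements. The property and item IDs are the ones commonly used for scholarly articles, but treat them as assumptions and double-check them before relying on this.

```python
# Rough sketch: one citation expressed as Wikidata-style statements.
# Property/item IDs are the ones commonly used for scholarly articles;
# verify them before use. Values marked "placeholder" are made up.
citation_statements = {
    "P31": "Q13442814",                   # instance of: scholarly article
    "P1476": "Example article title",     # title (placeholder)
    "P577": "2016",                       # publication date (placeholder)
    "P1433": "Q_JOURNAL",                 # published in: the journal's item (placeholder QID)
    "P356": "10.1000/EXAMPLE",            # DOI (placeholder)
    "P2093": ["A. Author", "B. Author"],  # author name strings (placeholders)
}
```

Each key is a property and each value a claim; a real upload also needs qualifiers (page numbers, the ordinal of each author) and references, which is where much of the modeling time goes.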
In
listening to researchers’ talks, discussing with experts in working
groups, and workshopping code with some of Wikidata’s soon-to-be biggest
users, we determined that Wikidata needs seven features for true
Wikicite readiness:
Bulk data import. There must be an easy process for loading large amounts of data onto Wikidata. There are a few partial tools, like QuickStatements,
which, while itself aptly-named, is just one part of an often-arduous
workflow. Other people have written custom bots to import their specific datasets, building on libraries like Wikidata Integrator or pywikibot. Without help from an experienced Wikidata contributor, there is no easy self-service way to move data in bulk.
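For a sense of what such a custom bot looks like today, here is a minimal sketch of a bulk-import loop using pywikibot. It assumes a configured user-config.py and login, and uses placeholder (item, property, target) triples; it also skips the reconciliation, deduplication, and referencing work that makes real imports arduous.

```python
import pywikibot

# Minimal bulk-import loop (sketch). Assumes pywikibot is already
# configured (user-config.py) and logged in. Triples are placeholders;
# Q4115189 is the Wikidata sandbox item.
site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

triples = [
    ("Q4115189", "P31", "Q13442814"),  # sandbox item -> instance of: scholarly article
]

for subject, prop, target in triples:
    item = pywikibot.ItemPage(repo, subject)
    claim = pywikibot.Claim(repo, prop)
    claim.setTarget(pywikibot.ItemPage(repo, target))
    item.addClaim(claim, summary="example bulk import")
```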
Sets. Wikidata needs a feature to track and curate specific groups of items. Sets are a necessary concept to answer questions about a complete group. Right now, you can use Wikidata to tell you facts about the states in Austria, but it cannot tell you the complete list of all states in Austria. Sets are key for curators to perform this sort of cross-sectional data management.
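To illustrate the gap: you can ask the Wikidata Query Service which states of Austria it knows about, but nothing in the answer tells you whether the list is complete. A minimal sketch, assuming Austria is Q40 and P150 ("contains the administrative territorial entity") is the relevant property:

```python
import requests

# Ask the Wikidata Query Service for the states of Austria (Q40) via P150.
# The result is whatever statements happen to exist; there is no built-in
# way to assert, or check, that the returned set is complete.
QUERY = """
SELECT ?state ?stateLabel WHERE {
  wd:Q40 wdt:P150 ?state .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wikicite-example/0.1 (demo)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["stateLabel"]["value"])
```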
Data management tools.
Data curators need tools to monitor data of interest. Wikidata is big.
Basic tools like watchlists were designed for Wikipedia articles, on a much smaller scale and with a much coarser granularity than the
Wikidata statement or qualifier. An institution that donates data to
Wikidata may want to monitor thousands (maybe hundreds of thousands) of
items and properties. Donors of complete datasets will want to watch
their data for deletions, additions, and edits.
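Today, the closest self-service approach is to poll the MediaWiki API for the latest revision of each item you care about. A rough sketch (with placeholder QIDs) shows why that falls short: it is per-item and per-revision, it says nothing about which statement or qualifier changed, and it does not scale to hundreds of thousands of items.

```python
import requests

# Poll the latest revision of a few watched items (placeholder QIDs).
# Per-item, per-revision only: no statement-level granularity, and
# impractical for donors watching very large numbers of items.
WATCHED = ["Q42", "Q4115189"]

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "titles": "|".join(WATCHED),
        "rvprop": "timestamp|user|comment",
        "format": "json",
        "formatversion": "2",
    },
    headers={"User-Agent": "wikicite-example/0.1 (demo)"},
)
for page in resp.json()["query"]["pages"]:
    rev = page["revisions"][0]
    print(page["title"], rev["timestamp"], rev["user"], rev["comment"])
```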
Grouping edit events.
At the moment, many community members are adding data to Wikidata in
bulk, but this is a fact that Wikidata's user interface struggles to represent. Wikidata currently offers a piecemeal history of users'
individual edits, and encourages editors to add citations and references
for individual statements. These features are vital, but we need a
higher-level grouping feature for higher-level data uploads. For
instance, it would be helpful to have an “upload ID” for associated
edits across many claims. It would also be useful to have a dedicated
namespace for human- and machine-readable documentation of the data load
process, a kind of README that addresses the whole action. This kind of
documentation not only helps community members get answers to questions
before, during, and after large-scale activity, but it also helps
future data donors learn about and follow best practices.
A draft or staging space.
There should be a way for people to add content to Wikidata without
directly modifying “live” data. Currently, when something is added to
Wikidata it is immediately mixed in with everything else. It’s daunting
for new users to have to get it right on the first try, let alone take
quick corrective action in the face of inevitable mistakes. Modeling a
dataset in Wikidata’s terms requires using Wikidata’s specific
collection of items and properties. You may not see how your data fits
into Wikidata—perhaps requiring new properties and items—until you begin
to add it. Experienced Wikidata volunteers may review data to ensure
it’s high quality, but it would be better to enable this collaborative
process before data is part of the project’s official collection. You
should be able to upload your data to a staging space on Wikidata,
ensure it’s high quality and properly structured, and then publish it
when it’s ready. The PrimarySources tool is
a community-driven start to this, but such a vital feature needs
support from the core. In the longer term, this feature is a small step
toward maximizing Wikidata consistency, by setting the stage to
transactionally add and modify large-scale data. It would be helpful to have data cleanup tools, similar to OpenRefine, available for data staging.
Data models.
Wikidata needs new ways to collaborate on new kinds of items.
Specifically, we need a better way to reach consensus on models for
certain standard types of data. Currently, it's possible to describe the
same entity in multiple ways, and lacking a forum for this process, it’s
hard to discuss the differences. See, for instance, the drastically
different ways that various subway lines are
described as Wikidata items. Additionally, some models may want to
impose certain constraints on instances, or at least indicate if an item
complies with its model. Looking to the future, tools for collaborative data modeling could grow to include a library of data models unlike any other.
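As a purely hypothetical illustration of the problem (labels only, no real items or properties), two subway lines might be modeled along these lines today, with no agreed place to decide which shape is right:

```python
# Hypothetical sketch: the same kind of thing modeled two different ways.
line_a = {
    "instance of": "rapid transit line",
    "part of": "Example City Metro",
    "terminus": ["North Station", "South Station"],
}
line_b = {
    "instance of": "railway line",
    "transport network": "Example City Metro",
    "connects with": ["Line A"],
    # termini not recorded here at all, only on the station items
}
```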
Point in time links.
There should be a way to share a dataset from Wikidata at a given point
in time. Wikidata, like Wikipedia, is continuously changing. Wikipedia
supports linking to a specific revision of an article at a point in time
using a permalink, and you can do the same for a specific Wikidata
item. However, Wikidata places special emphasis on relationships between
items, yet does not extend the permalink feature to these
relationships. If you run a query on the Wikidata Query Service (the SPARQL endpoint for Wikidata), and then share the query with someone else, they may see different results.
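For a single item you can already pin a point in time: the wiki's permalinks use a revision ID, and at the time of writing Special:EntityData accepts a revision parameter. There is no comparable way to pin a query result over many items. A sketch, with a placeholder revision ID:

```python
import requests

# Fetch one item as it looked at a specific revision (ID is a placeholder).
# Nothing comparable exists for pinning an entire SPARQL result set in time.
resp = requests.get(
    "https://www.wikidata.org/wiki/Special:EntityData/Q42.json",
    params={"revision": "1234567"},
    headers={"User-Agent": "wikicite-example/0.1 (demo)"},
)
entity = resp.json()["entities"]["Q42"]
print(entity["labels"]["en"]["value"])
```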
These
seven features came up consistently across several groups and
discussions at Wikicite. The conference was a room full of problem solvers, and several good projects are already underway to provide community-based solutions in these areas. Among the handful started at the conference, we are pleased to share that we've started work on Handcart, a tool for simplifying medium-sized bulk imports, for citation data and much, much more. We believe trying to fix a problem is the best way
to learn its details and nuances.
Wikicite
made a strong case that Wikidata has a lot of valuable potential for
citations, and citations are crucial for Wikipedia. As we work to
address these missing features in Wikidata, we are happy to be part of
the Wikicite movement to build a more verifiable world.