Grants:Project/Rapid/Hjfocs/soweego 1.1
Project Goal
soweego[1] links Wikidata items to large external catalogs.
It is an artificial intelligence system that relies on multiple machine learning[2] algorithms, called linkers.
Its vision is to make Wikidata the nucleus of the open data landscape.
The main goal of this proposal is to automatically get the highest-quality links by bringing soweego linkers together: unity is strength.
Problem
Pretty much like a human, soweego claims that a given Wikidata item links to a given catalog identifier with different levels of confidence.
Currently, it only considers the confidence yielded by one linker (the best), thus not leveraging any relationship or information captured by others. That is to say, the system has only one pair of eyes, but it could indeed benefit from extra viewpoints.
Therefore, we can improve the quality and quantity of links by letting soweego linkers join forces.
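To make the difference concrete, here is a minimal sketch that contrasts keeping one linker's confidence with averaging the confidence of several linkers. The linker names, identifier pairs, scores, and the plain average are all illustrative assumptions, not soweego's actual data or code.

```python
from statistics import mean

# Hypothetical confidence scores, one dictionary per linker.
# Keys are (Wikidata QID, catalog identifier) pairs; all values are made up.
scores = {
    "linker_a": {("Q1", "id1"): 0.90, ("Q2", "id2"): 0.40},
    "linker_b": {("Q1", "id1"): 0.80, ("Q2", "id2"): 0.70},
    "linker_c": {("Q1", "id1"): 0.85, ("Q2", "id2"): 0.75},
}


def best_linker_only(scores, best):
    """Keep the confidence of a single linker and ignore the others."""
    return dict(scores[best])


def averaged(scores):
    """Combine all linkers by averaging the confidence they give each pair."""
    pairs = set()
    for linker_scores in scores.values():
        pairs.update(linker_scores)
    return {
        pair: mean(s[pair] for s in scores.values() if pair in s)
        for pair in pairs
    }


single = best_linker_only(scores, "linker_a")
combined = averaged(scores)
```

In this toy data, the single linker gives the second pair a confidence of 0.40, while the other two linkers pull the average up to roughly 0.62, so the extra viewpoints can change whether a link passes a confidence threshold.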
Solution
Machine learning algorithms capture information in heterogeneous ways, and they have been shown to perform better together, rather than alone.[3] [4] [5]
We propose to build an ensemble system,[6] and to implement it as an enhancement of the soweego linker module.[7]
Furthermore, linkers can behave differently depending on the external catalog. Hence, it is important to automatically tune the weight of each linker in the ensemble. Finally, we will automatically set the optimal parameters of each linker through cross-validation[8] techniques.
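As a rough illustration of weight tuning, the sketch below combines three linkers through a weighted average of their confidence scores, picks the weights with a small grid search, and uses k-fold cross-validation to estimate how well the chosen weights generalize. Everything in it is an assumption made for the example: the validation samples, the weight grid, the 0.5 threshold, and the function names are hypothetical and do not come from the soweego code base.

```python
from itertools import product

# Hypothetical validation samples for one catalog: the confidence given by
# each of three linkers to a candidate link, and whether the link is correct.
samples = [
    ((0.9, 0.4, 0.8), True),
    ((0.4, 0.7, 0.6), True),
    ((0.6, 0.3, 0.2), False),
    ((0.2, 0.6, 0.3), False),
    ((0.7, 0.8, 0.4), True),
    ((0.3, 0.2, 0.6), False),
]


def combine(confidences, weights):
    """Weighted average of the linker confidences."""
    return sum(c * w for c, w in zip(confidences, weights)) / sum(weights)


def accuracy(data, weights, threshold=0.5):
    """Share of samples where the thresholded ensemble matches the label."""
    hits = sum((combine(c, weights) >= threshold) == label for c, label in data)
    return hits / len(data)


def best_weights(data, grid):
    """Grid search: return the first weight vector with the highest accuracy."""
    return max(grid, key=lambda w: accuracy(data, w))


def cross_validate(data, grid, k=3):
    """Pick weights on k-1 folds, score them on the held-out fold, average."""
    folds = [data[i::k] for i in range(k)]
    held_out_scores = []
    for i, test_fold in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        held_out_scores.append(accuracy(test_fold, best_weights(train, grid)))
    return sum(held_out_scores) / k


# Candidate weights: every combination of 0, 1, 2 except the all-zero vector.
grid = [w for w in product((0, 1, 2), repeat=3) if sum(w) > 0]

weights = best_weights(samples, grid)      # weights tuned on all samples
cv_score = cross_validate(samples, grid)   # estimate of held-out accuracy
```

In this made-up data each linker alone misclassifies two of the six samples, while a suitable weighting classifies all of them correctly. Running the same procedure separately on each catalog's validation data would yield one weight vector per catalog.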
Project Plan
Activities
- State of the art: explore best practices in ensemble learning and investigate related approaches applied to soweego's task, namely record linkage;[9]
- add decision trees[10] to the current pool of linkers;
- develop the ensemble system;
- implement automatic hyperparameter tuning of linkers;
- implement automatic weighting of each linker, for each supported catalog;
- evaluate performance and compare to previous results without ensemble;
- write reports and include them in an MSc thesis at the University of Trento (Q930528), supervised by Hjfocs.
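The evaluation activity could be sketched as follows, assuming that links are compared against a set of known-correct links with precision, recall, and F1. The proposal does not name its metrics, so this choice, the function name, and the toy identifier pairs are assumptions for illustration only.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of predicted links against a set of known-correct links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical (Wikidata QID, catalog identifier) pairs.
gold = {("Q1", "a"), ("Q2", "b"), ("Q3", "c"), ("Q4", "d")}
single_linker = {("Q1", "a"), ("Q2", "x"), ("Q3", "c")}
ensemble = {("Q1", "a"), ("Q2", "b"), ("Q3", "c")}

baseline_scores = precision_recall_f1(single_linker, gold)
ensemble_scores = precision_recall_f1(ensemble, gold)
```

Comparing `baseline_scores` with `ensemble_scores` on the same gold set is one simple way to check whether the ensemble improves on the previous single-linker results.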
Outcomes
- Release of soweego unity is strength (version 1.1);
- delivery of ready-to-use documentation;
- engagement of developers through the standard social coding workflow: understand, fork, make a pull request.
Community notification
- Wikidata: https://lists.wikimedia.org/pipermail/wikidata/2019-July/013263.html
- Wiki research: https://lists.wikimedia.org/pipermail/wiki-research-l/2019-July/006864.html
- Wikimedia AI: https://lists.wikimedia.org/pipermail/ai/2019-July/000277.html
Impact
- 229k confident Wikidata identifier statements created or referenced;[11]
- 124k link candidates uploaded to the Mix'n'match tool[12] for curation;[11]
- 4 pull requests submitted to the soweego code repository, under the Wikidata GitHub organization.[13]
Resources
Hjfocs will work closely with Tupini07, and supervise his MSc thesis at the University of Trento (Q930528), together with Prof. Passerini.[14] We will not receive any additional support.
The whole budget is allocated to the implementation efforts.
References
- ↑ Grants:Project/Hjfocs/soweego
- ↑ en:Machine_learning
- ↑ https://jair.org/index.php/jair/article/view/10239/24370
- ↑ http://users.rowan.edu/~polikar/RESEARCH/PUBLICATIONS/csm06.pdf
- ↑ https://www.researchgate.net/profile/Lior_Rokach/publication/220637823_Ensemble-based_classifiers/links/55bf427008aed621de122c52/Ensemble-based-classifiers.pdf
- ↑ en:Ensemble_learning
- ↑ https://soweego.readthedocs.io/en/latest/linker.html
- ↑ en:Cross-validation_(statistics)
- ↑ en:Record_linkage
- ↑ en:Decision_tree_learning
- ↑ a b This is an upper-bound estimate based on soweego version 1 output
- ↑ https://tools.wmflabs.org/mix-n-match/
- ↑ https://github.com/Wikidata/soweego
- ↑ https://disi.unitn.it/~passerini/
Endorsements
- Support Can't wait to see it in action! Sannita - not just another it.wiki sysop 18:22, 14 July 2019 (UTC)
- Strong support (disclaimer: I contributed to the development of soweego 1) I strongly endorse this proposal, because I see it as the natural next step for soweego. We implemented several algorithms, picked the one that performed best, but had to put the others aside. An ensemble would definitely smooth the cons of each algorithm, thus providing the strongest results. MaxFrax96 (talk) 12:15, 15 July 2019 (UTC)
- Sounds promising. Jonathan Groß (talk) 17:21, 16 July 2019 (UTC)
- Support --Jaqen (talk) 16:54, 18 July 2019 (UTC)
- Support Looking forward to it!