Research:ReferenceRisk
<div style="padding-top: 1em; padding-bottom: 1em;">{{Start tab
| link-1 = User:FNavas-WMF/sandbox
| tab-1 = ReferenceNeed and ReferenceRisk
| link-2 = /Measurement plan
| tab-2 = Measurement plan
| link-3 = /Testing
| tab-3 = Testing
}}
</div>

{{Research_project
<!-- When was the project page created? Do not edit this section, it will be autogenerated-->
| collaborators =
{{Investigator|[[User:Pablo (WMF)|Pablo Aragón]]|Wikimedia Foundation}}
{{Investigator|[[User:ABaigutanova-WMF|Aitolkyn Baigutanova]]|KAIST}}

<!-- What year/month did work start? ... or when will it start? -->
| start_year = 2024
| start_month = February
<!-- What year/month did work end? ... or when will it end? -->
| end_year = 2024
| end_month = September
<!-- Project status: draft -> proposed -> planned -> active -> completed -->
}}
{{nutshell|This page will hold all updates and information related to the ML scores developed by WMF Research, tentatively named ''reference need'' and ''reference risk''. These two scores seek to make it easier to understand the quality of references on Wikipedia.}}


== What is this project? ==
A typical Wikipedia article has three atomic units that combine to craft the claims we read: 1) the editor that creates the edit, 2) the edit itself, and 3) the reference that informs the edit. This project focuses on the last of the three.

Wikipedia's [[:en:Wikipedia:Verifiability|verifiability principle]] expects all editors to be responsible for the content they add, ruling that the "burden to demonstrate verifiability lies with the editor who adds or restores material". Were this edict followed to the letter, every claim across Wikipedia would be dutifully cited inline. Of course, life falls short of perfection, and it is exactly the inherently imperfect participation of the human editor that leads to change, debate and flux, creating "quality" claims and articles, by any standard, in the long term.{{Citation needed}}

Then there is the additional task of understanding the reference itself. What is in the reference? Where does it come from? Who made it? Wikipedia communities have made various efforts to lessen that task, notably [[:en:Wikipedia:Reliable_sources/Perennial_sources|the reliable sources list]].

Yet, there is no silver-bullet solution to understanding how our communities, across languages and projects, manage citation quality.[[File:Reference Need & Reference Risk.png|thumb|476x476px|A basic visualization of this ML model]]

As a collaboration between [[Wikimedia Enterprise]] and Research, with the set goal of refining and productionizing the Research team's citation quality ML model from the paper "Longitudinal Assessment of Reference Quality on Wikipedia"<ref name="www2023"/>, we seek to lessen the burden of understanding the quality of a single reference. The result will cater to everyone from individual volunteer editors to high-volume third-party reusers.

Both Research and Enterprise understand that a broad range of actors in the online knowledge environment stand to benefit from the ability to evaluate citations at scale and in near real time. Because manually inspecting sources or developing external algorithmic methods is costly and time-consuming, we would like to host a scoring model that may be leveraged by customers and the community to automatically identify low- and high-quality citation data.

== Components ==
We originally operationalized reference quality via two metrics: 1) reference need, which measures the proportion of claims in the article content that are missing citations, and 2) reference risk, which evaluates the proportion of risky references among the ones cited in an article <ref name="www2023">Aitolkyn Baigutanova, Jaehyeon Myung, Diego Saez-Trumper, Ai-Jou Chou, Miriam Redi, Changwook Jung, and Meeyoung Cha. 2023. Longitudinal Assessment of Reference Quality on Wikipedia. In Proceedings of the ACM Web Conference 2023 (WWW '23). Association for Computing Machinery, New York, NY, USA, 2831–2839. https://doi.org/10.1145/3543507.3583218</ref>. Here, we elaborate on how the two scores were modified for production. The two models are developed separately and can be used independently of each other.

=== Reference Need ===
Our first score is reference need. We fine-tune the multilingual language model mBERT to predict the probability that a sentence in an article requires a citation. With the predicted labels for each sentence, we compute the overall reference need score for the article.

The original definition of reference need is the proportion of the uncited sentences that need a citation. We slightly modify this definition to consider only the proportion of sentences needing a citation among uncited sentences. This means that if an editor added a reference to a sentence, that sentence is considered to need a citation regardless of the model output. Hence, the model prediction is only run on uncited claims, as in the sketch below.
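
To make the modified definition concrete, here is a minimal sketch of the article-level aggregation; it is not the production code, and <code>predict_need</code> is a hypothetical stand-in for the fine-tuned classifier.

<syntaxhighlight lang="python">
# Sketch: article-level reference need under the modified definition.
# predict_need() is a hypothetical stand-in for the fine-tuned mBERT model;
# it returns the probability that a sentence needs a citation.

def reference_need_score(sentences, predict_need, threshold=0.5):
    """sentences: list of (text, is_cited) pairs for one revision."""
    uncited = [text for text, is_cited in sentences if not is_cited]
    if not uncited:
        return 0.0  # every sentence already carries a citation
    # The model is only run on uncited claims; sentences that already
    # carry a reference count as needing a citation by definition and
    # are excluded from this ratio.
    needing = sum(1 for text in uncited if predict_need(text) >= threshold)
    return needing / len(uncited)
</syntaxhighlight>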

=== Reference Risk ===
Our second score tries to evaluate the quality of the cited sources themselves. However, since predicting reliability is inherently challenging, we instead focus on providing features that can assist users in making a self-assessment, ultimately leaving the decision on reliability to them. Thus, the reference risk score evaluates the likelihood that an added reference will survive on the page, which is inferred from the edit history metadata by source.

== Findings ==

=== Reference Need ===
In this work, we fine-tune a multilingual BERT model for the reference need detection task. Our model takes a wiki_db and revision ID as input and computes the reference need score for the given revision. Per sentence, the model input includes the language code, section title, the sentence itself, and the subsequent and preceding sentences in the paragraph. We trained on a sample of 20,000 sentences from featured articles of five wikis: English, Spanish, French, German, and Russian. Due to the trade-off between the accuracy and latency of the model, we limit the input context size to 128 tokens, although the maximum BERT accepts is 512. More details on the model can be found in the [[Machine_learning_models/Proposed/Multilingual_reference_need|model card]]. The test data includes 3,000 sentences sampled from a holdout set of pages in our dataset. Performance of the model on the test set is reported below:

<syntaxhighlight lang='text'>
Accuracy:  0.706
ROC-AUC:   0.781
PR-AUC:    0.783
Precision: 0.705
F1-score:  0.707
</syntaxhighlight>
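
For illustration, the per-sentence input described above might be assembled and truncated as follows. This is a sketch assuming the public <code>bert-base-multilingual-cased</code> checkpoint; the actual field order, separators, and fine-tuned weights used in production may differ.

<syntaxhighlight lang="python">
# Sketch: assembling a per-sentence input and scoring it with mBERT.
# The checkpoint name and the field order/separators are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "bert-base-multilingual-cased"  # stand-in for the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.eval()

def need_probability(lang, section, prev_sent, sent, next_sent):
    # Concatenate the contextual fields described above.
    text = " [SEP] ".join([lang, section, prev_sent, sent, next_sent])
    # Truncate to 128 tokens, matching the accuracy/latency trade-off.
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # P(needs citation)
</syntaxhighlight>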

=== Reference Risk ===
We examine historical occurrences of domains in Wikipedia articles up to the year 2024 to identify informative features. The feature we found meaningful as a reference risk indicator is the survival edit ratio: the proportion of a page's edits that a domain survives after its first addition. For example, if page A has a total of 100 revisions to date, and ‘bbc.com’ was added to page A in its 10th revision and still remains, then the survival edit ratio of ‘bbc.com’ is 90/100.
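
The arithmetic of that example can be written in a few lines; the function below is illustrative only, and its field names are assumptions rather than the production schema.

<syntaxhighlight lang="python">
# Sketch: survival edit ratio of a domain on a single page.
def survival_edit_ratio(first_added_revision, total_revisions):
    # Number of edits the domain has survived since it was first added.
    survived = total_revisions - first_added_revision
    return survived / total_revisions

# The example above: bbc.com added in revision 10 of 100, still present.
assert survival_edit_ratio(10, 100) == 90 / 100
</syntaxhighlight>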

We utilize the community-maintained perennial sources list as our ground-truth labeling. The list includes five categories: blacklisted (B), deprecated (D), generally unreliable (GU), no consensus (NC), and generally reliable (GR). We merge the first two categories into an "undesirable to use" group and the last two into a "no risk" group, leaving three groups for comparison. The distribution of four aggregations (mean, median, 25th, and 75th percentiles) of the target feature within the three groups is shown in the plots below. We observe that sources in the no-risk category tend to survive more edits on the article.

[[File:Meta rr 1.png|thumb|center]]
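
As a minimal sketch of the grouping and the four aggregations described above (the group names and input format are assumptions, not the production labels):

<syntaxhighlight lang="python">
# Sketch: collapse the five perennial-source labels into the three
# comparison groups and aggregate survival edit ratios per group.
import statistics

GROUPS = {
    "B": "undesirable",              # blacklisted
    "D": "undesirable",              # deprecated
    "GU": "generally unreliable",
    "NC": "no risk",                 # no consensus
    "GR": "no risk",                 # generally reliable
}

def group_aggregates(labeled_ratios):
    """labeled_ratios: iterable of (perennial_label, survival_edit_ratio)."""
    by_group = {}
    for label, ratio in labeled_ratios:
        by_group.setdefault(GROUPS[label], []).append(ratio)
    return {
        group: {
            "mean": statistics.mean(ratios),
            "median": statistics.median(ratios),
            # quantiles(n=4) returns the 25th, 50th, and 75th percentiles
            "p25": statistics.quantiles(ratios, n=4)[0],
            "p75": statistics.quantiles(ratios, n=4)[2],
        }
        for group, ratios in by_group.items()
    }
</syntaxhighlight>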

== What’s next? ==

* Post quarterly updates
* Build community-centered performance testing strategy

== Model Cards ==

* [[Machine_learning_models/Proposed/Language-agnostic_reference_risk|Language Agnostic Reference Risk Model Card]]
* [[Machine_learning_models/Proposed/Multilingual_reference_need|Multilingual Reference Need Model Card]]


== Related Projects ==
* [https://www.emerald.com/insight/content/doi/10.1108/OIR-02-2023-0084/full/pdf Polarization and reliability of news sources in Wikipedia]
* [https://asistdl.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/asi.24723 Gender and country biases in Wikipedia citations to scholarly publications]

== References ==
<references />