
Talk:Community Tech/OCR Improvements



Launch of project: First round of feedback (January 2021)

Hello, everyone! We have just launched the project for OCR Improvements! With this project, we aim to improve the experience of using OCR tools on Wikisource. Please refer to our project page, which provides a full summary of the project and the main problem areas that we have identified. We would love it if you could answer the questions below. Your feedback is incredibly important to us and it will directly impact the choices we make. Thank you in advance, and we look forward to reading your feedback!

Have we covered all of the main OCR tools used by Wikisource editors?

  • So far as I can tell, you've covered all the interactive OCR tools that the different language Wikisourcen recommend to their users on-site (the ones I've heard of, but I haven't researched it by any means). But note that this does not cover all use cases for OCR in connection with Wikisource. For example, there are some old shell scripts provided on enWS for adding an OCR text layer to a DjVu before upload, and at least Inductiveload and I have developed custom tools for processing a set of scan images and producing a DjVu with an OCR layer. On-wiki interactive tools represent one major category of users and uses, but the related/complementary category of users and use cases that relies on the text layer in the DjVu/PDF is not insignificant either. For this use case we're not talking about improving one tool, but rather a toolchain and infrastructure. My tool is intended to eventually become a web-based (WMCS) interactive tool to manipulate DjVu files (OCR being one part), letting power users prepare such files for other less technical users.
Preserving the fidelity of existing OCR text when extracting it from the file (on upload) and the database (when editing the page) is another pain point (text layers extracted from PDFs are notably poorer quality in MediaWiki than the same from DjVu files). For DjVu files with a structured text layer, the fidelity is also lost when stored as a blob in the metadata (imginfo, iirc), leading to needlessly deteriorated quality when extracted. And the structure provided by the text layers is not leveraged to provide advanced proofreading features (OCR text overlay on the scan image, offering mobile users single-word or single-line snippets to proofread, etc.).
When you squint just right, the pre-generated OCR files at IA and the existing text layer in a PDF or DjVu are just another source of OCR data (just like the Google Vision API, or a web-wrapped Tesseract service), and should fit into the overall puzzle too.
All this stuff falls roughly within the "OCR" umbrella term, but is outside the scope of this Wishlist task as currently construed. My suggestion is therefore to keep these use cases in mind while working on it, in order to 1) not waste effort developing functionality in this tool that is really just a workaround for something that should be fixed elsewhere, and 2) create tasks that leverage the research and experience you accumulate on other components where the real solution lies. Personally I would love to see some attention paid to the path from an upload with a text layer, through extraction and storage in the database (multiple forks / MCR, or other improved storage), to fetching and presentation (a non-imginfo-based API for ProofreadPage, or even a Gadget or user script to get at the structured text layer?). --Xover (talk) 08:21, 22 January 2021 (UTC) Reply
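A note on the structured text layer described above: DjVuLibre's djvused can dump it directly, which is one way to inspect what fidelity is being lost downstream. A minimal sketch, assuming djvused is installed and on the PATH (the file name is hypothetical):

    # Dump the hidden (OCR) text layer of a DjVu file as djvused s-expressions.
    import subprocess

    def djvu_text_layer(path: str) -> str:
        """Return the structured text layer of a DjVu file, via djvused's print-txt."""
        result = subprocess.run(
            ["djvused", path, "-e", "print-txt"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    if __name__ == "__main__":
        print(djvu_text_layer("scan.djvu")[:500])  # hypothetical file name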
    @Xover: Apologies for the late response, and thank you for explaining this! First, just so we understand correctly, why do you use these shell scripts as opposed to on-wiki OCR tools? Is it for bulk OCR? Or for better support for certain languages? We are asking because we want to understand what the current on-wiki tools are not providing, which you may get with certain off-wiki OCR tools. Also, thank you for providing detailed information on the benefits of DjVu files. Like you wrote, improving the workflow of storing OCR text in a DjVu file (which is then brought over to Commons) is probably outside the scope of the wish proposal. However, it is very useful for the team to be aware of the fact that Wikisource users also depend upon off-wiki OCR tools in some cases, as this can give us a more holistic understanding of the range of tools available. Furthermore, we encourage you to share any insights regarding how we can improve the on-wiki tools over the course of the project, since that will be our focus. Thank you so much! --IFried (WMF) (talk) 22:07, 3 March 2021 (UTC) Reply
    @IFried (WMF): The shell scripts (s:Help:DjVu files/OCR with Tesseract) are old guidance to give contributors a way to add an OCR text layer for a page. They are still occasionally used by some contributors, but are by no means a primary solution. My custom tool is primarily designed to do bulk OCR, but its use case spans a bit wider. It generates a DjVu file with an OCR text layer from a directory of scanned page images. It has three main goals: 1) to create a new OCR text layer when one is missing, of poor quality, or corrupt (cf. T219376 and T240562); 2) to improve the image quality of a DjVu file because (e.g.) IA's DjVus are excessively compressed or scaled down (and we need the highest-fidelity page images we can get in many cases); and 3) to generate specifically a DjVu file rather than other possible formats (both because MediaWiki's PDF handler does a really bad job extracting the text layer from PDFs, and because DjVu files can more easily and reliably be manipulated when we need to insert, remove, or shift pages, redact an image or other part that is copyrighted, etc.). In addition to this, the custom tool lets me control aspects of the DjVu generation (bitonal vs. DjVuPhoto) and of Tesseract. For example, since I can control the page segmentation mode and language settings, I can deal with things like s:Page:Konx Om Pax.pdf/16.
    I also have my own online OCR gadget, backed by a WMCS webservice that uses my own code (wrapping Tesseract), where I am experimenting with various features that, if they work out, may be useful: for example, a switch to automatically unwrap lines within a paragraph, including combining hyphenated words; detecting and removing the first line if it represents the page header; educating or straightening quotation marks; etc. I am also investigating the possibility of interactively selecting a portion of the page image (rectangular marquee) to OCR. This is useful for multi-column or other constellations of text where default OCR may guess incorrectly and combine lines awkwardly, or when a page contains multiple languages (for example a primarily English work that embeds passages of Indic, Hangul, Arabic, etc.). I'm also planning on offering multiple output formats from the backend service, so that those who want to make a specialized tool can ask for hOCR output. Since hOCR contains information on page geometry down to the character-box level, that would enable things like overlaying the OCR text on the page image for direct comparison, or showing just a small portion of the page (a single sentence, or word by word) on a mobile phone (where full-page proofreading is effectively impossible today). I also plan to investigate automatically adding wikimarkup to the output, but that's on hold due to Tesseract lacking support for font variants (bold, italic, etc.; which are probably the most useful things to automate, since spotting italics in particular is often hard when proofreading). --Xover (talk) 08:47, 4 March 2021 (UTC) Reply
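A rough sketch of what controlling the Tesseract options mentioned above (page segmentation mode, language) and requesting hOCR output can look like when wrapping Tesseract from Python with pytesseract. This is an illustration only, not the actual gadget or backend described in the comment; pytesseract and the file names are assumptions:

    from PIL import Image
    import pytesseract

    page = Image.open("page_016.png")  # hypothetical scan image

    # Plain text, forcing a single uniform block of text (--psm 6) and an explicit language.
    text = pytesseract.image_to_string(page, lang="eng", config="--psm 6")

    # hOCR keeps word- and line-level bounding boxes, which is what enables overlays
    # and word/line proofreading snippets.
    hocr = pytesseract.image_to_pdf_or_hocr(page, lang="eng", extension="hocr")
    with open("page_016.hocr", "wb") as f:
        f.write(hocr)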
  • For me, I see these different tools: 1/ Google OCR (User:Alex brollo/GoogleOCR.js), 2/ Tesseract OCR (User:Putnik/TesseractOCR.js), 3/ OCR (User:Ineuw/OCR.js), 4/ the native OCR — Koreller (talk) 17:55, 18 February 2021 (UTC) Reply
@Koreller: Thank you for providing these examples! --IFried (WMF) (talk) 22:08, 3 March 2021 (UTC) Reply

Have we covered the major problems experienced when using OCR tools?


RTL text

I'd like to add an issue unique to RTL languages (such as Hebrew and Arabic). On the Hebrew Wikisource, the OCR gadget often fails to render punctuation marks properly, treating them as LTR text within the general RTL text flow. This causes problems when proofreading the text, even though initially no issue is apparent to the viewer. The OCR gadget inserts erroneous BIDI markup characters around the punctuation marks.

Recommended solution: Allow the user to select the language of the document being OCRed, and make it RTL by default on RTL-language Wikisources.

--Thank you, Naḥum (talk) 09:48, 18 January 2021 (UTC) Reply
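A minimal illustration of the cleanup this implies, assuming the symptom is literally stray Unicode BIDI control characters (LRM/RLM marks, embeddings, isolates) inserted around punctuation; the function and sample string are hypothetical:

    # Strip explicit BIDI control characters so the RTL wikitext stays clean.
    BIDI_CONTROLS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"

    def strip_bidi_controls(text: str) -> str:
        """Remove LRM/RLM marks and embedding/isolate controls from OCR output."""
        return text.translate({ord(c): None for c in BIDI_CONTROLS})

    sample = "\u05e9\u05dc\u05d5\u05dd\u200e,\u200e \u05e2\u05d5\u05dc\u05dd"  # hypothetical OCR output
    print(strip_bidi_controls(sample))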
@Nahum: Thank you so much for this explanation! From our understanding, the issue is that punctuation (which should be RTL) is being expressed incorrectly as LTR in some cases. This is wrong and it makes it very difficult to proofread. We agree that this is a big problem and would like to investigate it more deeply. In that case, can you provide us some specific examples that we can look into? Thank you in advance! --IFried (WMF) (talk) 22:10, 3 March 2021 (UTC) Reply

Automatic/batch OCR for Indic OCR

  • Latin-script Wikisources have automatic/batch OCR: a bot runs the phe OCR tool and creates a new text layer for every PDF/DjVu file. But there is nothing like this for Indic-language and other non-Latin Wikisources. You may look at the phe OCR tool status page and find the Indic languages shown as running, but no text layer is created by this job. We always depend on the OCR4wikisource Python script, which breaks the standard Wikisource workflow. We want this kind of automatic/batch OCR by Google OCR/Tesseract for Indic-language Wikisources, triggered when we create an Index: namespace page.

Since last month (December 2020), the Internet Archive has started batch OCR for Indic languages with Tesseract OCR; for example, for https://archive.org/details/beng-1-1872 they have created FULL TEXT and PDF WITH TEXT. We want this kind of batch process. --Jayantanth (talk) 15:30, 20 January 2021 (UTC) Reply

@Jayantanth: Could you please get in touch with me on my user talk page at English Wikisource with 1) a link to a specific file on Commons that Phe's OCR fails on, 2) the specifics of the language and script it is in (Bengali in Indic script?), 3) a detailed description of how you invoke Phe's OCR on that file (what actions the user takes, what buttons are clicked on, etc.), and 4) as detailed a description as possible of the result you expected and the result you actually got. I have access to the Phetools Toolforge project and would like to try to debug this problem (but I have zero familiarity with any language usually represented in Indic scripts, so I will need help navigating there). --Xover (talk) 08:34, 22 January 2021 (UTC) Reply
@Jayantanth: Thank you so much for this comment! From what we gather, you are saying that Latin-script Wikisources have automatic batch OCR. However, Indic-language (and other non-Latin) Wikisources do not have a functioning automatic batch OCR tool. We think this is a really important issue to look into, and we would like to fix this. As a team, we will be investigating if we can provide bulk OCR via the Google/Wikimedia OCR tool. If we did this, would this be a good solution for you? Also, your idea about making automatic bulk OCR available upon index creation is interesting and we’ll discuss it as a team. We look forward to your feedback and thank you in advance! --IFried (WMF) (talk) 22:12, 3 March 2021 (UTC) Reply

Which OCR tools do you use the most, and why?

@Koreller: Thank you for this feedback! This is great to hear, especially since we are considering doing work to specifically improve Google OCR. In that case, what do you think are the top things that need to be improved about Google OCR? Thank you in advance! --IFried (WMF) (talk) 22:13, 3 March 2021 (UTC) Reply
  • I sometimes use Google OCR, but I mostly work on archive.org books, where I find excellent structured OCR (_djvu.xml, and recently hOCR). When dealing with texts that don't come from archive.org, I use a personal ABBYY FineReader application. --Alex brollo (talk) 09:12, 3 March 2021 (UTC) Reply
@Alex brollo: Thank you for sharing this! Can you let us know more about when you choose to use Google OCR vs. when you choose to use another tool, such as on archive.org? Perhaps you can give us some specific examples? We are asking because we hope to improve Google OCR during the project, so we would like to identify its greatest weaknesses and pain points for users at this time. Thank you in advance! --IFried (WMF) (talk) 22:14, 3 March 2021 (UTC) Reply
@IFried (WMF): I always use the IA OCR (wrapped into archive.org DjVu files or, by now, into DjVu files built by the IA Upload tool), except in rare cases where the column segmentation of the text is wrong (when the archive.org OCR engine guesses nonexistent columns, a not unusual problem in plays and texts in verse) or when I work on pages that I didn't upload personally. My present "loading style" is effective but a little difficult to explain; in brief, I download the archive.org _djvu.xml file and work on it offline, then upload the resulting text into the Page namespace by bot or with the MediaWiki Split tool.
Using Google OCR, I have noted excellent character recognition, but sometimes some words or groups of words are moved away from their proper place, which is very annoying. --Alex brollo (talk) 16:30, 4 March 2021 (UTC) Reply
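For readers unfamiliar with that workflow, a minimal sketch of reading an archive.org _djvu.xml file offline and rebuilding each page's text in reading order. The element names (OBJECT/LINE/WORD) are assumed from IA's DjVu XML dumps, and the file name is hypothetical:

    import xml.etree.ElementTree as ET

    def page_texts(djvu_xml_path: str):
        """Yield one plain-text string per page (OBJECT element) in a _djvu.xml file."""
        tree = ET.parse(djvu_xml_path)
        for page in tree.iter("OBJECT"):
            lines = []
            for line in page.iter("LINE"):
                words = [w.text or "" for w in line.iter("WORD")]
                lines.append(" ".join(words))
            yield "\n".join(lines)

    for i, text in enumerate(page_texts("book_djvu.xml"), start=1):  # hypothetical file
        print(f"== page {i} ==\n{text[:200]}\n")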

For Hebrew

  • I use a non-free (and pretty expensive) OCR software package called ABBYY FineReader. I also use it professionally (I work for an Israeli publishing house). The reason is that it is far superior to free OCR tools for Hebrew: I encounter far fewer scanning errors with ABBYY FineReader than with any free OCR tool I have tried. However, when proofreading books uploaded by others to Commons, I use the OCR gadget, because it is already there, unless the quality of the OCR is too poor to be usable.--Naḥum (talk) 09:51, 18 January 2021 (UTC) Reply
@Nahum: Thank you so much for this feedback! We are also curious to know your opinion on how Google OCR handles RTL text. Would you say that OCR Gadget does a better job than Google OCR -- and, if so, why? Also, is it possible to provide us with some examples of the superior OCR quality that you find with ABBYY FineReader over the free tools? This would help us identify problem areas and see what solutions may be possible. Thank you in advance! --IFried (WMF) (talk) 22:15, 3 March 2021 (UTC) Reply

For Indic wikisource

@Jayantanth: Thank you for this feedback! Between Indic OCR and Google OCR, which tool do you think is the best, in your experience? When do you choose to use one tool over the other? Thank you! --IFried (WMF) (talk) 22:17, 3 March 2021 (UTC) Reply

For Neapolitan wikisource

What are the most common and frustrating issues you encounter when using OCR tools?

  • the default OCR works well, but google is superior for marginal scans or non-latin characters. texts before around 1870 are harder for the OCR, introducing more errors. OCR errors tend to be systematic per work, leading some to use find and replace for repeated errors. two-column texts are a problem requiring much hand zipping Slowking4 (talk) 02:39, 18 January 2021 (UTC) Reply
@Slowking4: Thank you for this information! Overall, do you tend to use OCR Gadget ("basic OCR") or Google OCR more often, and why? Also, thank you for providing information on how Google OCR tends to work better for older scans and non-Latin languages. However, we also understand that support for older books and multi-column texts is currently lacking. For this reason, we want to analyze whether we can improve support for multi-column books. We don’t know yet, but we’ll see! We look forward to hearing your response to our question on whether you prefer OCR Gadget or Google OCR. Thank you! --IFried (WMF) (talk) 22:19, 3 March 2021 (UTC) Reply
thanks for the effort. i will use basic OCR first as it handles 2 columns better. but will test if google OCR is better for a work's scan, and then use it to get an un-proofread version, (red) to improve later. hard to determine the quality of the text layer, except by trial and error. sometimes, after saving an un-proofread version, then we will paste in a scrape from gutenberg. for works with a lot of greek and latin characters, (natural history survey books) google is better. for works with a lot of French accents, google is better also. the basic OCR seems to like modern fonts, so for older editions, (around 1870) google is better. for really bad scans, (around 1840) then neither work well, and text is from zero, like handwriting. (i.e. [1]) for tables, and math equations, neither work well, so we have to do by hand. google OCR is slower as basic loads on opening the new page, and for google you have to press the button. (this is English Wikisource) Slowking4 (talk) 23:35, 3 March 2021 (UTC) Reply
  • The OCR output for two-column or multi-column pages is not as expected. The output should be column-wise, but the actual output is line-wise: first line from the first column, first line from the second column, then second line from the first column, second line from the second column, and so on. The desired result is the first, second, third, etc. lines of the first column, followed by the second column. For dictionaries this is very important; without column recognition the OCR result is practically useless, with so many rearrangements needed to correct it. Currently we type such pages in directly. Previously the tool OCR4wikisource (by @Tshrinivasan:) had an option to specify the number of columns in the image; the tool would split the image up vertically, send the pieces to the OCR, collect the data sequentially, and then give us the desired output. Using this approach on Tamil Wikisource we have OCRed many dictionaries with two columns. But for some reason the tool no longer supports multiple columns. Is there any way to specify the columns in the other OCRs, such as Google or Indic OCR? -- Balajijagadesh (talk) 02:34, 27 January 2021 (UTC) Reply
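A bare-bones sketch of the column-splitting idea described above (not OCR4wikisource's actual code, which sent the pieces to a different OCR backend): split the page image into N vertical strips and OCR each strip separately, so the text comes out column by column. Pillow and pytesseract are assumptions for illustration, and only evenly spaced columns are handled:

    from PIL import Image
    import pytesseract

    def ocr_by_columns(image_path: str, n_columns: int, lang: str = "tam") -> str:
        """OCR a multi-column page column by column (lang 'tam' = Tamil, as an example)."""
        page = Image.open(image_path)
        width, height = page.size
        strip_width = width // n_columns
        columns = []
        for i in range(n_columns):
            box = (i * strip_width, 0, (i + 1) * strip_width, height)
            columns.append(pytesseract.image_to_string(page.crop(box), lang=lang))
        # All of column 1, then all of column 2, and so on.
        return "\n".join(columns)

    print(ocr_by_columns("dictionary_page.png", n_columns=2))  # hypothetical file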
  • I often have issues with OCR on PDF files. E.g. this file was uploaded in 2019 and it is still not possible to use the OCR button on it. Google OCR works here.

And one more problem is with dictionaries: OCR services probably use some dictionary and try to correct words according to it, but some words in older texts are always replaced with different, modern ones (old Czech texts have v pravo (on the right), which in the modern language is vpravo, so this word is not in the dictionary, and the OCR almost always (99%) writes v právo (to the law)). JAn Dudík (talk) 15:24, 28 January 2021 (UTC) Reply
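One possible mitigation, assuming the backend is Tesseract: turn off its built-in word lists so that valid old spellings such as "v pravo" are not pulled toward modern dictionary forms. load_system_dawg and load_freq_dawg are standard Tesseract config variables; pytesseract and the file name are assumptions, and results vary by engine mode:

    from PIL import Image
    import pytesseract

    page = Image.open("old_czech_page.png")  # hypothetical scan
    # Disable the system and frequent-word dictionaries for Czech ('ces').
    text = pytesseract.image_to_string(
        page,
        lang="ces",
        config="-c load_system_dawg=0 -c load_freq_dawg=0",
    )
    print(text)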

  • I wish I had an option for fraktur and the related typefaces. Such an option is currently available on German wikisource, but it is relevant for a number of European languages. I work with Danish texts, where even being able to use the "language-less" Fraktur.traineddata in tesseract would often be very helpful, even though it is not quite the right alphabet. Peter Alberti (talk) 19:17, 6 February 2021 (UTC) Reply
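A small sketch of that option, assuming a local Tesseract with Fraktur.traineddata installed in its tessdata directory and pytesseract as the wrapper; the file name is hypothetical:

    from PIL import Image
    import pytesseract

    page = Image.open("danish_fraktur_page.png")  # hypothetical scan
    text = pytesseract.image_to_string(page, lang="Fraktur")        # Fraktur model only
    # text = pytesseract.image_to_string(page, lang="dan+Fraktur")  # combined with Danish
    print(text)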
  • The most frustrating thing was when, for 1-2 months, I didn't know whether the existing OCR button (the native one) was really the basic one, because it didn't work, which forced me to find an alternative → Google OCR. Another problem is the use of the OCR tool on columns, which really doesn't work well — Koreller (talk) 17:55, 18 February 2021 (UTC) Reply
  • I would like the option of getting hOCR instead of plain text, since its text structure can be fixed using the absolute coordinates of text fragments; hOCR could also be used to get some formatting suggestions (font size, indents, centered text...) by local jQuery scripts. --Alex brollo (talk) 09:18, 3 March 2021 (UTC) Reply
  • It would be nice to have OCR just work normally on languages for which it doesn't have a dictionary to refer to. On Neapolitan Wikisource it is a disaster, and it becomes worse when dealing with texts from the 18th century or before. --Ruthven (msg) 19:42, 3 March 2021 (UTC) Reply

Which problems, overall, do you find the most critical to fix, and why?

Anything else you would like to add?

  • I think it would be useful if the new OCR tool continues to expose an API endpoint (like phetools or google-ocr) to fetch the OCR for a page, so users could build tools around it to automate tasks.
  • Providing an optional way to indicate the OCR's certainty level would be great. For example, allowing editors to switch on highlighting for the parts of the text that the OCR is less sure about. This would be a great way to allow editors to concentrate on the parts of the text that the OCR had problems with, quickly checking these parts and making appropriate fixes as necessary. --Yodin T 13:05, 31 January 2021 (UTC) Reply
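A quick sketch of how per-word confidence could drive such highlighting, using pytesseract's word-level data as a stand-in for whatever OCR engine the tool ends up using; the threshold, span class, and file name are hypothetical:

    from PIL import Image
    import pytesseract

    def highlight_uncertain(image_path: str, threshold: int = 70) -> str:
        """Wrap words the OCR is less sure about in a marker span for later review."""
        page = Image.open(image_path)
        data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
        out = []
        for word, conf in zip(data["text"], data["conf"]):
            if not word.strip():
                continue  # skip empty/structural entries
            if int(float(conf)) < threshold:
                out.append(f'<span class="ocr-unsure">{word}</span>')
            else:
                out.append(word)
        return " ".join(out)

    print(highlight_uncertain("page.png"))  # hypothetical file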
  • Recognition of texts with columns is rarely good; maybe there should be a system to manually choose several areas of the image to be recognized, which would facilitate this recognition. — Koreller (talk) 17:55, 18 February 2021 (UTC) Reply
  • The existing tool (and also the google-ocr one) fails when the text is set in columns. Even if this is not frequent, it is quite annoying, since this kind of text is also quite tough for many external programs. Here are a few examples of pages sharing this layout: , , and . Please notice the different numbers of columns, the rules or empty space between columns, and the headers. So it would be nice if the coming tool could cope with these problematic pages. Draco flavus (talk) 19:28, 2 March 2021 (UTC) Reply
