Talk:Community Tech/OCR Improvements
Launch of project: First round of feedback (January 2021)
Hello, everyone! We have just launched the project for OCR Improvements! With this project, we aim to improve the experience of using OCR tools on Wikisource. Please refer to our project page, which provides a full summary of the project and the main problem areas we have identified. We would love it if you could answer the questions below. Your feedback is incredibly important to us, and it will directly impact the choices we make. Thank you in advance; we look forward to reading your feedback!
Have we covered all of the main OCR tools used by Wikisource editors?
- So far as I can tell, you've covered all the interactive OCR tools that the different language Wikisourcen recommend to their users on-site (at least the ones I've heard of, though I haven't researched it by any means). But note that this does not cover all use cases for OCR in connection with Wikisource. For example, there are some old shell scripts provided on enWS for adding an OCR text layer to a DjVu before upload, and at least Inductiveload and I have developed custom tools for processing a set of scan images and producing a DjVu with an OCR layer. On-wiki interactive tools represent one major category of users and uses, but the related/complementary category of users and use cases that relies on the text layer in the DjVu/PDF is not insignificant either. For this use case we're not talking about improving one tool, but rather a toolchain and infrastructure. My tool is intended to eventually become a web-based (WMCS) interactive tool to manipulate DjVu files (OCR being one part), letting power users prepare such files for other, less technical users.

Preserving the fidelity of existing OCR text when extracting it from the file (on upload) and the database (when editing the page) is another pain point (text layers extracted from PDFs are of notably poorer quality in MediaWiki than the same from DjVu files). For DjVu files with a structured text layer, the fidelity is also lost when it is stored as a blob in the metadata (imginfo, iirc), leading to needlessly degraded quality when extracted. And the structure provided by the text layers is not leveraged to provide advanced proofreading features (OCR text overlay on the scan image, offering mobile users single-word or single-line snippets to proofread, etc.).

When you squint just right, the pre-generated OCR files at IA and the existing text layer in a PDF or DjVu are just another source of OCR data (just like the Google Vision API or a web-wrapped Tesseract service), and should fit into the overall puzzle too.

All this stuff falls roughly within the "OCR" umbrella term, but is outside the scope of this Wishlist task as currently construed. My suggestion is therefore to keep these use cases in mind while working on it, in order to 1) not waste effort developing functionality in this tool that is really just a workaround for something that should be fixed elsewhere, and 2) create tasks that leverage the research and experience you accumulate on other components where the real solution lies. Personally, I would love to see some attention paid to the path from an upload with a text layer, through extraction and storage in the database (multiple forks / MCR, or other improved storage), to fetching and presentation (a non-imginfo-based API for ProofreadPage, or even a Gadget or user script to get at the structured text layer?). --Xover (talk) 08:21, 22 January 2021 (UTC)
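As a concrete illustration of the last point, treating an existing DjVu text layer as "just another source of OCR data", here is a minimal sketch that pulls the structured text layer out of a file. It assumes DjVuLibre's djvused is installed; the file name is a placeholder.

```python
# Minimal sketch: read a DjVu's embedded, structured text layer so it can
# be treated as one more OCR source. Assumes DjVuLibre's djvused is on
# PATH; "scan.djvu" is a placeholder file name.
import subprocess

def djvu_text_layer(path: str, page: int) -> str:
    """Return the structured (s-expression) text layer for one page."""
    # 'print-txt' keeps the page/line/word structure and bounding boxes,
    # rather than flattening the layer into a plain-text blob.
    result = subprocess.run(
        ["djvused", "-e", f"select {page}; print-txt", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(djvu_text_layer("scan.djvu", 1))
```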
- A script is available in pywikibot that queries phetools or google-ocr and uploads the text for a given Index page (and its related file): https://github.com/wikimedia/pywikibot/blob/master/scripts/wikisourcetext.py.
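For anyone who wants to script this workflow themselves, here is a rough sketch using pywikibot's proofreadpage module. It assumes a configured pywikibot installation; the page title is a placeholder, and the exact names of the OCR backends may vary between pywikibot versions.

```python
# Rough sketch: fetch OCR text for a Page: and save it, in the spirit of
# wikisourcetext.py. Assumes a configured pywikibot; the page title is a
# placeholder, and backend names ("phetools", "googleOCR") may differ by
# pywikibot version.
import pywikibot
from pywikibot.proofreadpage import ProofreadPage

site = pywikibot.Site("en", "wikisource")
page = ProofreadPage(site, "Page:Example.djvu/1")  # hypothetical page

text = page.ocr(ocr_tool="googleOCR")  # or "phetools"
page.body = text
page.save(summary="Filling page with OCR text")
```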
- For me, I see these different tools: 1/ Google OCR (User:Alex brollo/GoogleOCR.js) 2/ Tesseract OCR (User:Putnik/TesseractOCR.js) 3/ OCR (User:Ineuw/OCR.js) 4/ the native OCR — Koreller (talk) 17:55, 18 February 2021 (UTC)
Have we covered the major problems experienced when using OCR tools?
RTL text
I'd like to add an issue unique to RTL languages (such as Hebrew and Arabic). On the Hebrew Wikisource, the OCR gadget often fails to render punctuation marks properly, treating them as LTR text within the general RTL text flow. This causes problems when proofreading the text, even though initially no issue is apparent to the viewer. The OCR gadget inserts erroneous BIDI markup characters around the punctuation marks.
Recommended solution: allow the user to select the language of the document being OCRed, and make RTL the default on RTL-language Wikisources.
- --Thank you, Naḥum (talk) 09:48, 18 January 2021 (UTC)
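A minimal sketch of what a cleanup for the problem described above could look like: stripping the explicit BIDI control characters so the Unicode BIDI algorithm can order punctuation from the surrounding RTL context. The exact set of characters the gadget emits is an assumption.

```python
# Minimal sketch: remove explicit BIDI markup from OCR output. The set of
# offending control characters is assumed; adjust to what the gadget emits.
BIDI_CONTROLS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"  # LRM, RLM, LRE, RLE, PDF, LRO, RLO

def strip_bidi_controls(ocr_text: str) -> str:
    """Drop explicit directional marks, letting the surrounding RTL
    context determine how punctuation is rendered."""
    return ocr_text.translate({ord(c): None for c in BIDI_CONTROLS})

print(strip_bidi_controls("\u202a1912\u202c, example"))
```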
Automatic/batch OCR for Indic OCR
- On Latin-script Wikisources there is automatic/batch OCR: a bot runs the phe OCR tool and creates a new text layer for every PDF/DjVu file. But there is nothing like this on Indic-language Wikisources and other non-Latin Wikisources. If you check the phe OCR tool status page, Indic languages are shown as running, but no text layer is created by this job. We always depend on the OCR4wikisource Python script, which breaks the standard Wikisource workflow. We want this kind of automatic/batch OCR, via GoogleOCR/Tesseract, for Indic-language Wikisources, triggered when we create an Index: namespace page.
Since last month (Dec 2020), the Internet Archive has been doing batch OCR for Indic scripts with Tesseract; for example, https://archive.org/details/beng-1-1872, where they have created FULL TEXT and PDF WITH TEXT. We want this kind of batch process. --Jayantanth (talk) 15:30, 20 January 2021 (UTC)
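For illustration, here is a rough sketch of the kind of batch job being requested, running Tesseract over every page image of a work. It assumes pytesseract, Pillow, and the Bengali ("ben") traineddata are installed; the directory layout is hypothetical.

```python
# Rough sketch: batch-OCR all page images of a work with Tesseract.
# Assumes pytesseract + Pillow and the "ben" (Bengali) traineddata;
# the directory name is a placeholder.
from pathlib import Path
from PIL import Image
import pytesseract

def batch_ocr(pages_dir: str, lang: str = "ben") -> dict:
    """Map each page image file name to its OCR text."""
    texts = {}
    for page in sorted(Path(pages_dir).glob("*.png")):
        texts[page.name] = pytesseract.image_to_string(Image.open(page), lang=lang)
    return texts

for name, text in batch_ocr("beng-1-1872/").items():
    print(name, text[:80])
```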
- @Jayantanth: Could you please get in touch with me on my user talk page at English Wikisource with 1) a link to a specific file on Commons that Phe's OCR fails on, 2) the specifics of the language and script it is in (Bengali in Indic script?), 3) a detailed description of how you invoke Phe's OCR on that file (what actions the user takes, what buttons are clicked on, etc.), and 4) as detailed a description as possible of the result you expected and the result you actually got. I have access to the Phetools Toolforge project and would like to try to debug this problem (but I have zero familiarity with any language usually represented in Indic scripts, so I will need help navigating there). --Xover (talk) 08:34, 22 January 2021 (UTC)
Which OCR tools do you use the most, and why?
- I use Google OCR because it is significantly faster than the other OCR tools I have, and I find its results better than the others' — Koreller (talk) 17:55, 18 February 2021 (UTC)
- I sometimes use Google OCR, but I mostly work on archive.org books, where I find an excellent structured OCR (_djvu.xml, and recently hOCR). When dealing with texts that don't come from archive.org, I use a personal ABBYY FineReader application. --Alex brollo (talk) 09:12, 3 March 2021 (UTC)
For Hebrew
- I use a non-free (and pretty expensive) OCR application called ABBYY FineReader. I also use it professionally (I work for an Israeli publishing house). The reason is that it is far superior to free OCR tools for Hebrew: I encounter far fewer scanning errors with ABBYY FineReader than with any free OCR tool I have tried. However, when proofreading books uploaded by others to Commons, I use the OCR gadget, because it is already there, unless the quality of the OCR is too poor to be usable. --Naḥum (talk) 09:51, 18 January 2021 (UTC)
For Indic wikisource
- In most cases, community members use IndicOCR/GoogleOCR with one click, or mass OCR via the OCR4wikisource Python script. Jayantanth (talk) 15:32, 20 January 2021 (UTC)
What are the most common and frustrating issues you encounter when using OCR tools?
- The default OCR works well, but Google is superior for marginal scans or non-Latin characters. Texts from before around 1870 are harder for the OCR, introducing more errors. OCR errors tend to be systematic per work, leading some to use find-and-replace for repeated errors. Two-column texts are a problem, requiring much hand zipping. Slowking4 (talk) 02:39, 18 January 2021 (UTC)
- We have to click the OCR button and then wait for the OCR; sometimes the OCR output is not good. Jayantanth (talk) 15:41, 20 January 2021 (UTC)
- The OCR output for two-column or multi-column text is not as expected. The output should be column-wise, but the actual output is line-wise: the first line from the first column, then the first line from the second column, then the second line from the first column, then the second line from the second column. The desired result is the first, second, third, etc. lines from the first column, followed by the second column. For dictionaries this is very important; without column recognition the OCR result is practically useless, requiring so many rearrangements to correct it that currently we type such pages directly. Previously, the tool OCR4wikisource (by @Tshrinivasan:) had a feature to specify the number of columns in the image: the tool would split the image vertically, send the pieces to the OCR, collect the results sequentially, and give us the desired output. Using this approach on Tamil Wikisource we have OCRed many dictionaries with two columns. But for some reason the tool no longer supports multi-column mode. Is there any way to specify the columns in other OCRs, like Google or Indic OCR? -- Balajijagadesh (talk) 02:34, 27 January 2021 (UTC)
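A minimal sketch of the column-splitting approach described above, assuming pytesseract and Pillow; the equal-width split and the file name are simplifying assumptions, since real column boundaries are rarely this even.

```python
# Minimal sketch: split a page image vertically into N equal columns,
# OCR each strip in reading order, then join the results. Equal widths
# are a simplifying assumption; "tam" is Tesseract's Tamil model.
from PIL import Image
import pytesseract

def ocr_columns(path: str, columns: int, lang: str = "eng") -> str:
    img = Image.open(path)
    width, height = img.size
    col_width = width // columns
    texts = []
    for i in range(columns):
        strip = img.crop((i * col_width, 0, (i + 1) * col_width, height))
        texts.append(pytesseract.image_to_string(strip, lang=lang))
    return "\n".join(texts)  # all of column 1, then column 2, and so on

print(ocr_columns("dictionary_page.png", columns=2, lang="tam"))
```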
- I often have issues with OCR on PDF files. E.g. this file was uploaded in 2019 and it is still not possible to use the OCR button on it; Google OCR works here.
And one more problem is with dictionaries: OCR services probably use a dictionary and try to correct words according to it, but some words in older texts are always replaced with different modern ones. In old Czech texts, v pravo means "on the right"; in the modern language this is vpravo, so the old form is not in the dictionary, and the OCR almost always (99%) writes v právo ("to the law") instead. JAn Dudík (talk) 15:24, 28 January 2021 (UTC)
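For what it's worth, Tesseract itself can be told not to use its built-in word lists, which is one possible mitigation for the auto-"correction" described above; whether the on-wiki gadgets expose anything similar is an open question. A sketch, assuming pytesseract and the Czech ("ces") model:

```python
# Sketch: disable Tesseract's dictionary-based word correction so archaic
# spellings are not pulled toward modern ones. The config variables are
# Tesseract's; the file name is a placeholder.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(
    Image.open("old_czech_page.png"),
    lang="ces",
    config="-c load_system_dawg=0 -c load_freq_dawg=0",  # no word lists
)
print(text)
```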
- I wish I had an option for Fraktur and related typefaces. Such an option is currently available on German Wikisource, but it is relevant for a number of European languages. I work with Danish texts, where even being able to use the "language-less" Fraktur.traineddata in Tesseract would often be very helpful, even though it is not quite the right alphabet. Peter Alberti (talk) 19:17, 6 February 2021 (UTC)
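For anyone wanting to try this outside the gadgets, a small sketch of the workaround mentioned above, assuming the Fraktur traineddata has been installed into the tessdata directory (the exact model name can vary between Tesseract data packages):

```python
# Sketch: OCR a page with the script-level Fraktur model via pytesseract.
# Assumes Fraktur.traineddata is installed; the file name is a placeholder.
from PIL import Image
import pytesseract

print(pytesseract.image_to_string(Image.open("danish_page.png"), lang="Fraktur"))
```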
- The most frustrating thing was that, for one or two months, I didn't know whether the existing OCR button (the native one) was really broken, because it simply didn't work; this forced me to find an alternative → Google OCR. The other problem is using the OCR tool on columns, which really doesn't work well — Koreller (talk) 17:55, 18 February 2021 (UTC)
- I would like the option of getting hOCR instead of plain text, since its text structure can be fixed using the absolute coordinates of text fragments; hOCR could also be used to derive some formatting suggestions (font-size, indents, centered text...) via local jQuery scripts. --Alex brollo (talk) 09:18, 3 March 2021 (UTC)
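As an illustration of the kind of formatting suggestion hOCR makes possible, here is a rough sketch that flags lines with roughly symmetric margins as candidates for centering. The hOCR class and title attributes follow the hOCR spec; the page width and tolerance are assumptions.

```python
# Rough sketch: use hOCR line bounding boxes to suggest centered lines.
# PAGE_WIDTH and TOLERANCE are assumed values; ".ocr_line" and "bbox"
# come from the hOCR format.
from bs4 import BeautifulSoup

PAGE_WIDTH = 2400  # assumed scan width in pixels
TOLERANCE = 40     # slack, in pixels, when comparing margins

def centered_lines(hocr: str):
    soup = BeautifulSoup(hocr, "html.parser")
    for line in soup.select(".ocr_line"):
        # title looks like: "bbox 220 310 2180 360; baseline ..."
        coords = line["title"].split(";")[0].split()[1:]
        x0, _, x1, _ = map(int, coords)
        if abs(x0 - (PAGE_WIDTH - x1)) < TOLERANCE:
            yield line.get_text(strip=True)

for text in centered_lines(open("page.hocr").read()):
    print("possibly centered:", text)
```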
Which problems, overall, do you find the most critical to fix, and why?
Anything else you would like to add?
- I think it would be useful if the new OCR tool continued to expose an API endpoint (like phetools or google-ocr) for fetching the OCR of a page, so users could build tools around it to automate tasks.
- Providing an optional way to indicate the OCR's certainty level would be great. For example, allowing editors to switch on highlighting for the parts of the text that the OCR is less sure about. This would be a great way to allow editors to concentrate on the parts of the text that the OCR had problems with, quickly checking these parts and making appropriate fixes as necessary. --Yodin T 13:05, 31 January 2021 (UTC)
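Tesseract already reports a per-word confidence that such highlighting could be built on; here is a minimal sketch via pytesseract's image_to_data, where the 70% threshold and the file name are assumptions.

```python
# Minimal sketch: list the words Tesseract is least sure about, using the
# per-word confidence from image_to_data. The threshold is an assumption.
from PIL import Image
import pytesseract
from pytesseract import Output

data = pytesseract.image_to_data(Image.open("page.png"), output_type=Output.DICT)
for word, conf in zip(data["text"], data["conf"]):
    if word.strip() and float(conf) < 70:
        print(f"low confidence ({conf}): {word}")
```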
- Recognition of texts with columns is rarely good; maybe there should be a way to manually select several areas of the image to be recognized, which would make this kind of recognition easier. — Koreller (talk) 17:55, 18 February 2021 (UTC)
- The existing tool (and also google-ocr) fails when the text is set in columns. Even if it is not frequent, it is quite annoying, since this kind of text is also quite tough for many external programs. Here are a few examples of pages sharing this layout: , , and . Please notice the different numbers of columns, the rulers or empty space between columns, and the headers. It would be nice if the coming tool could cope with these problematic pages. Draco flavus (talk) 19:28, 2 March 2021 (UTC)