
Talk:Community Tech/OCR Improvements

This is an archived version of this page, as edited by Ineuw (talk | contribs) at 20:17, 15 April 2021 (Current version is very good but some tweaks are needed.: new section). It may differ significantly from the current version.

Launch of project: First round of feedback (January 2021)

Hello, everyone! We have just launched the project for OCR Improvements! With this project, we aim to improve the experience of using OCR tools on Wikisource. Please refer to our project page, which provides a full summary of the project and the main problem areas that we have identified. We would love it if you could answer the questions below. Your feedback is incredibly important to us and it will directly impact the choices we make. Thank you in advance, and we look forward to reading your feedback!

Have we covered all of the main OCR tools used by Wikisource editors?

Latest comment: 3 years ago · 8 comments · 5 people in discussion
  • So far as I can tell you've covered all the interactive OCR tools that the different language Wikisourcen recommend to their users on-site (the ones I've heard of, but I haven't researched it by any means). But note that this does not cover all use cases for OCR in connection with Wikisource. For example, there are some old shell scripts provided on enWS for adding an OCR text layer to a DjVu before upload, and at least Inductiveload and I have developed custom tools for processing a set of scan images and producing a DjVu with an OCR layer. On-wiki interactive tools represent one major category of users and uses, but the related/complementary category of users and use cases that relies on the text layer in the DjVu/PDF is not insignificant either. For this use case we're not talking about improving one tool, but rather a toolchain and infrastructure. My tool is intended to eventually become a web-based (WMCS) interactive tool to manipulate DjVu files (OCR being one part), letting power users prepare such files for other less technical users. Preserving fidelity of existing OCR text when extracting it from the file (on upload) and the database (when editing the page) is another pain point (text layers extracted from PDFs are notably poorer quality in MediaWiki than the same from DjVu files). For DjVu files with a structured text layer, the fidelity is also lost when stored as a blob in the metadata (imginfo, iirc), leading to needlessly deteriorated quality when extracted. And the structure provided by the text layers is not leveraged to provide advanced proofreading features (OCR text overlay on the scan image, offering mobile users single-word or single-line snippets to proofread, etc.). When you squint just right, the pre-generated OCR files at IA and the existing text layer in a PDF or DjVu are just another source of OCR data (just like the Google Vision API, or a web-wrapped Tesseract service), and should fit into the overall puzzle too. All this stuff falls roughly within the "OCR" umbrella term, but is outside the scope of this Wishlist task as currently construed. My suggestion is therefore to keep these use cases in mind while working on it in order to 1) not waste effort developing functionality in this tool that is really just a workaround for something that should be fixed elsewhere, and 2) create tasks that leverage the research and experience you accumulate on other components where the real solution lies. Personally I would love to see some attention paid to the path from an upload with a text layer, through extraction and storage in the database (multiple forks / MCR, or other improved storage), to fetching and presentation (a non-imginfo-based API for ProofreadPage, or even a Gadget or user script to get at the structured text layer?). --Xover (talk) 08:21, 22 January 2021 (UTC) Reply
    @Xover: Apologies for the late response, and thank you for explaining this! First, just so we understand correctly, why do you use these shell scripts as opposed to on-wiki OCR tools? Is it for bulk OCR? Or for better support for certain languages? We are asking because we want to understand what the current on-wiki tools are not providing, which you may get with certain off-wiki OCR tools. Also, thank you for providing detailed information on the benefits of DjVU files. Like you wrote, improving the workflow of storing OCR text in a DjVu file (which is then brought over to Commons) is probably outside the scope of the wish proposal. However, it is very useful for the team to be aware of the fact that Wikisource users also depend upon off-wiki OCR tools in some cases, as this can give us a more holistic understanding of the range of tools available. Furthermore, we encourage you to share any insights regarding how we can improve the on-wiki tools over the course of the project, since that will be our focus. Thank you so much! --IFried (WMF) (talk) 22:07, 3 March 2021 (UTC) Reply
    @IFried (WMF): The shell scripts (s:Help:DjVu files/OCR with Tesseract) are old guidance to give contributors a way to add an OCR text layer for a page. They are still occasionally used by some contributors, but are by no means a primary solution. My custom tool is primarily designed to do bulk OCR, but its use case spans a bit wider. It generates a DjVu file with an OCR text layer from a directory of scanned page images. It has three main goals: 1) to create a new OCR text layer when one is missing, of poor quality, or corrupt (cf. T219376 and T240562); 2) to improve the image quality of a DjVu file because (e.g.) IA's DjVus are excessively compressed or scaled down (and we need the highest-fidelity page images we can get in many cases); and 3) to generate specifically a DjVu file rather than other possible formats (both because MediaWiki's PDF handler does a really bad job extracting the text layer from PDFs, and because DjVu files can more easily and reliably be manipulated when we need to insert, remove, or shift pages, redact an image or other part that is copyrighted, etc.). In addition to this, the custom tool lets me control aspects of the DjVu generation (bitonal vs. DjVuPhoto) and of Tesseract. For example, since I can control page segmentation mode and language settings, I can deal with things like s:Page:Konx Om Pax.pdf/16. I also have my own online OCR gadget, backed by a WMCS webservice that uses my own code (wrapping Tesseract), where I am experimenting with various features that, if they work out, may be useful: for example, a switch to automatically unwrap lines within a paragraph, including combining hyphenated words; detecting and removing the first line if it represents the page header; educating or straightening quotation marks; etc. I am also investigating the possibility of interactively selecting a portion of the page image (rectangular marquee) to OCR. This is useful for multi-column or other constellations of text where default OCR may guess incorrectly and combine lines awkwardly, or when a page contains multiple languages (for example a primarily English work that embeds passages of Indic, Hangul, Arabic, etc.). I'm also planning on offering multiple output formats from the backend service, so that those who want to make a specialized tool can ask for hOCR output. Since hOCR contains information on page geometry down to the character-box level, that would enable things like overlaying the OCR text on the page image for direct comparison, or showing just a small portion of the page (a single sentence, or word by word) on a mobile phone (where full-page proofreading is effectively impossible today). I also plan to investigate automatically adding wikimarkup to the output, but that's on hold due to Tesseract lacking support for font variants (bold, italic, etc.; which are probably the most useful things to automate, since spotting italics in particular is often hard when proofreading). --Xover (talk) 08:47, 4 March 2021 (UTC) Reply
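A minimal sketch (not Xover's actual tool) of the kind of control described above: explicit language and page segmentation settings, with hOCR output. It assumes Tesseract, pytesseract, and Pillow are installed; the file names and settings are placeholders.

```python
# Sketch: drive Tesseract with an explicit language and page segmentation
# mode, and request hOCR output. File names are hypothetical.
from PIL import Image
import pytesseract

page = Image.open("page_0016.png")  # hypothetical scan image

# --psm 4 assumes a single column of text of variable sizes; adjust per page.
hocr_bytes = pytesseract.image_to_pdf_or_hocr(
    page,
    extension="hocr",
    lang="eng",          # or e.g. "eng+ara" for mixed-language pages
    config="--psm 4",
)

with open("page_0016.hocr", "wb") as f:
    f.write(hocr_bytes)
```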
  • @Xover and IFried (WMF): This conversation gives me an idea. What if we were to write our own version of Google reCAPTCHA using the precise word positions from hOCR? The New York Times used Google reCAPTCHA to digitize its own back issues [1]. We could enable the Wiki Recaptcha for all non-user edits and make the vandals help us proofread books. We could also allow users to toggle Wiki Recaptcha if they want to help us proofread one word at a time. Certainly, we'd still have to add formatting and do a proofread, but it could really help, especially with older texts. Languageseeker (talk) 05:29, 12 March 2021 (UTC) Reply
  • For me, I see these different tools: 1/ Google OCR (User:Alex brollo/GoogleOCR.js) 2/ Tesseract OCR (User:Putnik/TesseractOCR.js) 3/ OCR (User:Ineuw/OCR.js) 4/ the native OCR — Koreller (talk) 17:55, 18 February 2021 (UTC) Reply
@Koreller: Thank you for providing these examples! --IFried (WMF) (talk) 22:08, 3 March 2021 (UTC) Reply
  • On frwikisource, because of the long-term unavailability of the (Tesseract) OCR, we got into the habit of asking a contributor who has ABBYY as a personal OCR to OCR books before importing them to the project: the OCR results are better, but it means relying on the goodwill of a single contributor to get good OCR. - Personally, I also have the habit of uploading (PD) books to Archive.org, to get them OCRed by ABBYY, then importing them to Commons through IAupload, which converts the files to DjVu on the fly, but it's too complex a process to rely on for new contributors, who simply want to upload a book and correct it. If ABBYY could be made available (maybe with conditions) as a subsidiary tool in the import process of a file, it could be a real improvement. --Hsarrazin (talk) 08:36, 9 March 2021 (UTC) Reply
    @Hsarrazin: Thank you so much for this feedback! We appreciate you explaining in detail why some users have turned to ABBYY, as well as the complications in such a workflow. We don’t know if we can add ABBYY (since it is normally a paid service), but we will investigate if it is possible. Meanwhile, we can aim to improve accessibility of Tesseract. In fact, we are now working to add Tesseract to Wikimedia/Google OCR. So, we have one follow-up question: Can you let us know if Tesseract is still unavailable for you? If so, can you provide more details on how it is not working for you? Thanks! --IFried (WMF) (talk) 18:10, 15 April 2021 (UTC) Reply

Have we covered the major problems experienced when using OCR tools?

Latest comment: 3 years ago · 9 comments · 6 people in discussion

RTL text

I'd like to add an issue unique to RTL languages (such as Hebrew & Arabic). On the Hebrew Wikisource, the OCR gadget often fails to render punctuation marks properly, treating them as LTR text within the general RTL text flow. This causes problems when proofreading the text, even though initially no issue is apparent to the viewer. The OCR gadget inserts erroneous BIDI markup characters around the punctuation marks.

Recommended solution: Allow the user to select the language of the document being OCRed, and make it RTL by default on RTL-language Wikisources.

--Thank you, Naḥum (talk) 09:48, 18 January 2021 (UTC) Reply
@Nahum: Thank you so much for this explanation! From our understanding, the issue is that punctuation (which should be RTL) is being expressed incorrectly as LTR in some cases. This is wrong and it makes it very difficult to proofread. We agree that this is a big problem and would like to investigate it more deeply. In that case, can you provide us some specific examples that we can look into? Thank you in advance! --IFried (WMF) (talk) 22:10, 3 March 2021 (UTC) Reply
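One possible fix for the symptom described above, sketched here as an assumption rather than as the gadget's actual behaviour, is to strip the stray Unicode bidirectional control characters from the OCR output before it reaches the edit box:

```python
# Sketch: remove BIDI control characters (LRM, RLM, and the explicit
# embedding/override/isolate controls) so Hebrew/Arabic text flows naturally.
import re

BIDI_CONTROLS = re.compile("[\u200e\u200f\u202a-\u202e\u2066-\u2069]")

def strip_bidi_controls(text: str) -> str:
    """Remove BIDI control characters from OCR output."""
    return BIDI_CONTROLS.sub("", text)

# Example: a Hebrew fragment with a left-to-right mark glued to a comma.
print(strip_bidi_controls("שלום\u200e, עולם"))  # -> "שלום, עולם"
```

Blanket removal is a blunt instrument, so a real fix would more likely live in the OCR backend, for example by letting the user pick the document language as Naḥum suggests.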

Automatic/batch OCR for Indic OCR

  • Latin-language Wikisources have automatic/batch OCR: a bot runs the phe OCR tool and creates a new text layer for every PDF/DjVu file. But there is nothing like this on the Indic-language and other non-Latin Wikisources. You may check the phe OCR tool status page and find the Indic languages shown as running, but no text layer is created by this job. We always depend on the OCR4wikisource Python script, which breaks the standard Wikisource workflow. We want this kind of automatic/batch OCR, by Google OCR/Tesseract, for the Indic-language Wikisources when we create an Index: namespace page.

Since last month, December 2020, the Internet Archive has run batch OCR for Indic languages with Tesseract; for example https://archive.org/details/beng-1-1872, where they have created FULL TEXT and PDF WITH TEXT versions. We want this kind of batch process. --Jayantanth (talk) 15:30, 20 January 2021 (UTC) Reply

@Jayantanth: Could you please get in touch with me on my user talk page at English Wikisource with 1) a link to a specific file on Commons that Phe's OCR fails on, 2) the specifics of the language and script it is in (Bengali in Indic script?), 3) a detailed description of how you invoke Phe's OCR on that file (what actions the user takes, what buttons are clicked on, etc.), and 4) as detailed a description as possible of the result you expected and the result you actually got. I have access to the Phetools Toolforge project and would like to try to debug this problem (but I have zero familiarity with any language usually represented in Indic scripts so I will need help navigating there). --Xover (talk) 08:34, 22 January 2021 (UTC) Reply
@Jayantanth: Thank you so much for this comment! From what we gather, you are saying that Latin language Wikisources have an automatic batch OCR. However, Indic language (and other non-Latin language) Wikisources do not have a functioning automatic batch OCR tool. We think this is a really important issue to look into, and we would like to fix this. As a team, we will be investigating if we can provide bulk OCR via the Google/Wikimedia OCR tool. If we did this, would this be a good solution for you? Also, your idea about making automatic bulk OCR available upon index creation is interesting and we’ll discuss it as a team. We look forward to your feedback and thank you in advance! --IFried (WMF) (talk) 22:12, 3 March 2021 (UTC) Reply
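For illustration only (this is neither the phe tool nor OCR4wikisource), a minimal sketch of batch-OCRing a directory of page images with Tesseract's Bengali model, writing one text file per page; the paths are placeholders and the "ben" traineddata must be installed:

```python
# Sketch: batch OCR of a directory of page images with Tesseract's Bengali
# model via pytesseract. Directory names are hypothetical.
from pathlib import Path
from PIL import Image
import pytesseract

pages_dir = Path("scans/beng-1-1872")   # hypothetical directory of page images
out_dir = Path("ocr-text")
out_dir.mkdir(exist_ok=True)

for image_path in sorted(pages_dir.glob("*.png")):
    text = pytesseract.image_to_string(Image.open(image_path), lang="ben")
    (out_dir / (image_path.stem + ".txt")).write_text(text, encoding="utf-8")
    print(f"OCRed {image_path.name}: {len(text)} characters")
```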

Not so visionary

You have indeed described "the major problems experienced when using OCR tools", but this is not enough. Some problems that need to be addressed, now or later, have not yet been fully experienced.

  • When OCR is mostly correct, but columns are misinterpreted, e.g. lines are interleaved across columns, the OCR tool user interface should show where the columns are, so that the user can redraw the columns and then ask for a new OCR within the new column definitions. (When you run ABBYY Finereader Professional on a stand-alone PC, this is part of its user interface.) For this to be implemented, the OCR software needs to output where its columns are.
    Agree, adding that not only lines interleaved across columns but also pictures may be a problem. With interleaved lines like titles (see e. g. here) the columns usually follow thus: left column above the title–right column above the title–the title–left column below–right column below. With pictures (see e. g. here) the succession of columns is usually different: left column above the picture–left column below–right above–right below. Enabling the user to redefine the columns would really help when proofreading newspapers and magazines. --Jan Kameníček (talk) 09:28, 9 March 2021 (UTC) Reply
@Jan.Kamenicek: Thank you so much for this feedback! We also agree that the multiple-columns issue is a pain point for users. For this reason, we have launched an investigation to see how the issue can be improved (T277192), which the team engineers are looking into. Meanwhile, we have heard from other users about the benefits of ABBYY. We don’t know if we can add it, since it is normally a paid service, but we will investigate. If we have any updates on this issue, we’ll be sure to share them on the project page. Thank you again! --IFried (WMF) (talk) 18:12, 15 April 2021 (UTC) Reply
  • OCR needs to work well on very large pages with many columns, e.g. newspapers. The Internet Archive announced in August 2020 that they are exploring this.
  • When pages are proofread, the words that are corrected need to be fed back into the OCR process. If the OCR text contains bam because barn was missing from its dictionary, my correction of bam to barn should feed barn into the OCR dictionary. Other pages with OCR text that contains bam, and that have been OCRed with the same old dictionary, also need to be updated. This requires a whole new level of bookkeeping. The problem is that we regard the OCR process as an unknown black box. We don't fully control which dictionary it uses. For this to work, we need to know a lot more about how the OCR process works.

--LA2 (talk) 18:41, 4 March 2021 (UTC) Reply

@LA2: Thank you for this comment! First, regarding looking into multiple-column support, we have already begun investigating this issue (T277192), and we’ll try to look into what the Internet Archive is doing as well. Second, regarding your suggestion about updating the dictionary, this is probably out of the scope of the project, unfortunately. However, if this interests you, we encourage you to submit it as a separate wish in the 2022 survey later this year. --IFried (WMF) (talk) 18:14, 15 April 2021 (UTC) Reply
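A minimal sketch of LA2's bookkeeping idea above, purely illustrative (no such feedback loop exists in the current tools): collect word-level corrections observed during proofreading, and re-apply only the ones that look systematic to pages that have not been proofread yet.

```python
# Sketch: learn systematic OCR fixes (e.g. "bam" -> "barn") from proofread
# pages and apply them to raw OCR text of pages nobody has proofread yet.
import re
from collections import Counter
from difflib import SequenceMatcher

def word_corrections(ocr_text: str, proofread_text: str) -> Counter:
    """Count word substitutions between raw OCR and its proofread version."""
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    fixes: Counter = Counter()
    for op, i1, i2, j1, j2 in SequenceMatcher(None, ocr_words, good_words).get_opcodes():
        if op == "replace" and (i2 - i1) == (j2 - j1):
            for bad, good in zip(ocr_words[i1:i2], good_words[j1:j2]):
                fixes[(bad, good)] += 1
    return fixes

def apply_corrections(ocr_text: str, fixes: Counter, min_count: int = 3) -> str:
    """Apply only corrections seen often enough to look systematic."""
    for (bad, good), n in fixes.items():
        if n >= min_count:
            ocr_text = re.sub(rf"\b{re.escape(bad)}\b", good, ocr_text)
    return ocr_text
```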

Which OCR tools do you use the most, and why?

Latest comment: 4 years ago · 15 comments · 10 people in discussion
@Koreller: Thank you for this feedback! This is great to hear, especially since we are considering doing work to specifically improve Google OCR. In that case, what do you think are the top things that need to be improved about Google OCR? Thank you in advance! --IFried (WMF) (talk) 22:13, 3 March 2021 (UTC) Reply
@IFried (WMF): I think it would take:
  • (important imo) make the tool native (i.e. accessible by default)
  • (important imo) select only an area on which to run the OCR (for example to OCR a page with two columns, or only part of a page)
  • remove the hyphen "-" from hyphenated words when OCR is used
  • transform the straight apostrophe into a curved (typographic) apostrophe (see the sketch below)
  • (important imo) maybe it is possible to make settings available on the OCR? (I don't know if it is possible, but if it is, it could be interesting) — Koreller (talk) 20:43, 4 March 2021 (UTC) Reply
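A minimal sketch of the post-processing steps listed above (dehyphenation, line unwrapping, apostrophe curling), offered as an assumption about how a gadget or backend might do it, not as existing code:

```python
# Sketch: common cleanups applied to raw OCR text.
import re

def clean_ocr(text: str) -> str:
    # Join words split across lines: "exem-\nple" -> "exemple".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Unwrap single line breaks inside paragraphs, keep blank lines as breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Replace straight apostrophes with typographic (curly) ones.
    text = text.replace("'", "\u2019")
    return text

print(clean_ocr("l'exem-\nple du\ntexte\n\nNouveau paragraphe"))
# -> "l'exemple du texte\n\nNouveau paragraphe"  (with a curly apostrophe)
```

Blanket apostrophe replacement can misfire on quoted text, so a real tool would need finer rules or a per-wiki setting.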
  • I sometimes use Google OCR, but I mostly work on archive.org books, where I find an excellent structured OCR (_djvu.xml, and recently hOCR). When dealing with texts that don't come from archive.org, I use a personal ABBYY FineReader application. --Alex brollo (talk) 09:12, 3 March 2021 (UTC) Reply
@Alex brollo: Thank you for sharing this! Can you let us know more about when you choose to use Google OCR vs. when you choose to use another tool, such as on archive.org? Perhaps you can give us some specific examples? We are asking because we hope to improve Google OCR during the project, so we would like to identify its greatest weaknesses and pain points for users at this time. Thank you in advance! --IFried (WMF) (talk) 22:14, 3 March 2021 (UTC) Reply
@IFried (WMF): I always use the IA OCR (wrapped into archive.org DjVu files or, by now, into DjVu files built by the IA Upload tool), except in rare cases where the column segmentation of the text is wrong (when the archive.org OCR engine guesses nonexistent columns, a not unusual problem in play texts in verse) or when I work on pages that I didn't upload personally. My present "loading style" is effective but a little difficult to explain; in brief, I download the archive.org _djvu.xml file and work on it offline, then upload the resulting text into nsPage by bot or by the MediaWiki Split tool.
Using Google OCR, I noted excellent character recognition, but sometimes some words or groups of words are moved away from their right place - a very annoying thing. --Alex brollo (talk) 16:30, 4 March 2021 (UTC) Reply
  • On frwikisource, we only have the Phetool OCR (Tesseract), which, after having been unavailable for months (almost a year, I think), was finally fixed a few months ago... thanks to the nice devs who finally found what the problem was (something about not emptying the memory, if I remember well)... - which is why many of us got into the habit of asking a contributor who has ABBYY to OCR books before we correct them, but this means relying heavily on the goodwill of a single contributor

Generally, I find the Tesseract tool very reliable on recent books (19th or 20th century), generally improving on the recognition from older Gallica or Google scans - but never better than ABBYY... -- On old texts (18th century and before), no OCR is really reliable, but contributors have developed scripts that allow automatic correction of fairly frequent errors, and that makes proofreading easier... -- I would be glad to test the Google OCR, if I could activate it on frwikisource - but it is not available in gadgets. -- If it were possible to have the ABBYY FineReader tool for difficult texts (some online version), I think it could really be very interesting for difficult books --Hsarrazin (talk) 08:19, 9 March 2021 (UTC) Reply

  • I often use Google OCR, which has improved very much in the last year or two. It often produces better results than the original OCR of files from HathiTrust or Archive.org. MediaWiki also extracts the original OCR layers of PDF documents very badly, and so I use Google OCR to replace them. There are still some serious problems with it, which I will describe in the sections below. --Jan Kameníček (talk) 10:16, 9 March 2021 (UTC) Reply
  • On Bangla Wikisource, I just use GoogleOCR, but it sometimes fails to render old characters, like a weird full stop, or some page ends. --Greatder (talk) 05:22, 17 March 2021 (UTC) Reply

For Hebrew

  • I use a non-free (and pretty expensive) OCR software package called ABBYY FineReader. I also use it professionally (I work for an Israeli publishing house). The reason is that it is far superior to free OCR tools for Hebrew: I encounter far fewer scanning errors with ABBYY FineReader than with any free OCR tool I have tried. However, when proofreading books uploaded by others to Commons, I use the OCR Gadget, because it is already there, unless the quality of the OCR is too poor to be usable. --Naḥum (talk) 09:51, 18 January 2021 (UTC) Reply
@Nahum: Thank you so much for this feedback! We are also curious to know your opinion on how Google OCR handles RTL text. Would you say that the OCR Gadget does a better job than Google OCR -- and, if so, why? Also, is it possible to provide us with some examples of the superior OCR quality that you find with ABBYY FineReader over the free tools? This would help us identify problem areas and see what solutions may be possible. Thank you in advance! --IFried (WMF) (talk) 22:15, 3 March 2021 (UTC) Reply

For Indic wikisource

@Jayantanth: Thank you for this feedback! Between Indic OCR and Google OCR, which tool do you think is the best, in your experience? When do you choose to use one tool over the other? Thank you! --IFried (WMF) (talk) 22:17, 3 March 2021 (UTC) Reply

For Neapolitan wikisource

Book to Test Support for Non-English

Stumbled across this book that has over 500 languages (higher quality scans, lower quality scans). It's probably a great way to test multilingual support. Languageseeker (talk) 05:42, 13 March 2021 (UTC) Reply

What are the most common and frustrating issues you encounter when using OCR tools?

Latest comment: 3 years ago · 21 comments · 10 people in discussion
  • the default OCR works well, but google is superior for marginal scans, or non-latin characters. texts before around 1870 are harder for the OCR, introducing more errors. OCR errors tend to be systematic per work, leading some to use find and replace for repeated errors. two-column texts are a problem requiring much hand zipping Slowking4 (talk) 02:39, 18 January 2021 (UTC) Reply
@Slowking4: Thank you for this information! Overall, do you tend to use the OCR Gadget ("basic OCR") or Google OCR more often, and why? Also, thank you for providing information on how Google OCR tends to work better for older scans and non-Latin languages. However, we also understand that support for older books and multiple-column texts is currently lacking. For this reason, we want to analyze whether we can improve support for multi-column books. We don’t know yet, but we’ll see! We look forward to hearing your response to our question on whether you prefer the OCR Gadget or Google OCR. Thank you! --IFried (WMF) (talk) 22:19, 3 March 2021 (UTC) Reply
thanks for the effort. i will use basic OCR first as it handles 2 columns better. but will test if google OCR is better for a work's scan, and then use it to get an un-proofread version, (red) to improve later. hard to determine the quality of the text layer, except by trial and error. sometimes, after saving an un-proofread version, then we will paste in a scrape from gutenberg. for works with a lot of greek and latin characters, (natural history survey books) google is better. for works with a lot of French accents, google is better also. the basic OCR seems to like modern fonts, so for older editions, (around 1870) google is better. for really bad scans, (around 1840) then neither work well, and text is from zero, like handwriting. (i.e. [2]) for tables, and math equations, neither work well, so we have to do by hand. google OCR is slower as basic loads on opening the new page, and for google you have to press the button. (this is English Wikisource) Slowking4 (talk) 23:35, 3 March 2021 (UTC) Reply
@Jayantanth: Thank you for sharing this! Can you provide more information on what is not good about the OCR? If you have specific examples of the errors or issues you are seeing, that would be very helpful. Thank you! --IFried (WMF) (talk) 17:20, 9 March 2021 (UTC) Reply
  • The OCR output for two-column or multi-column pages is not as expected. The output should be column-wise, but the actual output is line-wise: first line from the first column, first line from the second column, then second line from the first column, second line from the second column, and so on. The desired result is the first, second, third, etc. lines from the first column, followed by the second column. In the case of dictionaries this is very important; without column recognition the OCR result is practically useless, with so many rearrangements needed to correct it, so currently we type such pages directly. Previously, the tool OCR4wikisource (by @Tshrinivasan:) had an option to specify the number of columns in the image; the tool would split up the image vertically, send the parts to the OCR, collect the data sequentially and then give us the desired output (a sketch of this column-splitting approach follows after this exchange). Using this approach on Tamil Wikisource we have OCRed many dictionaries with two columns. But, for a reason I don't know, the tool no longer supports multi-column. Is there any way to specify the columns in other OCRs like Google or Indic OCR? -- Balajijagadesh (talk) 02:34, 27 January 2021 (UTC) Reply
    • @Balajijagadesh: Thank you so much for this feedback! We have been able to reproduce this problem in our own tests, and we understand that this is very frustrating for Wikisource editors. We have created a ticket to analyze this problem and see if there is anything we can do to improve the situation (T277192). Once we have more details on this analysis, we will share it on the project page. Thank you so much for bringing this to our attention! --IFried (WMF) (talk) 17:45, 12 March 2021 (UTC) Reply
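For reference, a minimal sketch of the column-splitting approach described above (splitting the image into equal vertical strips, OCRing each strip, then concatenating the results); this is an illustration of the idea, not OCR4wikisource's actual code, and the Tamil language code and file name are assumptions:

```python
# Sketch: OCR a multi-column page by cropping equal-width vertical strips.
from PIL import Image
import pytesseract

def ocr_columns(image_path: str, n_columns: int, lang: str = "tam") -> str:
    page = Image.open(image_path)
    width, height = page.size
    column_texts = []
    for i in range(n_columns):
        left = i * width // n_columns
        right = (i + 1) * width // n_columns
        strip = page.crop((left, 0, right, height))
        column_texts.append(pytesseract.image_to_string(strip, lang=lang))
    return "\n".join(column_texts)

# Hypothetical usage for a two-column Tamil dictionary page:
# print(ocr_columns("dictionary_page_042.png", n_columns=2))
```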
  • I often have issues with OCR on PDF files. E.g. this file was uploaded in 2019 and it is still not possible to use the OCR button on it. Google OCR works here.

And one more problem is with the dictionary - OCR services probably use some dictionary and try to correct words according to it, but some words in older texts are always replaced with different, modern ones (in old Czech texts, v pravo means "on the right"; in the modern language it is vpravo, so this word is not in the dictionary, and the OCR almost always (99%) writes v právo, "to the law"). JAn Dudík (talk) 15:24, 28 January 2021 (UTC) Reply

    • @JAn Dudík: Thank you so much for this feedback! During the course of this project, we will probably be focusing on improving Google/Wikimedia OCR more extensively than other OCR tools (since it is what we worked on in the past). However, if you are experiencing issues with Basic OCR, perhaps you can report it in Phabricator so we can have it documented. In that case, is there a ticket for this issue? As for the issue with correcting words, we understand that this can be very frustrating. Unfortunately, it is out of scope for this project, since we won’t be focusing on improving the actual rendering of text itself, but rather we will focus on improving the efficiency and reliability of the OCR tools. Again, thank you! --IFried (WMF) (talk) 17:47, 12 March 2021 (UTC) Reply
  • The most frustrating thing was when, for 1-2 months, I didn't know whether the existing OCR button (the native one) was really the basic one, because it didn't work, which forced me to find an alternative → Google OCR. One of the other problems is the use of the OCR tool on columns, which really doesn't work well — Koreller (talk) 17:55, 18 February 2021 (UTC) Reply
    • @Koreller: Thank you so much for sharing this! Both of the issues you described -- i.e., not knowing which OCR tool to use, and experiencing issues with multiple-columned texts -- are also issues that we have identified and will be exploring as a team. We have written about some of these issues in our first project update, but we will provide more information on them as we dig deeper into the project. In that case, please stay tuned and thank you again for your feedback! --IFried (WMF) (talk) 17:43, 19 March 2021 (UTC) Reply
  • I would like the option to get hOCR instead of plain text, since its text structure can be fixed using the absolute coordinates of the text fragments; hOCR could also be used to get some formatting suggestions (font size, indents, centered text...) via local jQuery scripts (see the sketch at the end of this thread). --Alex brollo (talk) 09:18, 3 March 2021 (UTC) Reply
  • It would be nice to have OCR just work normally on languages for which it doesn't have a dictionary to refer to. On Neapolitan Wikisource it is a disaster, which becomes worse when dealing with texts from the 18th century or before. --Ruthven (msg) 19:42, 3 March 2021 (UTC) Reply
    • @Ruthven: Thank you for providing this feedback! If we understand correctly, you are requesting the ability to have OCR tools not automatically determine the language, since the automatic choice is sometimes incorrect. Is that what you are saying? Once we have more information, we can look into what may be appropriate next steps. Thank you! --IFried (WMF) (talk) 16:29, 30 March 2021 (UTC) Reply
    @IFried (WMF): I don't know the technical details behind OCR software, but yes, it would be useful to teach the OCR the dictionary of a specific language. If it selects Italian for Neapolitan texts, 70% of the words will be erroneous, which means that only 30% of the words will be correct. But isn't it more precise to just recognise single characters instead of complete expressions of a given language in this case? --Ruthven (msg) 19:40, 2 April 2021 (UTC) Reply
  • Google OCR does not join lines together as required on Wikisource but leaves them separate, which causes problems with further formatting; the lines have to be joined manually or using some other tool added to the local commons.js, which is not very friendly to inexperienced newcomers. --Jan Kameníček (talk) 10:16, 9 March 2021 (UTC) Reply
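Following up on Alex brollo's hOCR suggestion above, a minimal sketch (an assumption, not an existing gadget) of reading word-level bounding boxes from hOCR output; client-side scripts could use these coordinates for overlays or formatting hints:

```python
# Sketch: extract word text and bounding boxes from an hOCR document.
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def hocr_words(hocr_html: str):
    soup = BeautifulSoup(hocr_html, "html.parser")
    for word in soup.find_all(class_="ocrx_word"):
        match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", word.get("title", ""))
        if match:
            x0, y0, x1, y1 = map(int, match.groups())
            yield word.get_text(), (x0, y0, x1, y1)

# Hypothetical usage:
# for text, box in hocr_words(open("page_0016.hocr").read()):
#     print(text, box)
```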

Which problems, overall, do you find the most critical to fix, and why?

Latest comment: 3 years ago · 2 comments · 2 people in discussion
  • I have read above that the tech team aims at improving Google OCR tool in this project. The biggest problem I can see with this tool is that it has to be specially switched on in preferences and is not available by default. That is not a problem for experienced users, but newcomers usually do not learn about its existence in the beginning at all. I was told that the privacy policy doesn't currently allow us to turn it on by default. However, we do need an excellent OCR tool that can be turned on by default because only such a tool can help to keep the newcomers in the project. --Jan Kameníček (talk) 10:16, 9 March 2021 (UTC) Reply
    • @Jan.Kamenicek: Thank you for sharing this! We completely agree that, ideally, there should be an OCR tool turned on by default. This would be much better behavior for both newcomers and all participants in the project. We will investigate if this is possible in the project. Also, thank you for bringing up that there may be privacy policy issues. We will investigate this question as well. Thank you again, and we will provide more information on this topic in future project updates! --IFried (WMF) (talk) 16:33, 30 March 2021 (UTC) Reply

Anything else you would like to add?

Latest comment: 3 years ago · 16 comments · 6 people in discussion
  • I think it would be useful if the new OCR tool will continue to expose an API endpoint (like phetools or google-ocr) to fetch the OCR for a page, so users could build tools around it to automate tasks.
  • Providing an optional way to indicate the OCR's certainty level would be great. For example, allowing editors to switch on highlighting for the parts of the text that the OCR is less sure about. This would be a great way to allow editors to concentrate on the parts of the text that the OCR had problems with, quickly checking these parts and making appropriate fixes as necessary. --Yodin T 13:05, 31 January 2021 (UTC) Reply
    • @Yodin: Thank you so much for providing this feedback! Regarding your first point, this is a great idea, and we’ll discuss with the team if it is possible. We have created a ticket for this work (T278444). Please feel free to add any relevant details to that ticket. Regarding your second point, this is a very exciting idea, but it may be too large in scope for the team to take on. For both of these ideas, we will discuss them as a team, and if there are any updates, we will share them on the project page. Thank you! --IFried (WMF) (talk) 16:46, 30 March 2021 (UTC) Reply
  • Recognition on texts with columns is rarely good; maybe there should be a system to manually choose several areas of the image to be recognized, which would facilitate this recognition. — Koreller (talk) 17:55, 18 February 2021 (UTC) Reply
  • The existing tool (and also google-ocr) fails when the text is set in columns. Even if this is not frequent, it is quite annoying, since this kind of text is also quite tough for many external programs. Here are a few examples of pages sharing this layout: , , and . Please notice the different number of columns, the rulers or empty space between columns, and the headers. So it would be nice if the coming tool could cope with these problematic pages. Draco flavus (talk) 19:28, 2 March 2021 (UTC) Reply
    • @Koreller and Draco flavus: Thank you so much for this feedback! We completely agree that this is a major issue. For this reason, we have created a ticket (T277192), which the engineers have already begun to investigate. We hope to see that there is some improvement that we can make on this front. Thank you again! --IFried (WMF) (talk) 16:50, 30 March 2021 (UTC) Reply
  • Hello, the first thing I would like to add is a big THANKS. This document helped me greatly. Actually, even better than that, it helped me find a solution for someone who asked me for help, when I wasn't sure that anything with a low enough technical entry ticket existed that I could provide as an answer. The problem I was asked about was how to improve transcription of pages from Catalogue de l'histoire de l'Amérique, V like this one. The issue, as you can see, is that the default OCR doesn't provide anything useful. The document was hosted on the French Wikisource. After a quick search for "OCR" on this same wiki and going through a few links, I landed here. I started to read it, and arrived at the Google OCR section, which made me rush to the Gadget section of my preferences on the French Wikisource. But it was nowhere to be found. I read the documentation more carefully and saw that it seemed to be only available on the multilingual Wikisource. Luckily, this book is actually multilingual. So I re-indexed the first volume there, enabled the gadget, and here is the result of recognition of the same page: now that's really something that is going to help my friend. So, my recommendation I guess is obvious: provide the Google OCR gadget in all languages where it can give a better result than the default one. If that is the case most of the time everywhere, just switch the default and leave the other one as a gadget option (you never know...). Note that I didn't try the last option: I have no doubt I would be able to deploy it, but that is not an option for my friend to work autonomously. So a second suggestion: provide OCR4Wikisource as a Toolforge service, where people just give the URL of the index page, and the tool does all the work using some bot account. Not everyone will be at ease with a command line, but anyone can copy-paste a URL and click "Go!". I admit that should be done with a bit more reflection on "what could go wrong?", but basically that's it. As for Indic OCR, I didn't try it either, but if it works better in some cases, just make it the default where relevant, and keep it as an optional gadget everywhere else. Here is what I would find fine:
    • all Wikisource page editing comes by default with a single OCR button, which is set to whatever works best for the current language according to experience
    • all Wikisource allow to optionally enable the same set of OCR
    • the index page has a "bulk OCR" drop-down button that opens a wizard which allows previewing samples, depending on which OCR tool option you select, before validating to launch the job. --Psychoslave (talk) 21:29, 5 March 2021 (UTC) Reply
@Psychoslave: Wow, thank you for such lovely and thorough feedback! We are so happy that the project page provided helpful context for you as well. Regarding your specific recommendations, here are some thoughts:
  • Provide the Google OCR gadget in all languages: Yes, we also agree that there should be some default OCR that is accessible to all users, without an installation process required. We have a ticket to look into this (T275547), and we may also write different or more tickets on this in the future. The ultimate goal will be to make a default OCR tool available.
  • OCR4Wikisource as a toolforge service: Thank you for this wonderful idea! We have one question for you: Would you still want this feature if we created a default way to access bulk OCR support within Wikisource? We are asking because we currently have an investigation lined up to try to make this possible (T277768).
  • All Wikisource allow to optionally enable the same set of OCR: Can you clarify what you mean by this?
  • The index page has a "bulk OCR" button: As written above, this is a big priority for us, so we plan to investigate this very soon (T277768).
Thank you, and we look forward to your response! --IFried (WMF) (talk) 16:56, 30 March 2021 (UTC) Reply
Would you still want this feature if we created a default way to access bulk OCR support within Wikisource?
I think that would be even far better in terms of integration. So if you manage to make that happen successfully, all congrats in advance. On the other hand, good integration is a far tougher challenge than a compatible external tool. The latter is also more aligned with a more distributed platform, as opposed to a single monolithic application. My impression is that WMF is trying to lead the platform toward something composed of more autonomous software components, but you are certainly better informed than me on that point. So, it's all a question of trade-offs. Sure, for end users, better integration in a single interface seems more appealing, as it creates a less complex environment to grasp in order to contribute. On the other hand, that might bring the platform to a more tightly coupled state of affairs on the software development side. Sure, that last point is not necessarily inevitable: software architecture, good practices and tools can prevent a lot of wild interference. Moreover, if you have the resources to do both, why choose? Of course, in our finite world, most of the time we don't have infinite resources to keep all options maintained in parallel indefinitely. You can, however, evaluate how easy it will be to implement the independent service, and if it is likely to be a far easier task to achieve, implement that first. And then, as you already have your minimal viable product, you can consider throwing more resources at the more desirable, well-integrated solution. Well, that's all high-level strategy on the implementation process, and probably all things you are perfectly aware of, but sometimes when we are in real action conditions it's harder to step back.
All Wikisource allow to optionally enable the same set of OCR
Can you clarify what you mean by this?
I think it was simply my misconception that this was not the case. I should check that again to see if I can indeed achieve the same kind of various OCR calls, whatever the language version I try.
Thank you for your feedback, and it's always a pleasure to help with a few ideas if it can help to improve the environment for all contributors. Cheers, Psychoslave (talk) 14:26, 7 April 2021 (UTC) Reply
  • Besides better OCR, we need better quality images. No matter how good the OCR is, the ultimate correction will depend on having texts that are easy to read. The current approach of extracting images from either PDF or DjVu files on Wikisource is fundamentally flawed because both rely on heavily downgraded images. Such images are also not suitable for the use of crop tools.
    • Support full-quality JP2, or PNG converted from the JP2 on IA.
    • Automatically redo OCR periodically if the page has not been proofread or the text does not come from a merge-and-split
    • Support hOCR or ALTO to provide some formatting
    • Discuss the way of reintegrating corrected text back into the original text. This is a future request.

--Languageseeker (talk) 03:44, 9 March 2021 (UTC) Reply

@Languageseeker: Thank you so much for this feedback! Can you provide some specific examples of low quality images impacting the OCR, which we can analyze? As for the comment on automatically updating the OCR-ed pages that have not been proofread, this is an interesting idea and we understand that it could be valuable. We will discuss this as a team. As for the hOCR/ALTO comment, we have also heard that feedback from other Wikisource community members. We have written a ticket for this (T278839) and will discuss it as a team. Finally, regarding the reintegration of corrected text comment: This is a great idea! We don’t know if we can do this, since it is out of the specific focus area of this project. But thank you for bringing this up, and we hope someone can take up this feature request after we improve the OCR tools. --IFried (WMF) (talk) 17:01, 30 March 2021 (UTC) Reply
  • Rather than a single language, it’s better to use a range of characters because there are many books that mix languages. A book can contain Hebrew, French, and English all at once. Just proofreading for English will leave considerable work for volunteers. Languageseeker (talk) 19:08, 10 March 2021 (UTC) Reply
@Languageseeker: Thank you for providing this feedback! Can you provide some specific examples of books with multiple languages that are difficult to properly OCR? Once we have these examples, we can analyze them. Thank you! --IFried (WMF) (talk) 17:06, 30 March 2021 (UTC) Reply
@IFried (WMF): I think that this book is the ultimate test because it contains text in 500 languages. [3]
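One relevant detail for the mixed-language discussion above: Tesseract itself already accepts several languages at once, joined with "+". A minimal sketch, with a placeholder file name and assuming the corresponding traineddata files are installed:

```python
# Sketch: OCR a page that mixes English, French, and Hebrew in one pass.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(
    Image.open("mixed_language_page.png"),
    lang="eng+fra+heb",
)
print(text)
```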
  • For the columns, it's probably better to develop a tool that can split pages, run the OCR on the individual columns, allow for proofreading the individual columns, and then automatically reassemble the transcribed text. Not only will this make OCRing easier, it will also make proofreading easier. Languageseeker (talk) 02:00, 12 March 2021 (UTC) Reply
@Languageseeker: Thank you for providing this feedback! We are investigating the multiple column issue (T277192) right now, since we agree this is a major issue to address. We’ll provide an update on the project page when we know more. Thanks! --IFried (WMF) (talk) 17:07, 30 March 2021 (UTC) Reply
  • It occurs to me that effective OCR has been hampered by a series of conscious and unconscious biases that have crept into the software.
    1. Conscious: A text has a single, specific language. In fact, printers did not think of texts in terms of specific languages but rather of specific pieces of type. Any text can have multiple languages, but these are just collections of type pieces. When OCR was developed, it ignored this historical reality because it focused on digitizing documents for professionals.
    2. Unconscious: There is a universal algorithm to recognize a particular character. If this algorithm can be perfected, then OCR will have 100% accuracy. In fact, letters can look extremely different. Take the letter a and look at 15th-century texts, 17th-century texts, blackletter, and Fraktur. They all look remarkably different. No algorithm can actually capture them all. The current algorithms are biased towards how letters appeared at the end of the twentieth century.
    3. Conscious: OCR should not include human intervention. In fact, through proofreading, we're already introducing human intervention after the fact.
    4. Conscious: Confidence levels should be hidden from the human; they're statistical curiosities at best. In fact, highlighting low-confidence characters in a visual form can help guide proofreading.
  • To develop better proofreading, I think it's important to review two basic facts about how a text was created.
    1. Typesetters set a book one character at a time. Therefore, the character, and not the word, is the basic unit.
    2. Typesetters had to maintain a consistent visual look. Therefore, they had bins filled with characters that were visually identical. Furthermore, these characters were grouped into sets that attempted to provide visual consistency across a work; these were called typefaces. Following from this, there is a set number of possible representations of any character.
  • To think about achieving OCR, what if, instead of trying to guess what a particular character is, we try to reverse the typesetting process? That is, we take the typeset text, break it down into individual characters, group them into bins, and ask a human to label the bins. I imagine that it would go something like this:
    1. Take a book.
    2. Separate it into individual characters. Mark each character with an individual tag.
    3. Run an algorithm to group similar characters. The computer won't know what the characters are, just that there is an Image_Group_1 with 5,000 characters that have the same appearance.
    4. Make a grid where one image from each character group is displayed. A person can then label the image character with both the machine character and any formatting. For example, this image represents a lower-case italic i; therefore, Image_Group_1 represents an italic i. Defective characters would need to be labelled as variants.
    5. The OCR would then replace all instances of Image_Group_1 with ''i''
    6. For the characters that are on the margins of a character group, the OCR should also tag them to make them visually easy to spot during proofreading.

This first identification would become the basis for a raster font that can then be used on other books. Over time, we would build a library of raster font files to use for comparison. For instance, we would have a raster Fraktur font, a raster Garamond font, a raster Petrus Caesaris and Johannes Stol Type 1:109R [4]. Over time, the need for human intervention will decrease as we teach the OCR program which images correspond to which character. This would be a variant of matrix matching, relying on the fact that no new typefaces will be added from the past and that comparing images is much faster now than in the 1980s/1990s. After we group all the characters in a book, we'll probably have around 200-400 groups to compare to a library of several thousand identified characters. We also won't have to worry about language because we'll just be comparing images. To get more information about a particular character, we can initially feed the OCR software books with sample type. Lastly, this can also help to add in other type features, such as horizontal lines, which the program would recognize as just another character. Languageseeker (talk) 04:28, 13 March 2021 (UTC) Reply

@Languageseeker: Thank you so much for this thorough analysis and recommendation! It’s very exciting to see people think through how OCR tools can be fixed and improved from the ground up. Unfortunately, we will not have the capacity or resources to build a new OCR tool in this project, and we will not have the resources to write out a new OCR algorithm either. However, there is still a lot of other work we can do to improve the existing tools, including some of the work you have suggested in previous comments. Thank you again for this feedback and we hope to hear more from you after we share our next update! --IFried (WMF) (talk) 17:08, 30 March 2021 (UTC) Reply
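For readers curious how step 3 of the proposal above might look in practice, a very rough sketch of segmenting dark connected components (candidate glyphs) and clustering them by appearance. It is purely illustrative, uses a placeholder file name, and glosses over the hard parts (touching glyphs, skew, size normalization, italics):

```python
# Sketch: find connected dark components on a binarized page and group them
# into visually similar bins that a human could then label.
import numpy as np
from PIL import Image
from scipy.ndimage import label, find_objects
from sklearn.cluster import KMeans

page = np.array(Image.open("page_0001.png").convert("L"))  # hypothetical scan
binary = page < 128                                         # dark pixels = ink

labels, n_glyphs = label(binary)
patches = []
for sl in find_objects(labels):
    glyph = binary[sl].astype(float)
    # Normalize every candidate glyph to a 16x16 patch for comparison.
    patch = np.array(Image.fromarray((glyph * 255).astype(np.uint8)).resize((16, 16)))
    patches.append(patch.ravel() / 255.0)

# Group similar-looking glyphs; a human would label one sample per group.
kmeans = KMeans(n_clusters=min(100, len(patches)), n_init=10, random_state=0)
groups = kmeans.fit_predict(np.stack(patches))
print(f"{n_glyphs} glyph candidates grouped into {groups.max() + 1} bins")
```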

Current version is very good but some tweaks are needed.

Latest comment: 3 years ago · 1 comment · 1 person in discussion

The current version of the OCR in en.wikisource works well. It recognizes columns and separates paragraphs. But, it cannot recognize double quotes properly and displays them as two single quotes, or just a single quote, most of the time.

I have done extensive tests comparing Google OCR and our own, which is better overall. The speed of OCR reproduction is not relevant on Wikisource when proofreading a page. Ineuw (talk) 20:17, 15 April 2021 (UTC) Reply
