[Suggestion] "Training light" - Learning by doing - re-feeding a corrected text file to retrain tesseract #1442

Wikinaut · 2018-03-29T14:28:27Z

(probably something for #1423)

When I have a manually corrected version of the text (the txt output file), I would very much like an easy way to retrain tesseract from two inputs: (1) the original tesseract txt output and (2) the corresponding manually corrected txt file.

I remember asking this question some years ago, but the situation has changed since then (more and different developers, new algorithms).

Do you like the idea? Is it possible?

stweil · 2018-03-29T15:13:05Z

The idea is good and it is possible, but today it is not easy to do. Work remains to make it easy.

Wikinaut · 2018-03-29T15:16:35Z

@stweil Can I help with a donation?

stweil · 2018-03-29T21:22:31Z

See http://gepris.dfg.de/gepris/projekt/394264782 (German).

Wikinaut · 2018-03-29T21:38:49Z

I understand. Some years ago I developed a server (an open-source replacement for a Finereader appliance) with a scheduler and so on. Users could upload multi-page PDF documents through a web form (no size limit; we could OCR PDFs of more than 1,000 pages), and they received an e-mail with a link when their "OCR jobs" were ready.

Wikinaut · 2018-03-29T21:40:18Z

This was when I discovered that tesseract (at the time) internally re-encoded documents with lossy compression, which is a no-go for CG and for anything with sharp edges, such as fonts, the very thing we want to OCR.

amitdo · 2021-09-10T00:21:38Z

As @stweil already said, this feature was implemented years ago (fine-tuning an LSTM model).

We should not keep this 'issue' open for the 'make it easy' part. PRs to this repo or to the tesstrain repo are welcome.
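For readers wondering what the text side of "re-feeding a corrected file" could look like, here is a minimal sketch, not part of tesseract or tesstrain. It assumes the OCR output and the corrected file have roughly the same line structure, and it uses Python's standard `difflib` to pick out the lines a human actually changed. The function name `changed_line_pairs` is hypothetical. Fine-tuning also needs the matching line images, which this sketch does not handle.

```python
import difflib


def changed_line_pairs(ocr_text: str, corrected_text: str):
    """Return (ocr_line, corrected_line) pairs for lines that were corrected.

    Hypothetical helper for illustration only; assumes both texts are
    split into comparable lines.
    """
    ocr_lines = ocr_text.splitlines()
    fixed_lines = corrected_text.splitlines()
    matcher = difflib.SequenceMatcher(a=ocr_lines, b=fixed_lines, autojunk=False)

    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only pair up replacements that map the same number of lines;
        # insertions, deletions, and uneven replacements are skipped
        # because the line alignment is ambiguous there.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(ocr_lines[i1:i2], fixed_lines[j1:j2]))
    return pairs


if __name__ == "__main__":
    ocr = "GroBe StraBe\nHallo Welt\nFuB"
    fixed = "Große Straße\nHallo Welt\nFuß"
    for before, after in changed_line_pairs(ocr, fixed):
        print(f"{before!r} -> {after!r}")
```

The corrected lines found this way would be candidates for ground-truth transcriptions; lines the OCR already got right are left out because they carry no correction.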

Wikinaut mentioned this issue May 25, 2018

Fix some wrong German words (confusion B / ß) tesseract-ocr/langdata#54

Merged

zdenop added the feature request, training, and accuracy labels Sep 29, 2018

amitdo closed this as completed Sep 10, 2021
