[Gambas-user] Enhance image for OCR pure gambas code.
KKing
kicking177 at gmail.com
Mon Mar 22 11:03:52 CET 2021
Hi Martin,
Can you confirm whether you are trying to read text from a PDF, read an
image containing text that is embedded within a PDF, or convert a PDF
to an image?
If you are trying to read text from a PDF then, unless the PDF was
created from a graphic, there are PDF tools for extracting the text
directly.
If the text in the PDF comes from an embedded graphic then the result
will depend on the dpi of the embedded image. Images intended for
printing are usually 300dpi, but embedded images taken from the web are
sometimes only 72dpi.
I found tesseract needed an image to be at least 300dpi to get
reasonable results.
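To make the dpi point concrete, here is a small sketch of how one might
estimate the effective dpi of an embedded image and the scale factor
needed to reach 300dpi. The function names and the 8.5 inch page width
are my own assumptions for illustration, not anything from tesseract.
Bear in mind that resampling cannot restore detail that was never
captured, so this only helps up to a point.

```python
def effective_dpi(pixel_width: int, printed_width_inches: float) -> float:
    """Pixels per inch when the image is shown at the given physical width."""
    return pixel_width / printed_width_inches


def upscale_factor(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Scale factor needed to reach target_dpi; never shrinks the image."""
    return max(1.0, target_dpi / current_dpi)


if __name__ == "__main__":
    # Hypothetical example: a 612 pixel wide scan filling an 8.5 inch page.
    dpi = effective_dpi(612, 8.5)
    print(dpi, upscale_factor(dpi))
```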
If you were talking about converting a PDF to an image then I'm not sure
that is the best way forward.
Re OCR, I found tesseract very "finicky". What it achieves is very good,
but when I was trying to use it on numeric and financial data I found it
quite frustrating: essentially you could not rely on it, and it still
required a keen eye or algorithmic checks to highlight where numbers
were not read as expected.
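As an illustration of the kind of algorithmic check I mean, the sketch
below flags OCR output where a value is not a clean number or where a
column of items does not add up to its stated total. The function names
and the sample figures are made up for the example, and it assumes
amounts use "," as a thousands separator and "." as the decimal point.

```python
import re
from decimal import Decimal

# Assumed format: optional minus sign, digits, optional decimal part.
_AMOUNT = re.compile(r"-?\d+(\.\d+)?")


def parse_amount(text):
    """Return a Decimal, or None if the OCR text is not a clean number."""
    cleaned = text.replace(",", "").strip()
    if _AMOUNT.fullmatch(cleaned) is None:
        return None
    return Decimal(cleaned)


def check_column(items, total):
    """Return a list of problems found in an OCR'd column and its total."""
    problems = []
    values = []
    for index, raw in enumerate(items):
        value = parse_amount(raw)
        if value is None:
            problems.append(f"item {index}: unreadable value {raw!r}")
        else:
            values.append(value)
    expected = parse_amount(total)
    if expected is None:
        problems.append(f"total: unreadable value {total!r}")
    elif not problems and sum(values) != expected:
        problems.append(f"sum {sum(values)} does not match total {expected}")
    return problems


if __name__ == "__main__":
    # The letter "O" in place of a zero is a typical misread.
    print(check_column(["1O0.50", "20.25"], "120.75"))
```

A check like this will not tell you which digit was misread, only that
the column needs a second look.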
I also found that I got better results from compiling the most recent
tesseract from source, though it still fell short for my purposes.
I ended up crafting code to pixelate the areas containing numbers and
deduce the numbers by pattern matching the black vs white pixels
against templates I held for each digit.
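A minimal sketch of that idea, not my actual code: each digit is held
as a small black/white grid, and a candidate cell is scored by the
fraction of pixels that agree with each template. The 3x5 templates
below are invented for the example; real ones would be built from the
font in the documents, and the cell would first have to be cropped and
reduced to the same grid size.

```python
# Hypothetical 3x5 templates: "#" is a black pixel, "." is a white pixel.
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "2": ["###", "..#", "###", "#..", "###"],
    "3": ["###", "..#", "###", "..#", "###"],
    "4": ["#.#", "#.#", "###", "..#", "..#"],
    "5": ["###", "#..", "###", "..#", "###"],
    "6": ["###", "#..", "###", "#.#", "###"],
    "7": ["###", "..#", "..#", "..#", "..#"],
    "8": ["###", "#.#", "###", "#.#", "###"],
    "9": ["###", "#.#", "###", "..#", "###"],
}


def similarity(cell, template):
    """Fraction of pixels that are identical in two equally sized grids."""
    pairs = [
        (a, b)
        for cell_row, tmpl_row in zip(cell, template)
        for a, b in zip(cell_row, tmpl_row)
    ]
    return sum(a == b for a, b in pairs) / len(pairs)


def match_digit(cell):
    """Return (digit, score) for the template that agrees best with cell."""
    best = max(TEMPLATES, key=lambda d: similarity(cell, TEMPLATES[d]))
    return best, similarity(cell, TEMPLATES[best])


if __name__ == "__main__":
    # An "8" with its top-left pixel lost to noise.
    noisy = [".##", "#.#", "###", "#.#", "###"]
    print(match_digit(noisy))
```

A low best score is a useful signal in itself: it marks cells that
should be checked by eye rather than trusted.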
K.