[Gambas-user] Enhance image for OCR pure gambas code.

Mon Mar 22 11:03:52 CET 2021

Hi Martin,
Can you confirm whether you are trying to read text from a PDF, read an 
image containing text that is embedded within a PDF, or convert a PDF 
to an image?
If you are trying to read text from a PDF then, unless the PDF was 
created from a graphic, there are PDF tools for extracting the text 
directly.
If the text in the PDF comes from an embedded graphic then results will 
depend on the dpi of the embedded image. Images intended for printing 
are usually 300dpi, but embedded images taken from the web are 
sometimes only 72dpi.
I found tesseract needed an image of at least 300dpi to give 
reasonable results.
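
To put rough numbers on the 72dpi vs 300dpi gap, here is a small sketch 
of the arithmetic. It is in Python rather than Gambas, the function 
names are made up for illustration, and the 300dpi target is simply the 
figure mentioned above, not a documented tesseract threshold.

```python
# Hypothetical helpers; 300 dpi is the working figure quoted in this thread.
def upscale_factor(source_dpi, target_dpi=300):
    """Linear scale factor needed to bring an image up to target_dpi."""
    return target_dpi / source_dpi

def em_height_px(point_size, dpi):
    """Nominal height in pixels of a font's em box (1 point = 1/72 inch)."""
    return point_size * dpi / 72

print(round(upscale_factor(72), 2))      # 4.17
print(em_height_px(10, 72))              # 10.0
print(round(em_height_px(10, 300), 2))   # 41.67
```

So 10pt text in a 72dpi image has only about 10 pixels of height to 
work with, against roughly 42 at 300dpi. Resampling a 72dpi image by 
that factor does not bring back detail that was never captured, so 
whether it helps is something to test on your own images.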
If you were talking about converting a PDF to an image then I'm not sure 
that is the best way forward.
Re OCR, I found tesseract very "finicky". What it achieves is very 
good, but when I was trying to use it on numeric and financial data I 
found it quite frustrating: essentially you could not rely on it, and 
it still required a keen eye or algorithmic checks to highlight where 
numbers were not read as expected.
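
As an example of the kind of algorithmic check meant here, the sketch 
below flags tokens that do not look like a money amount and tests 
whether the line items add up to the stated total. It is Python rather 
than Gambas; the amount format (optional thousands commas, exactly two 
decimals), the function names and the sample values are all assumptions 
made up for illustration.

```python
import re
from decimal import Decimal

# Assumed format: optional minus, digits with optional thousands commas,
# then exactly two decimal places.
AMOUNT_RE = re.compile(r"-?(\d{1,3}(,\d{3})*|\d+)\.\d{2}")

def suspicious_tokens(tokens):
    """Return (index, token) pairs for tokens that do not look like an amount."""
    return [(i, t) for i, t in enumerate(tokens) if not AMOUNT_RE.fullmatch(t)]

def total_matches(item_tokens, total_token):
    """True if the items sum to the total; assumes every token passed the format check."""
    def to_decimal(token):
        return Decimal(token.replace(",", ""))
    return sum(to_decimal(t) for t in item_tokens) == to_decimal(total_token)

# Made-up OCR output: the second token contains the letter O instead of a zero.
ocr_items = ["1,250.00", "3O0.50", "42.10"]
print(suspicious_tokens(ocr_items))                                  # [(1, '3O0.50')]
print(total_matches(["1,250.00", "300.50", "42.10"], "1,592.60"))    # True
```

A format check catches letters read in place of digits, while the 
total check catches a digit misread as another digit, which still looks 
like a valid number.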
I also found that I got better results by compiling the most recent 
tesseract from source, though it still fell short for my purposes.
I ended up crafting code to pixelate the areas containing numbers and 
deduce the digits by pattern matching the black vs white pixels 
against templates I held for each digit.
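
A minimal sketch of that general idea, not the actual code or 
templates described above: it assumes each digit area has already been 
reduced to a 3x5 grid of black/white cells, and it picks the template 
with the fewest differing cells. It is Python rather than Gambas, and 
the templates, names and the mismatch threshold are made up for 
illustration.

```python
# Illustrative 3x5 templates, row by row ('#' = black, '.' = white).
TEMPLATES = {
    "0": "###" "#.#" "#.#" "#.#" "###",
    "1": ".#." "##." ".#." ".#." "###",
    "2": "###" "..#" "###" "#.." "###",
    "3": "###" "..#" "###" "..#" "###",
    "4": "#.#" "#.#" "###" "..#" "..#",
    "5": "###" "#.." "###" "..#" "###",
    "6": "###" "#.." "###" "#.#" "###",
    "7": "###" "..#" "..#" "..#" "..#",
    "8": "###" "#.#" "###" "#.#" "###",
    "9": "###" "#.#" "###" "..#" "###",
}

def mismatches(a, b):
    """Number of cells where two equally sized grids differ."""
    return sum(x != y for x, y in zip(a, b))

def read_digit(cell, max_mismatches=2):
    """Return (digit, mismatch_count); digit is None if no template is close enough."""
    digit, template = min(TEMPLATES.items(), key=lambda kv: mismatches(cell, kv[1]))
    count = mismatches(cell, template)
    return (digit if count <= max_mismatches else None, count)

# A "7" with one stray black cell in the fourth row.
noisy_seven = "###" "..#" "..#" ".##" "..#"
print(read_digit(noisy_seven))  # ('7', 1)
```

Returning the mismatch count alongside the digit makes it easy to flag 
low-confidence reads for a manual check rather than trusting them. At 
this coarse a grid some digits differ by a single cell (8 vs 0, 6 or 
9), so a real grid would need to be finer.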
K.