[Gambas-user] Text paragraph format

Martin mbelmonte at belmotek.net
Mon Apr 19 18:09:12 CEST 2021


Hello everybody.
It turns out that the text extracted from a PDF is not formatted
correctly, but some things can be done to improve it.
If, for example, a line ends in ".\n" (dot + newline), we are almost
certainly at the end of a paragraph.
But there are also cases where the line is a title and does not end
with "." (dot); luckily, the next line then begins with a capital
letter.

Well now the question:

What is the regular expression, or another way, to replace \n# with
\n[:: jump ::]#, where # is any uppercase letter?

Replace "\n#"    >  "\n[:: jump ::]#"
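
One possible way to express this replacement is a lookahead: match a newline only when the next character is an uppercase letter, and insert the marker without consuming that letter. The sketch below is in Python rather than Gambas, purely to illustrate the pattern; the function name is hypothetical, and [A-Z] covers ASCII capitals only, so accented capitals would need a wider character class. The pattern itself, \n(?=[A-Z]), should carry over to any PCRE-style regex engine.

```python
import re

# Marker text taken from the question.
MARKER = "[:: jump ::]"

def mark_capitalised_lines(text: str) -> str:
    """Insert MARKER after every newline that is followed by an ASCII capital."""
    # (?=[A-Z]) is a lookahead: it checks the next character without
    # consuming it, so the capital letter stays in the output untouched.
    return re.sub(r"\n(?=[A-Z])", lambda m: "\n" + MARKER, text)

if __name__ == "__main__":
    sample = "Title\nBody text.\nnext line"
    print(mark_capitalised_lines(sample))
```

An equivalent form without lookahead captures the letter and puts it back: replace \n([A-Z]) with \n[:: jump ::] followed by the first captured group.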

Note: I use this Gambas code
{xpdf[i].GetText(0, 0, xpdf[i].W, xpdf[i].H)}
to extract the page text from a PDF. My intention is to produce a very
basic HTML document. If someone has any related ideas, please comment.
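
As a rough illustration of how the two heuristics in this message (a line ending in a dot, or a following line starting with a capital) could feed a very basic HTML output, here is a minimal Python sketch. The function name is hypothetical, the input is assumed to be the plain page text already extracted, and [A-Z] again covers ASCII capitals only.

```python
import html
import re

def to_basic_html(page_text: str) -> str:
    """Split extracted page text into <p> blocks using two simple heuristics."""
    # Break where a newline follows a dot, or precedes an ASCII capital.
    parts = re.split(r"(?<=\.)\n|\n(?=[A-Z])", page_text)
    paragraphs = []
    for part in parts:
        # Remaining newlines are treated as soft wraps and become spaces.
        joined = " ".join(part.split())
        if joined:
            # Escape &, < and > so the text is safe inside HTML.
            paragraphs.append("<p>" + html.escape(joined) + "</p>")
    return "\n".join(paragraphs)

if __name__ == "__main__":
    print(to_basic_html("Title\nFirst line of\ntext & more.\nsecond para"))
```

Titles come out as ordinary paragraphs here; telling them apart from body text would need an extra rule, for example treating a short block with no final dot as a heading.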

Regards.

Martin.


