[Gambas-user] Text paragraph format
Martin
mbelmonte at belmotek.net
Mon Apr 19 18:09:12 CEST 2021
Hello everybody.
Text extracted from a PDF does not keep its paragraph formatting, but a
few simple heuristics can improve it.
For example, if a line ends in ".\n" (dot + newline), we are almost
certainly at the end of a paragraph.
There are also cases where the line is a title and does not end with "."
(dot). Luckily, the next line then begins with a capital letter.
Now the question:
What regular expression, or other method, replaces \n# with
\n[:: jump ::]#, where # is any uppercase letter?
Replace "\n#" > "\n[:: jump ::]#"
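One possible way to express this replacement is a lookahead: match the
newline only when an uppercase letter follows, and leave the letter itself
untouched. The sketch below is in Python purely so it can be run as-is.
The function name mark_breaks and the constant MARKER are made up for
illustration. The pattern is PCRE-style and will probably carry over to
other PCRE-based regex engines, but that is an assumption and has not
been checked here.

```python
import re

# Placeholder marker taken from the question; any string would do.
MARKER = "[:: jump ::]"


def mark_breaks(text: str) -> str:
    """Insert MARKER after every newline that precedes an uppercase letter.

    (?=[A-Z]) is a lookahead: the letter is tested but not consumed,
    so it stays in the output unchanged.
    Assumption: only ASCII capitals A-Z count as "uppercase" here.
    """
    return re.sub(r"\n(?=[A-Z])", "\n" + MARKER, text)


if __name__ == "__main__":
    sample = "A title without a dot\nFirst sentence of the text\ncontinues here."
    print(mark_breaks(sample))
```

Note that [A-Z] does not cover accented capitals such as "Á" or "Ñ"; text
in languages that use them would need a wider character class.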
Note: I use this Gambas code {xpdf[i].GetText(0, 0, xpdf[i].W,
xpdf[i].H)} to extract the page text from a PDF. My intention is to
produce a very basic HTML document. If anyone has any related ideas,
please comment.
Regards.
Martin.