[Gambas-user] Help needed from regexp gurus
Fernando Cabral
fernandojosecabral at ...626...
Mon Jun 19 21:19:41 CEST 2017
This is mostly to thank Tobi and Jussi for their help in solving some
issues that were making me unhappy.
With three lines of code I have solved what used to take me twenty or so.
Better yet, execution time dropped from 2 min 30 s to 1.5 seconds. And the
code is much more transparent. The three lines below are the heart and
brain of the program:
MatchedWords = Split(RawText, " \"'`[]{}+-_:#$%&.!?:(),;-\n", "", True)
MatchedSentences = Split(RegExp.Replace(RawText, "([.])|([!])|([?])|(;\n)|([:]\n)", "&1&2&3&4&5\x00", RegExp.UTF8), "\x00", "", True)
MatchedParagraphs = Split(RawText, "\n", "", True)
These three lines take an entire text file (read into the variable
RawText) and split it into words, sentences and paragraphs. They take ONE
SECOND to process a 150-page long text file with 414,961 bytes, tallying
69,196 words, 4,626 sentences and 2,409 paragraphs.
I am impressed!
In this last (and fast) version I have depended very little on RegExp, but
I have still used it to do some massaging of the original text. The line
"MatchedSentences = ..." above shows an example. The characters ".", "?"
and "!", and the strings ";\n" and ":\n", signal the end of a sentence.
Nevertheless, I cannot use them as separators for Split(), because Split()
drops separators from its result and I need them later. So I used
RegExp.Replace() to insert a \x00 after each of them, and then I used \x00
as the only sentence separator. This preserved the punctuation marks I
needed at the end of each sentence.
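
A minimal sketch of the same idea on a short string, for anyone who wants
to try it in isolation. It assumes only the Split() and RegExp.Replace()
calls already shown above; the sample text and the variable names are made
up for illustration, and the pattern is simplified to the three
single-character marks (the ";\n" and ":\n" cases are left out).

```gambas
#!/usr/bin/gbs3

Use "gb.pcre"

Public Sub Main()

  ' Hypothetical sample text, not taken from the real input file
  Dim Sample As String = "Is it fast? Yes. Very fast!"
  Dim Marked As String
  Dim Piece As String

  ' Plain Split() on the punctuation: the marks act as separators,
  ' so they do not appear in the resulting pieces.
  For Each Piece In Split(Sample, ".!?", "", True)
    Print "plain:    "; Piece
  Next

  ' Sentinel approach: append \x00 after each mark first,
  ' then split on \x00 only, so each piece keeps its mark.
  Marked = RegExp.Replace(Sample, "([.!?])", "&1\x00", RegExp.UTF8)
  For Each Piece In Split(Marked, "\x00", "", True)
    Print "sentinel: "; Piece
  Next

End
```

The pieces after the first one still begin with the space that followed
the punctuation mark, so a Trim() pass may be wanted afterwards.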
After running those three lines I still need to do some additional
processing with the resulting arrays, but that only consumes another half a
second for the same 150-page long document.
Now I am happy and I feel stimulated to complete the code and do some
polishing.
Thank you, Tobi and Jussi. You have helped a lot.
2017-06-17 18:06 GMT-03:00 Tobias Boege <taboege at ...626...>:
> On Sat, 17 Jun 2017, Fernando Cabral wrote:
> > Still beating my head against the wall due to my lack of knowledge about
> > the PCRE methods and properties... Because of this, I have progressed not
> > only very slowly but also -- I feel -- in a very inelegant way. So perhaps
> > you guys who are more acquainted with PCRE might be able to give me a hint
> > toward a better solution.
> >
> > I want to search a long string that can contain a sentence, a paragraph or
> > even a full text. I want to find and isolate every word it contains. A word
> > is defined as any sequence of alphabetic characters followed by a
> > non-alphabetic character.
> >
>
> The mathematician in me can't resist pointing this out: you hopefully wanted
> to define "word in a string" as "a *longest* sequence of alphabetic characters
> followed by a non-alphabetic character (or the end of the string)". Using your
> definition above, the words in "abc:" would be "c", "bc" and "abc", whereas
> you probably only wanted "abc" (the longest of those).
>
> > The sample code below does work, but I don't feel it is as elegant and as
> > fast as it could and should be. Especially the way I am traversing the
> > string from the beginning to the end. It looks awkward and slow. There must
> > be a more efficient way, like working only with offsets and lengths instead
> > of copying the string again and again.
> >
>
> You think worse of String.Mid() than it deserves, IMHO. Gambas strings
> are triples of a pointer to some data, a start index and a length, and
> the built-in string functions take care not to copy a string when it's
> not necessary. The plain Mid$() function (dealing with ASCII strings only)
> is implemented as a constant-time operation which simply takes your input
> string and adjusts the start index and length to give you the requested
> portion of the string. The string doesn't even have to be read, much less
> copied, to do this.
>
> Now, the String.Mid() function is somewhat more complicated, because
> UTF-8 strings have variable-width characters, which makes it difficult
> to map byte indices to character positions. To implement String.Mid(),
> your string has to be read, but, again, not copied.
>
> Extracting a part of a string is a non-destructive operation in Gambas
> and no copying takes place. (Concatenating strings, on the other hand,
> will copy.) So, there is some reading overhead (if you need UTF-8 strings),
> but it's smaller than you probably thought.
>
> > Dim Alphabetics As String = "abc...zyzABC...ZYZ"
> > Dim re As New RegExp
> > Dim matches As New String[]
> > Dim RawText As String
> > Dim i As Integer
> >
> > re.Compile("([" & Alphabetics & "]+?)([^" & Alphabetics & "]+)", RegExp.UTF8)
> > RawText = "abc12345def ghi jklm mno p1"
> >
> > Do While RawText
> >   re.Exec(RawText)
> >   ' Stop when nothing matches, instead of looping forever
> >   If re.Offset = -1 Then Break
> >   matches.Add(re[1].Text)
> >   ' Skip the matched text (assumes the match starts at position 1)
> >   RawText = String.Mid(RawText, String.Len(re.Text) + 1)
> > Loop
> >
> > For i = 0 To matches.Count - 1
> >   Print matches[i]
> > Next
> >
> >
> > The code above correctly finds "abc, def, ghi, jklm, mno, p". But the tricks
> > I have used are cumbersome (like advancing with String.Mid() and resorting
> > to re[1].Text and re.Text).
> >
>
> Well, I think you can't use PCRE alone to solve your problem, if you want
> to capture a variable number of words in your submatches. I did a bit of
> reading and from what I gather [1][2] capturing group numbers are assigned
> based on the verbatim regular expression, i.e. the number of submatches
> you can receive is limited by the number of "(...)" constructs in your
> expression; and the (otherwise very nifty) recursion operator (?R) does
> not give you an unlimited number of capturing groups, sadly.
>
> Anyway, I think by changing your regular expression, you can let PCRE take
> care of the string advancement, like so:
>
> #!/usr/bin/gbs3
>
> Use "gb.pcre"
>
> Public Sub Main()
>   Dim r As New RegExp
>   Dim s As String
>
>   r.Compile("([[:alpha:]]+)[[:^alpha:]]+(.*$)", RegExp.UTF8)
>   s = "abc12345def ghi jklm mno p1"
>   Print "Subject:";; s
>   Do
>     r.Exec(s)
>     If r.Offset = -1 Then Break
>     Print " ->";; r[1].Text
>     s = r[2].Text
>   Loop While s
> End
>
> Output:
>
> Subject: abc12345def ghi jklm mno p1
> -> abc
> -> def
> -> ghi
> -> jklm
> -> mno
> -> p
>
> But, I think, this is less efficient than using String.Mid(). The trailing
> group (.*$) _may_ make the PCRE library read the entire subject every time.
> And I believe gb.pcre will copy your submatch string when returning it.
> If you care deeply about this, you'll have to trace the code in gb.pcre
> and main/gbx (the interpreter) to see what copies strings and what doesn't.
>
> Regards,
> Tobi
>
> [1] http://www.regular-expressions.info/recursecapture.html (Capturing
> Groups Inside Recursion or Subroutine Calls)
> [2] http://www.rexegg.com/regex-recursion.html (Groups Contents and
> Numbering in Recursive Expressions)
>
> --
> "There's an old saying: Don't change anything... ever!" -- Mr. Monk
>
--
Fernando Cabral
Blog: http://fernandocabral.org
Twitter: http://twitter.com/fjcabral
e-mail: fernandojosecabral at ...626...
Facebook: f at ...3654...
Telegram: +55 (37) 99988-8868
Wickr ID: fernandocabral
WhatsApp: +55 (37) 99988-8868
Skype: fernandojosecabral
Landline: +55 (37) 3521-2183
Mobile: +55 (37) 99988-8868
As long as there is a single person in the world without a home or without
food, no politician or scientist can boast of anything.