[Gambas-user] Help needed from regexp gurus

Sun Jun 18 01:34:15 CEST 2017

Thank you, Tobi, for taking the time to comment on my issues. I will ponder
the following.

2017-06-17 18:06 GMT-03:00 Tobias Boege <taboege at ...626...>:

> On Sat, 17 Jun 2017, Fernando Cabral wrote:
> >> Still beating my head against the wall due to my lack of knowledge about
> >> the PCRE methods and properties... Because of this, I have progressed not
> >> only very slowly but also -- I feel -- in a very inelegant way. So perhaps
> >> you guys who are more acquainted with PCRE might be able to give me a hint
> >> toward a better solution.
> >>
> >> I want to search a long string that can contain a sentence, a paragraph or
> >> even a full text. I want to find and isolate every word it contains. A word
> >> is defined as any sequence of alphabetic characters followed by a
> >> non-alphabetic character.
>
> >The Mathematician in me can't resist pointing this out: you hopefully wanted
> >to define "word in a string" as "a *longest* sequence of alphabetic characters
> >followed by a non-alphabetic character (or the end of the string)". Using your
> >definition above, the words in "abc:" would be "c", "bc" and "abc", whereas
> >you probably only wanted "abc" (the longest of those).
>
Right, the longest sequence. But I can't see why my definition is not
equivalent to yours, even though it is simpler. "A word is defined as any
sequence of alphabetic characters followed by a non-alphabetic character"
has to be the longest, no matter what. See, in "abc:", "a" and "ab" are not
followed by a non-alphabetic character, so you have to keep advancing. "abc"
is followed by a non-alphabetic character, so it complies with the
definition.

So I think we can do without stating it has to be the longest sequence. If
I am wrong, I still can't see why.
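
A literal reading of that definition can be checked by brute force. The
script below is only a sketch: the variable names are made up for
illustration, and it uses the byte-oriented Len(), Mid$() and IsLetter(), so
it assumes plain ASCII input. It prints every substring of "abc:" that
consists only of letters and is immediately followed by a non-letter.

```gambas
#!/usr/bin/gbs3

Public Sub Main()
  Dim s As String = "abc:"
  Dim iStart As Integer
  Dim iLen As Integer
  Dim sCand As String

  ' Try every start position and every length that still leaves
  ' at least one character after the candidate.
  For iStart = 1 To Len(s)
    For iLen = 1 To Len(s) - iStart
      sCand = Mid$(s, iStart, iLen)
      ' Literal definition: only letters, and the next character is not a letter.
      If IsLetter(sCand) And Not IsLetter(Mid$(s, iStart + iLen, 1)) Then
        Print sCand
      Endif
    Next
  Next
End
```

For "abc:" the candidates are "abc", "bc" and "c", the three Tobi listed.
"bc" and "c" are also sequences of letters followed by a non-letter; they
just start in the middle of the word. Scanning from left to right and
consuming as many letters as possible is what singles out "abc".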

> >> The sample code below does work, but I don't feel it is as elegant and as
> >> fast as it could and should be.  Especially the way I am traversing the
> >> string from the beginning to the end. It looks awkward and slow. There must
> >> be a more efficient way, like working only with offsets and lengths instead
> >> of copying the string again and again.
>
> >You think worse of String.Mid() than it deserves, IMHO. Gambas strings
> >are triples of a pointer to some data, a start index and a length, and
> >the built-in string functions take care not to copy a string when it's
> >not necessary. The plain Mid$() function (dealing with ASCII strings only)
> >is implemented as a constant-time operation which simply takes your input
> >string and adjusts the start index and length to give you the requested
> >portion of the string. The string doesn't even have to be read, much less
> >copied, to do this.
>
> >Now, the String.Mid() function is somewhat more complicated, because
> >UTF-8 strings have variable-width characters, which makes it difficult
> >to map byte indices to character positions. To implement String.Mid(),
> >your string has to be read, but, again, not copied.
>
Right. Since I am working with Portuguese, it has to be UTF-8. So I can't
avoid using String.Mid().

But I still think the string has to be copied, because I am doing

str = String.Mid(str, HowMany)

In this case I would guess it has to be copied because the original
contents are shrunk, which happens again and again, until nothing is left
to be scanned. I understand Gambas does not do garbage collection the way
old BASIC used to, but still, I suppose it will eventually have to recover
unused memory.

> > Extracting a part of a string is a non-destructive operation in Gambas
> > and no copying takes place. (Concatenating strings, on the other hand,
> > will copy.) So, there is some reading overhead (if you need UTF-8 strings),
> > but it's smaller than you probably thought.

As per above, in this case it is not only extracting, but overwriting the
contents itself.

> > Dim Alphabetics As String = "abc...zyzABC...ZYZ"
> > Dim re As New RegExp
> > Dim matches As New String[]
> > Dim RawText As String
> > Dim i As Integer
> >
> > ' Group 1: a word; group 2: the non-alphabetic run that follows it.
> > re.Compile("([" & Alphabetics & "]+?)([^" & Alphabetics & "]+)", RegExp.UTF8)
> > RawText = "abc12345def ghi jklm mno p1"
> >
> > Do While RawText
> >      re.Exec(RawText)
> >      If re.Offset = -1 Then Break  ' no match left: avoid looping forever
> >      matches.Add(re[1].Text)
> >      ' Drop the whole match (word plus separator) from the front.
> >      RawText = String.Mid(RawText, String.Len(re.Text) + 1)
> > Loop
> >
> > For i = 0 To matches.Count - 1
> >   Print matches[i]
> > Next
> >
> >
> > Above code correctly finds "abc, def, ghi, jklm, mno, p". But the tricks I
> > have used are cumbersome (like advancing with String.Mid() and resorting to
> > re[1].Text and re.Text).
> >
>
> >Well, I think you can't use PCRE alone to solve your problem, if you want
> >to capture a variable number of words in your submatches. I did a bit of
> >reading and from what I gather [1][2] capturing group numbers are assigned
> >based on the verbatim regular expression, i.e. the number of submatches
> >you can receive is limited by the number of "(...)" constructs in your
> >expression; and the (otherwise very nifty) recursion operator (?R) does
> >not give you an unlimited number of capturing groups, sadly.
>

What I need is to grab one word at a time. The reason I am using two
submatches, "([:Alpha:])([:^Alpha:])", is that I don't care about the
non-alphabetic part. I can ignore the second submatch, but it still helps
me skip to the next word (since Len(re.Text) comprises the length of both
submatches).
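
The point about the two submatches can be seen in a single match. The
sketch below uses only calls that already appear in this thread (Compile,
Exec, Text, the [n] submatch accessor, String.Mid, String.Len). The POSIX
classes stand in for the Alphabetics string and the subject is a shortened
version of the sample, so treat it as an illustration rather than tested
code.

```gambas
#!/usr/bin/gbs3

Use "gb.pcre"

Public Sub Main()
  Dim re As New RegExp
  Dim s As String = "abc12345def ghi"

  ' Group 1 is the word, group 2 the run of non-letters after it.
  re.Compile("([[:alpha:]]+)([[:^alpha:]]+)", RegExp.UTF8)
  re.Exec(s)

  Print "word:";; re[1].Text        ' the part that is kept
  Print "separator:";; re[2].Text   ' the part that is ignored
  Print "whole match:";; re.Text    ' both groups together

  ' Skip the whole match (word plus separator) to reach the next word.
  ' This assumes the match starts at the first character of s.
  s = String.Mid(s, String.Len(re.Text) + 1)
  Print "rest:";; s
End
```

With this subject the first match should cover "abc12345", so the string
handed to the next iteration is "def ghi".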

>
> > Anyway, I think by changing your regular expression, you can let PCRE take
> > care of the string advancement, like so:
>

For the time being, I will use the loop the way you proposed below. It
seems cleaner than my solution. As for performance, I'll check later which
one is faster.

Thanks a lot

- fernando

>
> #!/usr/bin/gbs3
>
> Use "gb.pcre"
>
> Public Sub Main()
>   Dim r As New RegExp
>   Dim s As String
>
>   ' Group 1: the next word; group 2: everything after the separator run.
>   r.Compile("([[:alpha:]]+)[[:^alpha:]]+(.*$)", RegExp.UTF8)
>   s = "abc12345def ghi jklm mno p1"
>   Print "Subject:";; s
>   Do
>     r.Exec(s)
>     If r.Offset = -1 Then Break  ' no match left
>     Print " ->";; r[1].Text
>     s = r[2].Text                ' continue with the unmatched remainder
>   Loop While s
> End
>
> Output:
>
>   Subject: abc12345def ghi jklm mno p1
>    -> abc
>    -> def
>    -> ghi
>    -> jklm
>    -> mno
>    -> p
>
> But, I think, this is less efficient than using String.Mid(). The trailing
> group (.*$) _may_ make the PCRE library read the entire subject every time.
> And I believe gb.pcre will copy your submatch string when returning it.
> If you care deeply about this, you'll have to trace the code in gb.pcre
> and main/gbx (the interpreter) to see what copies strings and what doesn't.
>
> Regards,
> Tobi
>
> [1] http://www.regular-expressions.info/recursecapture.html (Capturing
> Groups Inside Recursion or Subroutine Calls)
> [2] http://www.rexegg.com/regex-recursion.html (Groups Contents and
> Numbering in Recursive Expressions)
>
> --
> "There's an old saying: Don't change anything... ever!" -- Mr. Monk
>

-- 
Fernando Cabral
Blog: http://fernandocabral.org
Twitter: http://twitter.com/fjcabral
e-mail: fernandojosecabral at ...626...
Facebook: f at ...3654...
Telegram: +55 (37) 99988-8868
Wickr ID: fernandocabral
WhatsApp: +55 (37) 99988-8868
Skype:  fernandojosecabral
Landline: +55 (37) 3521-2183
Mobile: +55 (37) 99988-8868

As long as there is a single person in the world without a home or without
food, no politician or scientist can boast of anything.