[Gambas-user] Help needed from regexp gurus

Sat Jun 17 23:06:22 CEST 2017

On Sat, 17 Jun 2017, Fernando Cabral wrote:
> Still beating my head against the wall due to my lack of knowledge about
> the PCRE methods and properties... Because of this, I have progressed not
> only very slowly but also -- I feel -- in a very inelegant way. So perhaps
> you guys who are more acquainted with PCRE might be able to point me to a
> better solution.
> 
> I want to search a long string that can contain a sentence, a paragraph or
> even a full text. I want to find and isolate every word it contains. A word
> is defined as any sequence of alphabetic characters followed by a
> non-alphabetic character.
> 

The mathematician in me can't resist pointing this out: you presumably wanted
to define "word in a string" as "a *longest* sequence of alphabetic characters
followed by a non-alphabetic character (or the end of the string)". Using your
definition above, the words in "abc:" would be "c", "bc" and "abc", whereas
you probably only wanted "abc" (the longest of those).

> The sample code below does work, but I don't feel it is as elegant and as
> fast as it could and should be.  Especially the way I am traversing the
> string from the beginning to the end. It looks awkward and slow. There must
> be a more efficient way, like working only with offsets and lengths instead
> of copying the string again and again.
> 

You think worse of String.Mid() than it deserves, IMHO. Gambas strings
are triples of a pointer to some data, a start index and a length, and
the built-in string functions take care not to copy a string when it's
not necessary. The plain Mid$() function (dealing with ASCII strings only)
is implemented as a constant-time operation which simply takes your input
string and adjusts the start index and length to give you the requested
portion of the string. The string doesn't even have to be read, much less
copied, to do this.

Now, the String.Mid() function is somewhat more complicated, because
UTF-8 strings have variable-width characters, which makes it difficult
to map byte indices to character positions. To implement String.Mid(),
your string has to be read, but, again, not copied.

Extracting a part of a string is a non-destructive operation in Gambas
and no copying takes place. (Concatenating strings, on the other hand,
will copy.) So, there is some reading overhead (if you need UTF-8 strings),
but it's smaller than you probably thought.
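
To make the byte/character distinction concrete, here is a small sketch.
The sample string is my own, and it assumes the script file is saved as
UTF-8, so that "é" occupies two bytes:

  #!/usr/bin/gbs3

  Public Sub Main()
    Dim s As String = "héllo"

    ' Len() counts bytes, String.Len() counts UTF-8 characters.
    Print Len(s);; String.Len(s)
    ' Mid$() slices by byte index: bytes 2 and 3 are the two bytes of "é".
    Print Mid$(s, 2, 2)
    ' String.Mid() slices by character index, so it has to scan the
    ' string to find where character 2 starts.
    Print String.Mid(s, 2, 2)
  End

Under that assumption this should print "6 5", then "é", then "él".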

> Dim Alphabetics As String = "abc...zyzABC...ZYZ"
> Dim re As New RegExp
> Dim matches As New String[]
> Dim RawText As String
> Dim i As Integer
> 
> re.Compile("([" & Alphabetics & "]+?)([^" & Alphabetics & "]+)",
> RegExp.utf8)
> RawText = "abc12345def ghi jklm mno p1"
> 
> Do While RawText
>      re.Exec(RawText)
>      matches.add(re[1].text)
>      RawText = String.Mid(RawText, String.Len(re.text) + 1)
> Loop
> 
> For i = 0 To matches.Count - 1
>   Print matches[i]
> Next
> 
> 
> The above code correctly finds "abc, def, ghi, jklm, mno, p". But the tricks
> I have used are cumbersome (like advancing with string.mid() and resorting
> to re[1].text and re.text).
> 

Well, I think you can't use PCRE alone to solve your problem, if you want
to capture a variable number of words in your submatches. I did a bit of
reading and from what I gather [1][2] capturing group numbers are assigned
based on the verbatim regular expression, i.e. the number of submatches
you can receive is limited by the number of "(...)" constructs in your
expression; and the (otherwise very nifty) recursion operator (?R) does
not give you an unlimited number of capturing groups, sadly.

Anyway, I think by changing your regular expression, you can let PCRE take
care of the string advancement, like so:

  #!/usr/bin/gbs3

  Use "gb.pcre"

  Public Sub Main()
    Dim r As New RegExp
    Dim s As String

    r.Compile("([[:alpha:]]+)[[:^alpha:]]+(.*$)", RegExp.UTF8)
    s = "abc12345def ghi jklm mno p1"
    Print "Subject:";; s
    Do
      r.Exec(s)
      If r.Offset = -1 Then Break
      Print " ->";; r[1].Text
      s = r[2].Text
    Loop While s
  End

Output:

  Subject: abc12345def ghi jklm mno p1
   -> abc
   -> def
   -> ghi
   -> jklm
   -> mno
   -> p

But, I think, this is less efficient than using String.Mid(). The trailing
group (.*$) _may_ make the PCRE library read the entire subject every time.
And I believe gb.pcre will copy your submatch string when returning it.
If you care deeply about this, you'll have to trace the code in gb.pcre
and main/gbx (the interpreter) to see what copies strings and what doesn't.
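
If you'd rather work with offsets and lengths, as you suggested, a possible
middle ground is to match only the word itself and advance a byte position
by hand. This is an untested sketch: iPos is a name I made up, and I am
assuming that r.Offset is the 0-based byte offset of the match within the
subject passed to Exec(). Because it uses byte positions, it needs only the
plain Mid$() and Len(), not their String.* counterparts:

  #!/usr/bin/gbs3

  Use "gb.pcre"

  Public Sub Main()
    Dim r As New RegExp
    Dim s As String = "abc12345def ghi jklm mno p1"
    Dim iPos As Integer = 1   ' 1-based byte position where the next search starts

    ' Match only a (greedy, hence longest) run of alphabetic characters.
    r.Compile("[[:alpha:]]+", RegExp.UTF8)
    Print "Subject:";; s
    Do
      r.Exec(Mid$(s, iPos))
      If r.Offset = -1 Then Break
      Print " ->";; r.Text
      ' Skip past the match: its offset in the substring plus its byte length.
      iPos += r.Offset + Len(r.Text)
    Loop While iPos <= Len(s)
  End

If the assumption about r.Offset holds, this should list the same six words
as above. Note that, unlike your pattern, it also accepts a word that runs
up to the end of the string.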

Regards,
Tobi

[1] http://www.regular-expressions.info/recursecapture.html (Capturing Groups Inside Recursion or Subroutine Calls)
[2] http://www.rexegg.com/regex-recursion.html (Groups Contents and Numbering in Recursive Expressions)

-- 
"There's an old saying: Don't change anything... ever!" -- Mr. Monk