[Gambas-user] Help needed from regexp gurus

Jussi Lahtinen jussi.lahtinen at ...626...
Sun Jun 18 02:21:29 CEST 2017


I think I would do something like:

  Dim ii As Integer
  Dim sStr As String = "abc defg hijkl"
  Dim sWords As String[]

  sWords = Split(sStr, " ")

  For ii = 0 To sWords.Max
    Print sWords[ii]
  Next
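
If the offsets and lengths themselves are wanted, as in the pseudo-code
quoted below, a minimal sketch without regular expressions could look like
this. It assumes plain ASCII text and 1-based positions (which is what
Mid$() expects); the names aPos, aLen and iStart are only illustrative.

```gambas
#!/usr/bin/gbs3

Public Sub Main()
  Dim sStr As String = "abc12345def ghi jklm mno p1"
  Dim aPos As New Integer[]   ' 1-based start position of each word
  Dim aLen As New Integer[]   ' length of each word
  Dim iPos As Integer
  Dim iStart As Integer       ' 0 means "currently not inside a word"
  Dim ii As Integer

  ' Scan one position past the end so that a trailing word is closed too.
  For iPos = 1 To Len(sStr) + 1
    If iPos <= Len(sStr) And If IsLetter(Mid$(sStr, iPos, 1)) Then
      If iStart = 0 Then
        iStart = iPos
      Endif
    Else If iStart > 0 Then
      aPos.Add(iStart)
      aLen.Add(iPos - iStart)
      iStart = 0
    Endif
  Next

  ' Retrieve each word from its recorded position and length.
  For ii = 0 To aPos.Max
    Print Mid$(sStr, aPos[ii], aLen[ii])
  Next
End
```

For UTF-8 text with non-ASCII letters the byte-wise test above would not be
enough, so treat it as an illustration of the idea only.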




Jussi

On Sun, Jun 18, 2017 at 2:57 AM, Fernando Cabral <
fernandojosecabral at ...626...> wrote:

> Tobi
>
> One more thing about the way I wish it could work (I remember having done
> this in C perhaps 30 years ago). The pseudo-code below is pretty
> schematic, but I think it will clarify the issue.
>
> Let p and l be arrays of integers and s be the string "abc defg hijkl"
>
> So, after traversing the string we would have the following result:
> p[0] = offset of "a" (0)
> l[0] = length of "abc" (3)
> p[1] = offset of "d" (4)
> l[1] = length of "defg" (4)
> p[2] = offset of "h" (9)
> l[2] = length of "hijkl" (5).
>
> After this, each word could be retrieved in the following manner:
>
> for i = 0 to 2
>     print mid(s, p[i] + 1, l[i])  ' Gambas Mid() counts positions from 1
> next
>
> I think this would be the most efficient way to do it. But I can't find how
> to do it in Gambas using Regex.
>
> Regards
>
> - fernando
>
>
> 2017-06-17 18:06 GMT-03:00 Tobias Boege <taboege at ...626...>:
>
> > On Sat, 17 Jun 2017, Fernando Cabral wrote:
> > > Still beating my head against the wall due to my lack of knowledge
> > > about the PCRE methods and properties... Because of this, I have
> > > progressed not only very slowly but also -- I feel -- in a very
> > > inelegant way. So perhaps you guys who are more acquainted with PCRE
> > > might be able to point me to a better solution.
> > >
> > > I want to search a long string that can contain a sentence, a
> > > paragraph or even a full text. I want to find and isolate every word
> > > it contains. A word is defined as any sequence of alphabetic
> > > characters followed by a non-alphabetic character.
> > >
> >
> > The mathematician in me can't resist pointing this out: you hopefully
> > wanted to define "word in a string" as "a *longest* sequence of
> > alphabetic characters followed by a non-alphabetic character (or the
> > end of the string)". Using your definition above, the words in "abc:"
> > would be "c", "bc" and "abc", whereas you probably only wanted "abc"
> > (the longest of those).
> >
> > > The sample code below does work, but I don't feel it is as elegant
> > > and as fast as it could and should be. Especially the way I am
> > > traversing the string from the beginning to the end. It looks awkward
> > > and slow. There must be a more efficient way, like working only with
> > > offsets and lengths instead of copying the string again and again.
> > >
> >
> > You think worse of String.Mid() than it deserves, IMHO. Gambas strings
> > are triples of a pointer to some data, a start index and a length, and
> > the built-in string functions take care not to copy a string when it's
> > not necessary. The plain Mid$() function (dealing with ASCII strings
> > only) is implemented as a constant-time operation which simply takes
> > your input string and adjusts the start index and length to give you
> > the requested portion of the string. The string doesn't even have to be
> > read, much less copied, to do this.
> >
> > Now, the String.Mid() function is somewhat more complicated, because
> > UTF-8 strings have variable-width characters, which makes it difficult
> > to map byte indices to character positions. To implement String.Mid(),
> > your string has to be read, but, again, not copied.
> >
> > Extracting a part of a string is a non-destructive operation in Gambas
> > and no copying takes place. (Concatenating strings, on the other hand,
> > will copy.) So, there is some reading overhead (if you need UTF-8
> > strings), but it's smaller than you probably thought.
> >
> > > Dim Alphabetics As String = "abc...zyzABC...ZYZ"
> > > Dim re As New RegExp
> > > Dim matches As New String[]
> > > Dim RawText As String
> > > Dim i As Integer
> > >
> > > ' Group 1: a run of alphabetic characters; group 2: the separators after it
> > > re.Compile("([" & Alphabetics & "]+?)([^" & Alphabetics & "]+)", RegExp.UTF8)
> > > RawText = "abc12345def ghi jklm mno p1"
> > >
> > > Do While RawText
> > >      re.Exec(RawText)
> > >      If re.Offset = -1 Then Break ' no further match, stop
> > >      matches.Add(re[1].Text)
> > >      ' Drop the matched word and its separators, then search the rest
> > >      RawText = String.Mid(RawText, String.Len(re.Text) + 1)
> > > Loop
> > >
> > > For i = 0 To matches.Count - 1
> > >   Print matches[i]
> > > Next
> > >
> > >
> > > The above code correctly finds "abc, def, ghi, jklm, mno, p". But the
> > > tricks I have used are cumbersome (like advancing with string.mid()
> > > and resorting to re[1].text and re.text).
> > >
> >
> > Well, I think you can't use PCRE alone to solve your problem, if you want
> > to capture a variable number of words in your submatches. I did a bit of
> > reading and from what I gather [1][2] capturing group numbers are
> > assigned based on the verbatim regular expression, i.e. the number of
> > submatches you can receive is limited by the number of "(...)" constructs
> > in your expression; and the (otherwise very nifty) recursion operator
> > (?R) does not give you an unlimited number of capturing groups, sadly.
> >
> > Anyway, I think by changing your regular expression, you can let PCRE
> > take care of the string advancement, like so:
> >
> >   #!/usr/bin/gbs3
> >
> >   Use "gb.pcre"
> >
> >   Public Sub Main()
> >     Dim r As New RegExp
> >     Dim s As String
> >
> >     ' Group 1: the next word; group 2: everything after its separators
> >     r.Compile("([[:alpha:]]+)[[:^alpha:]]+(.*$)", RegExp.UTF8)
> >     s = "abc12345def ghi jklm mno p1"
> >     Print "Subject:";; s
> >     Do
> >       r.Exec(s)
> >       If r.Offset = -1 Then Break
> >       Print " ->";; r[1].Text
> >       ' Continue with the remainder of the subject
> >       s = r[2].Text
> >     Loop While s
> >   End
> >
> > Output:
> >
> >   Subject: abc12345def ghi jklm mno p1
> >    -> abc
> >    -> def
> >    -> ghi
> >    -> jklm
> >    -> mno
> >    -> p
> >
> > But, I think, this is less efficient than using String.Mid(). The
> > trailing group (.*$) _may_ make the PCRE library read the entire subject
> > every time. And I believe gb.pcre will copy your submatch string when
> > returning it. If you care deeply about this, you'll have to trace the
> > code in gb.pcre and main/gbx (the interpreter) to see what copies
> > strings and what doesn't.
> >
> > Regards,
> > Tobi
> >
> > [1] http://www.regular-expressions.info/recursecapture.html
> >     (Capturing Groups Inside Recursion or Subroutine Calls)
> > [2] http://www.rexegg.com/regex-recursion.html
> >     (Groups Contents and Numbering in Recursive Expressions)
> >
> > --
> > "There's an old saying: Don't change anything... ever!" -- Mr. Monk
> >
> > ------------------------------------------------------------
> > ------------------
> > Check out the vibrant tech community on one of the world's most
> > engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> > _______________________________________________
> > Gambas-user mailing list
> > Gambas-user at lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/gambas-user
> >
>
>
>
> --
> Fernando Cabral
> Blog: http://fernandocabral.org
> Twitter: http://twitter.com/fjcabral
> e-mail: fernandojosecabral at ...626...
> Facebook: f at ...3654...
> Telegram: +55 (37) 99988-8868
> Wickr ID: fernandocabral
> WhatsApp: +55 (37) 99988-8868
> Skype:  fernandojosecabral
> Landline: +55 (37) 3521-2183
> Mobile: +55 (37) 99988-8868
>
> As long as there is a single person in the world without a home or food,
> no politician or scientist will be able to boast of anything.


