[Gambas-user] Help needed from regexp gurus

Jussi Lahtinen jussi.lahtinen at ...626...
Sun Jun 18 05:32:22 CEST 2017


Oh, sorry... this way of course:

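  Dim ii As Integer
  Dim sStr As String = "abc. def!!!     ghi?   jkl:  (mno)"
  Dim sWords As String[]

  ' The fourth argument (IgnoreVoid) drops the empty strings between
  ' adjacent separators. Expand the separator list as you will.
  sWords = Split(sStr, " .!?:()", "", True)

  For ii = 0 To sWords.Max
    Print sWords[ii]
  Next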



Jussi
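
Editor's sketch, not part of the original message: the offset-and-length
bookkeeping Fernando asks for further down the thread could look roughly like
the script below. It uses only the gb.pcre calls that already appear in this
thread (Compile, Exec, Offset, Text). It is untested and assumes that
r.Offset is a 0-based byte offset into the subject, which is why it is
combined with the byte-based Len() and Mid$() rather than String.Len() and
String.Mid(). The names aPos, aLen and iNext are made up for the example.

```gambas
#!/usr/bin/gbs3

Use "gb.pcre"

Public Sub Main()

  Dim r As New RegExp
  Dim s As String = "abc. def!!!     ghi?   jkl:  (mno)"
  Dim aPos As New Integer[]   ' 0-based byte offset of each word (assumed)
  Dim aLen As New Integer[]   ' byte length of each word
  Dim iNext As Integer = 1    ' 1-based position where the next search starts
  Dim ii As Integer

  ' Match one run of letters; there is no trailing (.*$) group to capture.
  r.Compile("[[:alpha:]]+", RegExp.UTF8)

  While iNext <= Len(s)
    r.Exec(Mid$(s, iNext))
    If r.Offset = -1 Then Break          ' no further word
    aPos.Add(iNext - 1 + r.Offset)       ' offset relative to the whole string
    aLen.Add(Len(r.Text))
    iNext += r.Offset + Len(r.Text)      ' continue right after this word
  Wend

  ' Retrieve each word from its offset and length, as in the pseudo-code.
  For ii = 0 To aPos.Max
    Print aPos[ii];; aLen[ii];; Mid$(s, aPos[ii] + 1, aLen[ii])
  Next

End
```

Whether this is actually faster than Split() with IgnoreVoid would have to be
measured; the sketch only shows that the two arrays can be filled without
rebuilding the subject string by concatenation.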

On Sun, Jun 18, 2017 at 6:29 AM, Jussi Lahtinen <jussi.lahtinen at ...626...>
wrote:

> It's not a problem.
>
>   Dim ii As Integer
>   Dim sStr As String = "abc. def!!!     ghi?   jkl:  (mno)"
>   Dim sWords As String[]
>
>   sWords = Split(sStr, " .!?:()") ' Expand as you will.
>
>   ' Remove the empty strings left between adjacent separators.
>   ii = 0
>   While ii <= sWords.Max
>     If sWords[ii] = "" Then
>       sWords.Remove(ii)
>     Else
>       Inc ii
>     Endif
>   Wend
>
>   For ii = 0 To sWords.Max
>     Print sWords[ii]
>   Next
>
>
> Jussi
>
>
> On Sun, Jun 18, 2017 at 4:53 AM, Fernando Cabral <
> fernandojosecabral at ...626...> wrote:
>
>> Jussi, what you suggest will not work. You have presumed that the only
>> separator is a single space.
>> This is not the case. Between any two words you can have any number of
>> non-alphabetic characters.
>> It could be, for instance, "abc. def!!!     ghi?   jkl:  (mno)" and so
>> forth.
>> This means the definition of a word is "any sequence of alphabetic
>> characters followed by any sequence of non-alphabetic characters".
>>
>> That's why your suggestion does not apply.
>>
>> - fernando
>>
>> 2017-06-17 21:21 GMT-03:00 Jussi Lahtinen <jussi.lahtinen at ...626...>:
>>
>>> I think I would do something like:
>>>
>>>   Dim ii As Integer
>>>   Dim sStr As String = "abc defg hijkl"
>>>   Dim sWords As String[]
>>>
>>>   sWords = Split(sStr, " ")
>>>
>>>   For ii = 0 To 2
>>>    Print sWords[ii]
>>>   Next
>>>
>>>
>>>
>>>
>>> Jussi
>>>
>>> On Sun, Jun 18, 2017 at 2:57 AM, Fernando Cabral <
>>> fernandojosecabral at ...626...> wrote:
>>>
>>>> Tobi
>>>>
>>>> One more thing about the way I wish it could work (I remember having
>>>> done this in C perhaps 30 years ago). The pseudo-code below is pretty
>>>> schematic, but I think it will clarify the issue.
>>>>
>>>> Let p and l be arrays of integers and s be the string "abc defg hijkl"
>>>>
>>>> So, after traversing the string we would have the following result:
>>>> p[0] = offset of "a" (0)
>>>> l[0] = length of "abc" (3)
>>>> p[1] = offset of "d" (4)
>>>> l[1] = length of "defg" (4)
>>>> p[2] = offset of "h" (9)
>>>> l[2] = length of "hijkl" (5).
>>>>
>>>> After this, each word could be retrieved in the following manner:
>>>>
>>>> For i = 0 To 2
>>>>     Print Mid$(s, p[i] + 1, l[i])  ' Mid$() counts from 1, offsets from 0
>>>> Next
>>>>
>>>> I think this would be the most efficient way to do it. But I can't find
>>>> how to do it in Gambas using Regex.
>>>>
>>>> Regards
>>>>
>>>> - fernando
>>>>
>>>>
>>>> 2017-06-17 18:06 GMT-03:00 Tobias Boege <taboege at ...626...>:
>>>>
>>>> > On Sat, 17 Jun 2017, Fernando Cabral wrote:
>>>> > > Still beating my head against the wall due to my lack of knowledge
>>>> > > about the PCRE methods and properties... Because of this, I have
>>>> > > progressed not only very slowly but also -- I feel -- in a very
>>>> > > inelegant way. So perhaps you guys who are more acquainted with PCRE
>>>> > > might be able to point me to a better solution.
>>>> > >
>>>> > > I want to search a long string that can contain a sentence, a
>>>> > > paragraph or even a full text. I want to find and isolate every word
>>>> > > it contains. A word is defined as any sequence of alphabetic
>>>> > > characters followed by a non-alphabetic character.
>>>> > >
>>>> >
>>>> > The mathematician in me can't resist pointing this out: you hopefully
>>>> > wanted to define "word in a string" as "a *longest* sequence of
>>>> > alphabetic characters followed by a non-alphabetic character (or the
>>>> > end of the string)". Using your definition above, the words in "abc:"
>>>> > would be "c", "bc" and "abc", whereas you probably only wanted "abc"
>>>> > (the longest of those).
>>>> >
>>>> > > The sample code below does work, but I don't feel it is as elegant
>>>> > > and as fast as it could and should be. Especially the way I am
>>>> > > traversing the string from the beginning to the end. It looks awkward
>>>> > > and slow. There must be a more efficient way, like working only with
>>>> > > offsets and lengths instead of copying the string again and again.
>>>> > >
>>>> >
>>>> > You think worse of String.Mid() than it deserves, IMHO. Gambas strings
>>>> > are triples of a pointer to some data, a start index and a length, and
>>>> > the built-in string functions take care not to copy a string when it's
>>>> > not necessary. The plain Mid$() function (dealing with ASCII strings
>>>> > only) is implemented as a constant-time operation which simply takes
>>>> > your input string and adjusts the start index and length to give you
>>>> > the requested portion of the string. The string doesn't even have to
>>>> > be read, much less copied, to do this.
>>>> >
>>>> > Now, the String.Mid() function is somewhat more complicated, because
>>>> > UTF-8 strings have variable-width characters, which makes it difficult
>>>> > to map byte indices to character positions. To implement String.Mid(),
>>>> > your string has to be read, but, again, not copied.
>>>> >
>>>> > Extracting a part of a string is a non-destructive operation in Gambas
>>>> > and no copying takes place. (Concatenating strings, on the other hand,
>>>> > will copy.) So, there is some reading overhead (if you need UTF-8
>>>> > strings), but it's smaller than you probably thought.
>>>> >
>>>> > > Dim Alphabetics As String = "abc...zyzABC...ZYZ"
>>>> > > Dim re As New RegExp
>>>> > > Dim matches As New String[]
>>>> > > Dim RawText As String
>>>> > > Dim i As Integer
>>>> > >
>>>> > > re.Compile("([" & Alphabetics & "]+?)([^" & Alphabetics & "]+)",
>>>> > > RegExp.utf8)
>>>> > > RawText = "abc12345def ghi jklm mno p1"
>>>> > >
>>>> > > Do While RawText
>>>> > >      re.Exec(RawText)
>>>> > >      matches.add(re[1].text)
>>>> > >      RawText = String.Mid(RawText, String.Len(re.text) + 1)
>>>> > > Loop
>>>> > >
>>>> > > For i = 0 To matches.Count - 1
>>>> > >   Print matches[i]
>>>> > > Next
>>>> > >
>>>> > >
>>>> > > The above code correctly finds "abc, def, ghi, jklm, mno, p". But the
>>>> > > tricks I have used are cumbersome (like advancing with String.Mid()
>>>> > > and resorting to re[1].Text and re.Text).
>>>> > >
>>>> >
>>>> > Well, I think you can't use PCRE alone to solve your problem, if you
>>>> > want to capture a variable number of words in your submatches. I did a
>>>> > bit of reading and from what I gather [1][2] capturing group numbers
>>>> > are assigned based on the verbatim regular expression, i.e. the number
>>>> > of submatches you can receive is limited by the number of "(...)"
>>>> > constructs in your expression; and the (otherwise very nifty)
>>>> > recursion operator (?R) does not give you an unlimited number of
>>>> > capturing groups, sadly.
>>>> >
>>>> > Anyway, I think by changing your regular expression, you can let PCRE
>>>> > take care of the string advancement, like so:
>>>> >
>>>> >   #!/usr/bin/gbs3
>>>> >
>>>> >   Use "gb.pcre"
>>>> >
>>>> >   Public Sub Main()
>>>> >     Dim r As New RegExp
>>>> >     Dim s As String
>>>> >
>>>> >     r.Compile("([[:alpha:]]+)[[:^alpha:]]+(.*$)", RegExp.UTF8)
>>>> >     s = "abc12345def ghi jklm mno p1"
>>>> >     Print "Subject:";; s
>>>> >     Do
>>>> >       r.Exec(s)
>>>> >       If r.Offset = -1 Then Break
>>>> >       Print " ->";; r[1].Text
>>>> >       s = r[2].Text
>>>> >     Loop While s
>>>> >   End
>>>> >
>>>> > Output:
>>>> >
>>>> >   Subject: abc12345def ghi jklm mno p1
>>>> >    -> abc
>>>> >    -> def
>>>> >    -> ghi
>>>> >    -> jklm
>>>> >    -> mno
>>>> >    -> p
>>>> >
>>>> > But, I think, this is less efficient than using String.Mid(). The
>>>> > trailing group (.*$) _may_ make the PCRE library read the entire
>>>> > subject every time. And I believe gb.pcre will copy your submatch
>>>> > string when returning it. If you care deeply about this, you'll have
>>>> > to trace the code in gb.pcre and main/gbx (the interpreter) to see
>>>> > what copies strings and what doesn't.
>>>> >
>>>> > Regards,
>>>> > Tobi
>>>> >
>>>> > [1] http://www.regular-expressions.info/recursecapture.html
>>>> >     (Capturing Groups Inside Recursion or Subroutine Calls)
>>>> > [2] http://www.rexegg.com/regex-recursion.html
>>>> >     (Groups Contents and Numbering in Recursive Expressions)
>>>> >
>>>> > --
>>>> > "There's an old saying: Don't change anything... ever!" -- Mr. Monk
>>>> >
>>>> > ------------------------------------------------------------
>>>> > ------------------
>>>> > Check out the vibrant tech community on one of the world's most
>>>> > engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>> > _______________________________________________
>>>> > Gambas-user mailing list
>>>> > Gambas-user at lists.sourceforge.net
>>>> > https://lists.sourceforge.net/lists/listinfo/gambas-user
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Fernando Cabral
>>>> Blogue: http://fernandocabral.org
>>>> Twitter: http://twitter.com/fjcabral
>>>> e-mail: fernandojosecabral at ...626...
>>>> Facebook: f at ...3654...
>>>> Telegram: +55 (37) 99988-8868
>>>> Wickr ID: fernandocabral
>>>> WhatsApp: +55 (37) 99988-8868
>>>> Skype:  fernandojosecabral
>>>> Telefone fixo: +55 (37) 3521-2183
>>>> Telefone celular: +55 (37) 99988-8868
>>>>
>>>> Enquanto houver no mundo uma só pessoa sem casa ou sem alimentos,
>>>> nenhum político ou cientista poderá se gabar de nada.
>>>>
>>>
>>>
>>
>>
>>
>>
>


