[Gambas-user] Help needed from regexp gurus / String.Mid() and Mid$() implementation

Sun Jun 18 16:08:57 CEST 2017

On Sat, 17 Jun 2017, Fernando Cabral wrote:
> Tobi
> 
> One more thing about the way I wish it could work (I remember having done
> this in C perhaps 30 years ago). The pseudo-code below is pretty
> schematic, but I think it will clarify the issue.
> 
> Let p and l be arrays of integers and s be the string "abc defg hijkl"
> 
> So, after traversing the string we would have the following result:
> p[0] = offset of "a" (0)
> l[0] = length of "abc" (3)
> p[1] = offset of "d" (4)
> l[1] = length of "defg" (4)
> p[2] = offset of "h" (9)
> l[2] = length of "hijkl" (5).
> 
> After this, each word could be retrieved in the following manner:
> 
> for i = 0 to 2
>     print mid(s, p[i], l[i])
> next
> 
> I think this would be the most efficient way to do it. But I can't find how
> to do it in Gambas using Regex.
> 

As I said before, the Gambas String.Mid() and Mid$() functions do just that.
The internal representation of a string is some base data (which is usually
shared among many strings, via reference counting), an offset and a length.
If you apply String.Mid() or Mid$() to a string, no copying takes place, only
the offset and length members of the Gambas string structure are adjusted.
This is why Gambas strings are sometimes called "read-only" in the wiki (the
same string base data is shared by many strings, so you can't have external
libraries modify the data behind a Gambas string). Even the statement

  s = String.Mid$(s, 10, 20)

will *not* require a copy operation. You simply add 10 (UTF-8 positions) to
the offset member of the string structure and set the length member to 20
(UTF-8 positions) (or to the remaining length of s if it's smaller than 20).
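To make the idea concrete, here is a minimal C sketch of such a representation.
It is not the actual Gambas code: the names base_t, str_t, str_new, str_mid and
str_free are made up for illustration, and it counts plain bytes from a 0-based
start rather than UTF-8 positions. It only shows that taking a substring can
come down to adjusting an offset and a length on shared, reference-counted data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared base data: the bytes plus a reference count. */
typedef struct {
    int refs;
    size_t size;
    char data[];            /* flexible array member holding the bytes */
} base_t;

/* Hypothetical string value: a window (offset, len) into shared base data. */
typedef struct {
    base_t *base;
    size_t offset;
    size_t len;
} str_t;

/* The only place where bytes are allocated and copied. */
static str_t str_new(const char *text)
{
    size_t n = strlen(text);
    base_t *b = malloc(sizeof *b + n);
    assert(b != NULL);
    b->refs = 1;
    b->size = n;
    memcpy(b->data, text, n);
    return (str_t){ b, 0, n };
}

/* Substring from 0-based byte `start`, at most `len` bytes long.
 * No bytes are copied: only the offset and length change, and the
 * shared base gains one more reference. */
static str_t str_mid(str_t s, size_t start, size_t len)
{
    if (start > s.len)
        start = s.len;
    if (len > s.len - start)
        len = s.len - start;     /* clamp to the remaining length */
    s.base->refs++;
    return (str_t){ s.base, s.offset + start, len };
}

/* Drop one reference; free the base data when nobody uses it any more. */
static void str_free(str_t s)
{
    if (--s.base->refs == 0)
        free(s.base);
}
```

With s holding "abc defg hijkl", str_mid(s, 4, 4) yields a value that points at
the same base data with offset 4 and length 4, i.e. "defg", without a malloc().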

String.Mid() and Mid$() are implemented exactly by manipulating offsets and
lengths, like you want to do. In fact there are multiple places in the Gambas
source tree where those two are used in place of a C-style

  for (i = 0; i < len; i++)
    do something with str[i];

loop. I suggest you look at the implementations yourself if you don't
believe it:

  String datatype: https://sourceforge.net/p/gambas/code/HEAD/tree/gambas/trunk/main/share/gambas.h#l126
  Mid$():          https://sourceforge.net/p/gambas/code/HEAD/tree/gambas/trunk/main/gbx/gbx_exec_loop.c#l3820
  String.Mid():    https://sourceforge.net/p/gambas/code/HEAD/tree/gambas/trunk/main/gbx/gbx_c_string.c#l399

(I recommend downloading the source tree and using ctags or something to
navigate through it, of course, instead of the SF web interface.)

You should also try the following: create a console project with this code
in the Main.module:

  ' Gambas module file

  Public Sub Main()
    Dim s As String = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
    Dim i As Integer

    For i = 1 To 5
      s = String.Mid$(s, i, 2*i)
    Next
    s &= "a"
  End

It will call String.Mid$() multiple times. Now compile and run this program
through callgrind:

  $ cd /path/to/project
  $ gbc3 -ga
  $ valgrind --tool=callgrind gbx3

and use kcachegrind to visualise the callgraph. I'll attach the two
interesting graphs to this mail. One shows that the single invocation
of SUBR_cat (the &= operation at the end) needed a malloc() and hence
did something like copying the string, whereas the multiple invocations
of String_Mid do not lead to malloc or any other means of allocating
memory, meaning no copy operation takes place.

Assuming you aren't prematurely optimising here and performance really is
poor with Gambas code, you should look into doing it in C and possibly avoid
regular expressions altogether. If you always just want all the words in a
given string, you can do it in a single linear pass through your text.
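
A rough sketch of such a single pass in C, filling offset and length arrays
like the p and l from your pseudo-code. The function name find_words and its
signature are my own invention for illustration. It uses isalpha(), so in the
default "C" locale it only recognises ASCII letters (accented characters in
UTF-8 text would need extra handling), and it treats the end of the string as
terminating a word just like a non-alphabetic character would.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Scan the n bytes at s once and record, for every maximal run of
 * letters, its offset in p[] and its length in l[]. At most max
 * words are recorded. Returns the number of words found. */
static size_t find_words(const char *s, size_t n,
                         size_t *p, size_t *l, size_t max)
{
    size_t count = 0;
    size_t i = 0;

    assert(s != NULL && p != NULL && l != NULL);

    while (i < n && count < max) {
        /* Skip everything that is not a letter. */
        while (i < n && !isalpha((unsigned char)s[i]))
            i++;
        if (i == n)
            break;

        /* A word starts here; extend it as far as the letters go. */
        size_t start = i;
        while (i < n && isalpha((unsigned char)s[i]))
            i++;

        p[count] = start;
        l[count] = i - start;
        count++;
    }
    return count;
}
```

For "abc defg hijkl" this gives the offsets 0, 4, 9 and the lengths 3, 4, 5.
Every byte is looked at exactly once and nothing is copied.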

But honestly, I would be surprised if String.Mid$() gave you bad performance,
since it really just keeps a map of offsets and lengths on a single shared
base string.

Regards,
Tobi

PS: Again about the definition of "word in a string". My point was that if
you say "a word in a string is a sequence of alphabetic characters followed
by a non-alphabetic character", then "c", "bc" and "abc" will be words in
the string "abc.", right? "c" is a (length-1) sequence of alphabetic
characters which is followed by the non-alphabetic character "." in the
string. But you don't want to call "c" alone a word because there are further
alphabetic characters in front of it. You want any *longest* sequence of
alphabetic characters which is followed by a non-alphabetic one, which, in
the string above, would only be "abc".

-- 
"There's an old saying: Don't change anything... ever!" -- Mr. Monk
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SUBR_cat-graph.png
Type: image/png
Size: 27782 bytes
Desc: not available
URL: <http://lists.gambas-basic.org/pipermail/user/attachments/20170618/4a24afab/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: String_Mid-graph.png
Type: image/png
Size: 25061 bytes
Desc: not available
URL: <http://lists.gambas-basic.org/pipermail/user/attachments/20170618/4a24afab/attachment-0001.png>