[Gambas-user] Help needed from regexp gurus / String.Mid() and Mid$() implementation

Sun Jun 18 17:47:50 CEST 2017

Tobi, I have been learning a lot with your comments and suggestions.

Usually, I don't stress performance too much.  In this case, I have tried
to do things a little faster because I am using Gambas as a kind of add-on
to LibreOffice. I use it to check readability, wordiness, sentences that
are too long, things like that. In order to avoid coding, I resorted to RE.
My first try at it resulted in very slow code. In fact, a 150-page long
document took about two and a half minute to check. That's not short enough
to stimulate someone to run the code repeatedly. That's when I tried
introducing RegExp.Replace to canonize the input text and then applying
split(). Split () is fast, but canonizing the input text proved slow.  Then
I resorted to RegExp.Compile and RegExp. Way faster than Regex.Replace(),
but still slow. That's when I asked for help.

Today, following Jussi's suggestion, I went back to Split(). Without the
canonization phase, the same 150-page document was processed in 1,5 second.
That' s 100 times faster than the original version. That is great.
Nevertheless, there a few issues I have not been able to solve in an
elegant way.

Anyway, for this magnitude of performance gain, I am quite willing to go
back to Split () and then do soma massaging in those situations where
results are less than perfect.

Thank you for the many hints you've provided me with.

Regards

- fernando

2017-06-18 11:08 GMT-03:00 Tobias Boege <taboege at ...626...>:

> On Sat, 17 Jun 2017, Fernando Cabral wrote:
> > Tobi
> >
> > One more thing about the way I wish it could work (I remember having done
> > this in C perhaps 30 years ago). The pseudo-code bellow is pretty
> > schematic, but I think it will clarify the issue.
> >
> > Let p and l be arrays of integers and s be the string "abc defg hijkl"
> >
> > So, after traversing the string we would have the following result:
> > p[0] = offset of "a" (0)
> > l[0] = length of "abc" (3)
> > p[1] = offset of "d" (4)
> > l[1] = lenght of "defg" (4)
> > p[2] = offset of "h" (9)
> > l[2] = lenght of "hijkl" (5).
> >
> > After this, each word could be retrieved in the following manner:
> >
> > for i = 0 to 2
> >     print mid(s, p[i], l[i])
> > next
> >
> > I think this would be the most efficient way to do it. But I can't find
> how
> > to do it in Gambas using Regex.
> >
>
> As I said before, the Gambas String.Mid() and Mid$() functions do just
> that.
> The internal representation of a string is some base data (which is usually
> shared among many strings, via reference counting), an offset and a length.
> If you apply String.Mid() or Mid$() to a string, no copying takes place,
> only
> the offset and length members of the Gambas string structure are adjusted.
> This is why Gambas strings are sometimes called "read-only" in the wiki
> (the
> same string base data is shared by many strings, so you can't have external
> libraries modify the data behind a Gambas string). Even the statement
>
>   s = String.Mid$(s, 10, 20)
>
> will *not* require a copy operation. You simply add 10 (UTF-8 positions) to
> the offset member of the string structure and set the length member to 20
> (UTF-8 positions) (or to the remaining length of s if it's smaller than
> 20).
>
> String.Mid() and Mid$() are implemented exactly by manipulating offsets and
> lengths, like you want to do. In fact there are multiple places in the
> Gambas
> source tree where those two are used in place of a C-style
>
>   for (i = 0; i < len; i++)
>     do something with str[i];
>
> loop. I suggest you look at the implementations yourself if you don't
> believe it:
>
>   String datatype: https://sourceforge.net/p/gambas/code/HEAD/tree/gambas/
> trunk/main/share/gambas.h#l126
>   Mid$():          https://sourceforge.net/p/gambas/code/HEAD/tree/gambas/
> trunk/main/gbx/gbx_exec_loop.c#l3820
>   String.Mid():    https://sourceforge.net/p/gambas/code/HEAD/tree/gambas/
> trunk/main/gbx/gbx_c_string.c#l399
>
> (I recommend downloading the source tree and using ctags or something to
> navigate through it, of course, instead of the SF web interface.)
>
> You should also try the following: create a console project with this code
> in the Main.module:
>
>    1 ' Gambas module file
>    2
>    3 Public Sub Main()
>    4   Dim s As String = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
>    5   Dim i As Integer
>    6
>    7   For i = 1 To 5
>    8     s = String.Mid$(s, i, 2*i)
>    9   Next
>   10   s &= "a"
>   11 End
>
> It will call String.Mid$() multiple times. Now compile and run this program
> through callgrind:
>
>   $ cd /path/to/project
>   $ gbc3 -ga
>   $ valgrind --tool=callgrind gbx3
>
> and use kcachegrind to visualise the callgraph. I'll attach the two
> interesting graphs to this mail. One shows that the single invocation
> of SUBR_cat (the &= operation at the end) needed a malloc() and hence
> did something like copying the string, whereas the multiple invocations
> of String_Mid do not lead to malloc or any other means of allocating
> memory, meaning no copy operation takes place.
>
> Assuming you aren't prematurely optimising here and performance is actually
> poor with Gambas code you should look into doing it in C and possibly avoid
> regular expressions altogether. If you always just want all the words in a
> given string, you can do it in a single linear pass through your text.
>
> But honestly, I would be surprised if you have bad performance by using
> String.Mid$(), since it is really just using a map of offsets and lengths
> on a single shared base string.
>
> Regards,
> Tobi
>
> PS: Again about the definition of "word in a string". My point was that if
> you say "a word in a string is a sequence of alphabetic characters followed
> by a non-alphabetic character", then "c", "bc" and "abc" will be words in
> the string "abc.", right? "c" is a (length-1) sequence of alphabetic
> characters which is followed by the non-alphabetic character "." in the
> string. But you don't want to call "c" alone a word because there are
> further
> alphabetic characters in front of it. You want any *longest* sequence of
> alphabetic characters which is followed by a non-alphabetic one, which in
> the string above, would only be "abc".
>
> --
> "There's an old saying: Don't change anything... ever!" -- Mr. Monk
>
> ------------------------------------------------------------
> ------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Gambas-user mailing list
> Gambas-user at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/gambas-user
>
>

-- 
Fernando Cabral
Blogue: http://fernandocabral.org
Twitter: http://twitter.com/fjcabral
e-mail: fernandojosecabral at ...626...
Facebook: f at ...3654...
Telegram: +55 (37) 99988-8868
Wickr ID: fernandocabral
WhatsApp: +55 (37) 99988-8868
Skype:  fernandojosecabral
Telefone fixo: +55 (37) 3521-2183
Telefone celular: +55 (37) 99988-8868

Enquanto houver no mundo uma só pessoa sem casa ou sem alimentos,
nenhum político ou cientista poderá se gabar de nada.