[Gambas-user] Regex - expert opinion requested

Thu Jun 1 00:21:41 CEST 2017

On Wed, 31 May 2017, Fernando Cabral wrote:
> This is only for those who like to work with regular expressions.
> It is a performance issue. I am using 26 different regular expressions of
> this kind:
> 
> txt = RegExp.Replace(TextoBruto, NaoNumerais, "&1\n", RegExp.UTF8)
> txt = RegExp.Replace(Txt, "\n\n+?", "\n", RegExp.UTF8)
> txt = RegExp.Replace(Txt, "^\n+?", "", RegExp.UTF8)
> txt = RegExp.Replace(Txt, "\n+?$", "", RegExp.UTF8)
> 
> Those are pretty fast. Less than one second for a text with 415KB (about
> six thousand lines).
> 
> But the following code is quite slow. About 27 seconds each:
> 
> ttDigitos = String.Len(RegExp.Replace(TextoBruto, "[^0-9]", "", RegExp.UTF8)) ' 27 seconds
> ttPontuacao = String.Len(RegExp.Replace(TextoBruto, "[^.:;,?!]", "", RegExp.UTF8))  ' 27 seconds
> ttBrancos = String.Len(RegExp.Replace(TextoBruto, "[^ \t]", "", RegExp.UTF8))   ' 27 seconds
> Print "Especial antigo", Now
> 'ttEspeciais = String.Len(RegExp.Replace(TextoBruto, "[^-[\\](){}\"@#$%&*_+=<>/\\\\|ºª§“”‘’]", "", RegExp.UTF8))  ' 27 seconds
> Print "Especial novo", Now
> ttEspeciais = String.Len(RegExp.Replace(TextoBruto, "[-aeiouãáéíóúâõàbcçdfghjlmnpqrstvxyz ,.:;!?()0-9êôwkèìòùäÄÁÉÍÓÚÀÈÌÒÙÂÔÂÊÔÇABCDEFGHIJKLMNOPQRSTUVWXYZ]", "", RegExp.UTF8))  ' 27 seconds
> Print "fim especial novo", Now
> 
> Quite slow. The whole program takes 2 minutes to run. The above lines
> alone consume 108 seconds (108:120).
> 
> I tried some variations. For instance, ttEspeciais = .... has two versions.
> One negates what to leave in, the other describes what to take out. End
> result is the same. And so is the time spent.
> 
> I have also written much longer code that does the same thing using loops
> and searching for the characters I want in or want out. The whole thing
> runs in about 5 seconds (but this code took me much, much longer to write).
> 
> I wonder if any of you could suggest potentially faster RegExp that could
> replace the specimens above.
> 

This sounds interesting, because for one thing I can't imagine a pipe chain
of "sed" invocations to take this long on just 500 KiB input (but I could
be wrong).

Also, in case you didn't know, the IDE also has a very handy profiler
(Debug > Activate profiling menu). It lets you take a somewhat closer look
at where your code spends its time, but it may not be of much help here.

About your regular expressions: I think the key point is that you are really
just erasing characters of character classes. Your expressions are extremely
simple in that regard. You mentioned that avoiding regular expressions gives
you a big speedup but the code took you longer to write. I don't see why.
You should be able to write a general function

  Private Function EraseClass(sStr As String, sClass As String) As String

which erases from sStr every character that is in sClass, using a simple
loop and String.InStr().
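
A minimal, untested sketch of what I have in mind follows. The optional
bInvert parameter is my own addition for illustration, to mimic the
"[^...]" syntax, and I have not measured how this performs on your input:

  ' Sketch: erase from sStr every character that occurs in sClass.
  ' If bInvert is True, keep only the characters that occur in sClass.
  Private Function EraseClass(sStr As String, sClass As String, Optional bInvert As Boolean) As String

    Dim iPos As Integer
    Dim sChar As String
    Dim bFound As Boolean
    Dim aKeep As New String[]

    ' The String.* methods work on UTF-8 characters rather than bytes.
    For iPos = 1 To String.Len(sStr)
      sChar = String.Mid$(sStr, iPos, 1)
      bFound = String.InStr(sClass, sChar) > 0
      If bInvert Then
        If bFound Then aKeep.Add(sChar)
      Else
        If Not bFound Then aKeep.Add(sChar)
      Endif
    Next

    Return aKeep.Join("")

  End

With that, your digit count would become something like
String.Len(EraseClass(TextoBruto, "0123456789", True)).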

You can probably even abuse the Split() function for this. To remove any
single character in sClass from the string sStr, do:

  Split(sStr, sClass).Join("")

Split() probably won't behave well with multibyte characters, though, such
as the UTF-8 you require above. With both approaches it is harder to
implement the "[^...]" inverse character class syntax.
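
If you only need the counts and the class is pure ASCII, a sketch like the
following might sidestep both problems. It assumes that Split() treats each
byte of its second argument as a separator and that Len() returns the
length in bytes. An ASCII byte never occurs inside a UTF-8 multibyte
sequence, so the number of bytes removed should equal the number of
matching characters. CountClass is a hypothetical name, and I have not
tested this:

  ' Sketch: count the characters of sStr that occur in the ASCII-only
  ' class sAscii, without building an inverse character class.
  Private Function CountClass(sStr As String, sAscii As String) As Integer

    ' Bytes removed by Split()/Join() = number of class characters found.
    Return Len(sStr) - Len(Split(sStr, sAscii).Join(""))

  End

For example, ttDigitos = CountClass(TextoBruto, "0123456789") and
ttPontuacao = CountClass(TextoBruto, ".:;,?!"). Your ttEspeciais class
contains non-ASCII characters, so it would still need a different approach.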

Regardless, I would be a little interested in getting a sample project which
includes your regular expressions and such a text file, to see for myself
where the time is exactly spent. Can you send a version of your project that
contains only the parts relevant to these regular expressions?

Regards,
Tobi

-- 
"There's an old saying: Don't change anything... ever!" -- Mr. Monk