[Gambas-user] I need a hint on how to delete duplicate items in an array

Tobias Boege taboege at ...626...
Fri Jun 30 15:05:24 CEST 2017


On Fri, 30 Jun 2017, Fernando Cabral wrote:
> 2017-06-30 7:44 GMT-03:00 Fabien Bodard <gambas.fr at ...626...>:
> 
> > The best way is nando's ... at least for Gambas.
> >
> > Since you don't have to care about the index value or the order,
> > the walk-ahead option is the better one.
> >
> >
> > Then Fernando ... for big, big things... I think you need to use a DB,
> > or a native language ... maybe an in-memory SQLite structure could be good.
> >
> 
> Fabien, since this is a one-time only thing, I don't think I'd be better
> off with a database.
> Basically, I read a text file and then break it down into words, sentences
> and paragraphs.
> Next I count the items in each array (words, sentences, paragraphs).
> Array.Count works wonderfully.
> After that, I have to eliminate the duplicate words (Array.Words). But in
> doing it, I also have to count
> how many times each word appeared.
> 
> Finally I sort the Array.Sentences and the Array.Paragraphs by size
> (String.Len()). The Array.Words are
> sorted by count + length. This is all working well.
> 
> So, my quest is for the fastest way to eliminate the duplicate words while
> I count them.
> For the time being, here is a working solution based on the system's sort |
> uniq:
> 
> Here is one of the versions I have been using:
> 
> Exec ["/usr/bin/uniq", "Unsorted.txt", "Sorted.srt2"] Wait
> Exec ["/usr/bin/uniq", "-ci", "SortedWords.srt2",  SortedWords.srt3"] Wait
> Exec ["/usr/bin/sort", "-bnr", SortedWords.srt3] To UniqWords
> 

Are those temporary files? You can avoid those by piping your data into the
processes and reading their output directly. Otherwise the Temp$() function
gives you better temporary files.
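
Just to sketch the Temp$() route (the sample data here is made up; the
attached script at the bottom does essentially the same thing):

  Dim aWords As String[] = Split("the quick brown the lazy the", " ")
  Dim sTemp As String = Temp$()   ' temporary file private to this process
  Dim sOut As String

  File.Save(sTemp, aWords.Join("\n"))
  Exec ["sort", sTemp] To sOut
  File.Save(sTemp, sOut)
  Exec ["uniq", "-ci", sTemp] To sOut
  Print sOut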

> WordArray = Split(UniqWords, "\n")
> 
> So, I end up with the result I want. It's effective. Now, it would be more
> elegant if I could do the same
> with Gambas. Of course, the sorting would be easy with the built-in
> WordArray.Sort().
> But how about the '"/usr/bin/uniq", "-ci" ...' part?
> 

I feel like my other mail answered this, but I can give you another version
of that routine (which I said I would leave as an exercise for you):

  ' Remove duplicates in an array like "uniq -ci". String comparison is
  ' case insensitive. The i-th entry in the returned array counts how many
  ' times aStrings[i] (in the de-duplicated array) was present in the input.
  ' The data in ~aStrings~ is overwritten. Assumes the array is sorted.
  Private Function Uniq(aStrings As String[]) As Integer[]
    Dim iSrc, iLast As Integer
    Dim aCount As New Integer[](aStrings.Count)

    If Not aStrings.Count Then Return New Integer[]
    iLast = 0
    aCount[iLast] = 1
    For iSrc = 1 To aStrings.Max
      If String.Comp(aStrings[iSrc], aStrings[iLast], gb.IgnoreCase) Then
        Inc iLast
        aStrings[iLast] = aStrings[iSrc]
        aCount[iLast] = 1
      Else
        Inc aCount[iLast]
      Endif
    Next

    ' Now shrink the arrays to the memory they actually need
    aStrings.Resize(iLast + 1)
    aCount.Resize(iLast + 1)
    Return aCount
  End
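
A minimal call site, just as an illustration with made-up words, might look
like this (note that Uniq() expects equal words to sit next to each other, so
for arbitrarily mixed-case input you would want a case-insensitive sort):

  Dim aWords As String[] = ["pear", "apple", "Apple", "banana"]
  Dim aCount As Integer[]

  aWords.Sort()
  aCount = Uniq(aWords)
  ' aWords is now ["Apple", "banana", "pear"] and aCount is [2, 1, 1]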

What is, in my opinion, at least theoretically better here than the other
proposed solutions is that it runs in linear time, while nando's is
quadratic[*]. (Of course, my routine needs the array sorted first, so the
total cost becomes n*log(n) for the sort, which is still better than
quadratic.)
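
To put rough numbers on that claim (my own back-of-the-envelope figures, not
measurements): for n = 100,000 words, a quadratic method does on the order of
n^2 = 10^10 basic comparisons, while n*log2(n) is only about 1.7 million.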

Attached is a test script with some words. It runs the sort + uniq utilities
first and then Array.Sort() + the Uniq() function above. The program then
prints the *diff* between the two outputs. I get an empty diff, meaning that
my Gambas routines produce exactly the same output as the shell utilities.

Regards,
Tobi

[*] He calls array functions Add() and Find() inside a For loop that runs
    over an array of size n. Adding elements to an array or searching an
    array have themselves worst-case linear complexity, giving quadratic
    overall. My implementation reserves some more space in advance to
    avoid calling Add() in a loop. Since the array is sorted, we can go
    without Find(), too. Actually, as you may know, appending an element to
    the end of an array can be implemented in amortized constant time
    (as C++'s std::vector does) at the cost of some wasted space; AFAICS
    Gambas doesn't do this, but I could be wrong.
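
    A rough sketch of the pattern I mean (my paraphrase for illustration,
    not nando's actual code):

      Private Function UniqSlow(aWords As String[]) As String[]
        Dim sWord As String
        Dim aResult As New String[]

        ' For each of the n input words, Find() rescans aResult (up to n
        ' entries) before Add() appends to it, hence quadratic overall.
        For Each sWord In aWords
          If aResult.Find(sWord, gb.IgnoreCase) < 0 Then aResult.Add(sWord)
        Next
        Return aResult
      End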

-- 
"There's an old saying: Don't change anything... ever!" -- Mr. Monk
-------------- next part --------------
#!/usr/bin/gbs3

' Test data: text from this thread, with the punctuation stripped
Private Const WORDS As String = ""
"Are those temporary files You can avoid those by piping your data into the "
"processes and reading their output directly Otherwise the Temp function "
"gives you better temporary files "
"Fabien since this is a onetime only thing I dont think Id be better "
"off witha database Basically I read a text file an then break it down into "
"words sentences and paragraphs Next I  count the items in each array "
"words sentences paragraphs Arraycount works wonderfully After that "
"have to eliminate the duplicate words Arraywords But in doing it al "
"also have to count how many times each word appeared"

Public Sub Main()
  Dim aStrings As String[]
  Dim aCount As Integer[]
  Dim iInd As Integer
  Dim sOut As String
  Dim sTemp As String = Temp$()
  Dim sTemp2 As String = Temp$()

  aStrings = Split(WORDS, " ")

  ' Reference result: run the words through the shell pipeline (sort, then
  ' uniq -ci), keeping the intermediate data in a temporary file
  File.Save(sTemp, aStrings.Join("\n"))
  Exec ["sort", sTemp] To sOut
  File.Save(sTemp, sOut)
  Exec ["uniq", "-ci", sTemp] To sOut
  File.Save(sTemp, sOut)

  ' Gambas result: sort the array, de-duplicate it with Uniq() and format
  ' the counts the way "uniq -c" prints them, so both files can be diffed
  aStrings.Sort()
  aCount = Uniq(aStrings)
  sOut = ""
  For iInd = 0 To aStrings.Max
    sOut &= Subst$("&1 &2\n", Format$(aCount[iInd], "######0"), aStrings[iInd])
  Next
  File.Save(sTemp2, sOut)

  Exec ["diff", "-u", sTemp, sTemp2] Wait
End

' Remove duplicates in an array like "uniq -ci". String comparison is
' case insensitive. The i-th entry in the returned array counts how many
' times aStrings[i] (in the de-duplicated array) was present in the input.
' The data in ~aStrings~ is overwritten. Assumes the array is sorted.
Private Function Uniq(aStrings As String[]) As Integer[]
  Dim iSrc, iLast As Integer
  Dim aCount As New Integer[](aStrings.Count)

  If Not aStrings.Count Then Return New Integer[]
  iLast = 0
  aCount[iLast] = 1
  For iSrc = 1 To aStrings.Max
    If String.Comp(aStrings[iSrc], aStrings[iLast], gb.IgnoreCase) Then
      Inc iLast
      aStrings[iLast] = aStrings[iSrc]
      aCount[iLast] = 1
    Else
      Inc aCount[iLast]
    Endif
  Next

  ' Now shrink the arrays to the memory they actually need
  aStrings.Resize(iLast + 1)
  aCount.Resize(iLast + 1)
  Return aCount
End

