[Gambas-user] I need a hint on how to deleted duplicate items in a array
Tobias Boege
taboege at ...626...
Fri Jun 30 15:05:24 CEST 2017
On Fri, 30 Jun 2017, Fernando Cabral wrote:
> 2017-06-30 7:44 GMT-03:00 Fabien Bodard <gambas.fr at ...626...>:
>
> > The best way is the nando one ... at least for gambas.
> >
> > As you have not to matter about what is the index value or the order,
> > the walk ahead option is the better.
> >
> >
> > Then Fernando ... for big, big things... I think you need to use a DB.
> > Or a native language.... maybe a sqlite memory structure can be good.
> >
>
> Fabien, since this is a one-time only thing, I don't think I'd be better
> off witha database.
> Basically, I read a text file an then break it down into words, sentences
> and paragraphs.
> Next I count the items in each array (words, sentences paragraphs).
> Array.count works wonderfully.
> After that, have to eliminate the duplicate words (Array.words). But in
> doing it, al also have to count
> how many times each word appeared.
>
> Finally I sort the Array.Sentences and the Array.Paragraphs by size
> (string.len()). The Array.WOrds are
> sorted by count + lenght. This is all woring good.
>
> So, my quest is for the fastest way do eliminate the words duplicates while
> I count them.
> For the time being, here is a working solution based on system' s sort |
> uniq:
>
> Here is one of the versions I have been using:
>
> Exec ["/usr/bin/uniq", "Unsorted.txt", "Sorted.srt2"] Wait
> Exec ["/usr/bin/uniq", "-ci", "SortedWords.srt2", SortedWords.srt3"] Wait
> Exec ["/usr/bin/sort", "-bnr", SortedWords.srt3] To UniqWords
>
Are those temporary files? You can avoid those by piping your data into the
processes and reading their output directly. Otherwise the Temp$() function
gives you better temporary files.
> WordArray = split (UniqWords, "\n")
>
> So, I end up with the result I want. It's effective. Now, it would be more
> elegant If I could do the same
> with Gambas. Of course, the sorting would be easy with the builting
> WordArray.sort ().
> But how about te '"/usr/bin/uniq", "-ci" ...' part?
>
I feel like my other mail answered this, but I can give you another version
of that routine (which I said I would leave as an exercise to you):
' Remove duplicates in an array like "uniq -ci". String comparison is
' case insensitive. The i-th entry in the returned array counts how many
' times aStrings[i] (in the de-duplicated array) was present in the input.
' The data in ~aStrings~ is overridden. Assumes the array is sorted.
Private Function Uniq(aStrings As String[]) As Integer[]
Dim iSrc, iLast As Integer
Dim aCount As New Integer[](aStrings.Count)
If Not aStrings.Count Then Return []
iLast = 0
aCount[iLast] = 1
For iSrc = 1 To aStrings.Max
If String.Comp(aStrings[iSrc], aStrings[iLast], gb.IgnoreCase) Then
Inc iLast
aStrings[iLast] = aStrings[iSrc]
aCount[iLast] = 1
Else
Inc aCount[iLast]
Endif
Next
' Now shrink the arrays to the memory they actually need
aStrings.Resize(iLast + 1)
aCount.Resize(iLast + 1)
Return aCount
End
What, in my opinion, is at least theoretically better here than the other
proposed solutions is that it runs in linear time, while nando's is
quadratic[*]. (Of course, if you sort beforehand, it will become n*log(n),
which is still better than quadratic.)
Attached is a test script with some words. It runs the sort + uniq utilities
first and then Array.Sort() + the Uniq() function above. The program then
prints the *diff* between the two outputs. I get an empty diff, meaning that
my Gambas routines produce exactly the same output as the shell utilities.
Regards,
Tobi
[*] He calls array functions Add() and Find() inside a For loop that runs
over an array of size n. Adding elements to an array or searching an
array have themselves worst-case linear complexity, giving quadratic
overall. My implementation reserves some more space in advance to
avoid calling Add() in a loop. Since the array is sorted, we can go
without Find(), too. Actually, as you may know, adding an element to
the end of an array can be implemented in amortized constant time
(as C++'s std::vector does), by wasting space, but AFAICS Gambas
doesn't do this, but I could be wrong.
--
"There's an old saying: Don't change anything... ever!" -- Mr. Monk
-------------- next part --------------
#!/usr/bin/gbs3
Private Const WORDS As String = ""
"Are those temporary files You can avoid those by piping your data into the "
"processes and reading their output directly Otherwise the Temp function "
"gives you better temporary files "
"Fabien since this is a onetime only thing I dont think Id be better "
"off witha database Basically I read a text file an then break it down into "
"words sentences and paragraphs Next I count the items in each array "
"words sentences paragraphs Arraycount works wonderfully After that "
"have to eliminate the duplicate words Arraywords But in doing it al "
"also have to count how many times each word appeared"
Public Sub Main()
Dim aStrings As String[]
Dim aCount As Integer[]
Dim iInd As Integer
Dim sOut As String
Dim sTemp As String = Temp$()
Dim sTemp2 As String = Temp$()
aStrings = Split(WORDS, " ")
File.Save(sTemp, aStrings.Join("\n"))
Exec ["sort", sTemp] To sOut
File.Save(sTemp, sOut)
Exec ["uniq", "-ci", sTemp] To sOut
File.Save(sTemp, sOut)
aStrings.Sort()
aCount = Uniq(aStrings)
sOut = ""
For iInd = 0 To aStrings.Max
sOut &= Subst$("&1 &2\n", Format$(aCount[iInd], "######0"), aStrings[iInd])
Next
File.Save(sTemp2, sOut)
Exec ["diff", "-u", sTemp, sTemp2] Wait
End
' Remove duplicates in an array like "uniq -ci". String comparison is
' case insensitive. The i-th entry in the returned array counts how many
' times aStrings[i] (in the de-duplicated array) was present in the input.
' The data in ~aStrings~ is overridden. Assumes the array is sorted.
Private Function Uniq(aStrings As String[]) As Integer[]
Dim iSrc, iLast As Integer
Dim aCount As New Integer[](aStrings.Count)
If Not aStrings.Count Then Return []
iLast = 0
aCount[iLast] = 1
For iSrc = 1 To aStrings.Max
If String.Comp(aStrings[iSrc], aStrings[iLast], gb.IgnoreCase) Then
Inc iLast
aStrings[iLast] = aStrings[iSrc]
aCount[iLast] = 1
Else
Inc aCount[iLast]
Endif
Next
' Now shrink the arrays to the memory they actually need
aStrings.Resize(iLast + 1)
aCount.Resize(iLast + 1)
Return aCount
End
More information about the User
mailing list