[Gambas-user] Get distinct array from large array
Tobias Boege
tobs at taboege.de
Fri Dec 10 11:37:33 CET 2021
On Fri, 10 Dec 2021, Safiur Rahman wrote:
> Hi
>
> I want to get a string array of distinct items from another string array
> which has large count (about a million). I hope Benoît will add a function
> for this. Till then what can be the fastest way to get array of distinct
> items?
>
> I am using following code to achieve that. But it is slow
>
> Dim yyy As String[]
> Dim xx As String
>
> yyy = New String[]
> For Each xx In xxx
> If yyy.Count = 0 Then
> yyy.Add(xx)
> Else If yyy.Count > 0 Then
> If yyy.Find(xx, gb.Binary) = -1 Then
> yyy.Add(xx)
> Endif
> Endif
> Next
>
This is expensive (in theory) because for each of your million items in
xxx you run over the deduplicated items yyy to see if you are currently
dealing with another duplicate or not. That inner loop is hidden in your
yyy.Find() call. Moreover, the cost of those string comparisons that you
do over and over depends on the length of your strings.
It would be more efficient if there was a fast way to map each of your
items to some "mostly unique" but short number and a fast way to check
for your collection of short numbers whether what you have is a duplicate.
What I am talking about is called "hash table" and it is implemented in
Gambas under the name Collection.
In theory, the following should be faster (at the expense of using more
memory for the added Collection), but please do try it out on real data:
Public Function Deduplicate(xxx As String[]) As String[]
Dim yyy As New String[]
Dim stash As New Collection
For Each x As String in xxx
If stash.Exist(x) Then Continue
stash[x] = True
yyy.Add(x)
Next
Return yyy
End
Note that this assumes that your items will always be strings. If you
want to deduplicate arbitrary objects with an overloaded comparison
operator, then the story is different. Depending on the nature of your
custom comparison, you might be able to sort the array of objects and
then weed out all duplicates in a single iteration. This would (again,
in theory) still be faster than your two nested loops would be in the
worst case. I can elaborate on that if you need it. If you are happy
with string data, then it is not needed.
Best,
Tobias
--
"There's an old saying: Don't change anything... ever!" -- Mr. Monk
More information about the User
mailing list