[Gambas-user] Get distinct array from large array

Tobias Boege tobs at taboege.de
Fri Dec 10 11:37:33 CET 2021


On Fri, 10 Dec 2021, Safiur Rahman wrote:
> Hi
> 
> I want to get a string array of distinct items from another string array
> which has large count (about a million). I hope Benoît will add a function
> for this. Till then what can be the fastest way to get array of distinct
> items?
> 
> I am using following code to achieve that. But it is slow
> 
>   Dim yyy As String[]
>   Dim xx As String
> 
>   yyy = New String[]
>   For Each xx In xxx
>     If yyy.Count = 0 Then
>       yyy.Add(xx)
>     Else If yyy.Count > 0 Then
>       If yyy.Find(xx, gb.Binary) = -1 Then
>         yyy.Add(xx)
>       Endif
>     Endif
>   Next
> 

This is expensive (in theory, quadratic in the worst case) because for
each of your million items in xxx you run over the deduplicated items
yyy to see whether you are currently dealing with another duplicate.
That inner loop is hidden in your yyy.Find() call. Moreover, the cost
of those string comparisons that you repeat over and over also scales
with the length of your strings.

It would be more efficient if there were a fast way to map each of your
items to some "mostly unique" but short number, and a fast way to check
that collection of short numbers for whether the current item is a
duplicate. What I am talking about is called a "hash table", and it is
implemented in Gambas under the name Collection.

In theory, the following should be faster (at the expense of using more
memory for the added Collection), but please do try it out on real data:

  Public Function Deduplicate(xxx As String[]) As String[]
    Dim yyy As New String[]
    Dim stash As New Collection

    For Each x As String In xxx
      If stash.Exist(x) Then Continue
      stash[x] = True
      yyy.Add(x)
    Next
    Return yyy
  End

Note that this assumes that your items will always be strings. If you
want to deduplicate arbitrary objects with an overloaded comparison
operator, then the story is different. Depending on the nature of your
custom comparison, you might be able to sort the array of objects and
then weed out all duplicates in a single pass, since sorting puts equal
items next to each other. This would (again, in theory) still be faster
than your two nested loops in the worst case. I can elaborate on that
if you need it; if string data is all you have, it is not needed.
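Just to illustrate the sort-then-scan idea on the plain string case,
here is an untested sketch. It assumes the stock Array methods Copy()
and Sort() and the Comp() function with gb.Binary; the name
DeduplicateSorted is made up for this example:

  Public Function DeduplicateSorted(xxx As String[]) As String[]
    Dim yyy As New String[]
    Dim zzz As String[]
    Dim x As String

    ' Sort() reorders the array in place, so work on a copy
    ' to leave the caller's array untouched.
    zzz = xxx.Copy()
    zzz.Sort(gb.Binary)
    For Each x In zzz
      ' After sorting, duplicates sit next to each other: keep x
      ' only if it differs from the last item we kept.
      If yyy.Count = 0 Or If Comp(x, yyy[yyy.Count - 1], gb.Binary) <> 0 Then
        yyy.Add(x)
      Endif
    Next
    Return yyy
  End

Note that this trades the extra Collection memory for the cost of
sorting, and unlike the Collection version it does not preserve the
original order of first occurrences.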

Best,
Tobias

-- 
"There's an old saying: Don't change anything... ever!" -- Mr. Monk
