The function _ArrayUnique in the Array UDF has some serious design flaws, and it is also incredibly slow. This is a rewrite of that function; practically nothing of the original remains.

This version...

  • doesn't rely on _ArrayAdd; dropping it shaved 4 seconds off a run on a 4100+ row array.
  • has extensive bounds checking on all parameters of the function, something the original didn't do at all.
  • streamlines the error checking, all of it done at the top of the function, so it returns sooner when a parameter isn't set correctly.
  • doesn't use Dim.
  • eliminates the $vDelim parameter, about which even the documentation says, "However, cannot forsee its usefullness".
  • adds the ability to return a 2D array if it is sent a 2D array and the $bReturn2 parameter is set to True. This searches the requested column for duplicated entries and removes the rows holding the duplicates, returning the full row for the first occurrence of each value. I don't know if anyone will find a use for this part of the function, but it was surprisingly easy to implement, so I figured I'd leave it in.

This version is approximately 50 - 80% faster than the original, and won't crash your script if you send it the wrong values in the parameters. Please play around with this to see if you can break it, or improve it. There is an example script inside the archive that uses the MusicList file as input. The MusicList file is just a text file with 4104 lines and 6 columns. Try running the new and original versions using the 6th column as the search column and see the difference in execution time.

Returning a 2D array on a large array is going to take a lot longer than just returning a single dimension of the array. Using it on the 6th column of this array is still faster than returning a 1D array using the original function.
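The 2D behaviour described above (keep the whole row for the first occurrence of each value in the searched column) can be sketched as follows. This is not the function from the archive: the name _ArrayUniqueRows is hypothetical, the use of Scripting.Dictionary for the "already seen" test is an assumption of mine, and the column is taken to be 1-based. The result is allocated once and trimmed with a single ReDim, rather than growing it row by row.

```autoit
; Hypothetical sketch, not the function from the archive.
; Keeps the first row seen for each distinct value in the 1-based column $iColumn.
Func _ArrayUniqueRows(Const ByRef $aArray, $iColumn = 1)
    ; All parameter checks happen up front so bad input returns immediately.
    If Not IsArray($aArray) Or UBound($aArray, 0) <> 2 Then Return SetError(1, 0, 0)
    Local $iRows = UBound($aArray, 1), $iCols = UBound($aArray, 2)
    If $iRows = 0 Then Return SetError(1, 0, 0)
    If $iColumn < 1 Or $iColumn > $iCols Then Return SetError(3, 0, 0)

    Local $oSeen = ObjCreate("Scripting.Dictionary")
    If Not IsObj($oSeen) Then Return SetError(2, 0, 0)
    $oSeen.CompareMode = 1 ; 1 = text compare, so "a" and "A" are the same key

    Local $aResult[$iRows][$iCols] ; worst case: every row is unique
    Local $iKept = 0
    For $i = 0 To $iRows - 1
        If $oSeen.Exists($aArray[$i][$iColumn - 1]) Then ContinueLoop
        $oSeen.Add($aArray[$i][$iColumn - 1], 0)
        ; Copy the whole row of the first occurrence.
        For $j = 0 To $iCols - 1
            $aResult[$iKept][$j] = $aArray[$i][$j]
        Next
        $iKept += 1
    Next
    ReDim $aResult[$iKept][$iCols] ; trim the unused rows in one step
    Return $aResult
EndFunc

; Usage: rows 0 and 1 are kept, rows 2 and 3 repeat a value in column 1.
Local $aDemoRows[4][2] = [["a", 1], ["b", 2], ["A", 3], ["b", 4]]
Local $aFirstRows = _ArrayUniqueRows($aDemoRows, 1)
ConsoleWrite("Rows kept: " & UBound($aFirstRows) & @CRLF)
```

I have not timed this sketch against either version discussed here.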


Do you think you will update the UDF version with this one?


There was already an attempt by wraithdu, and one by me somewhere in this forum.

Maybe it is worth looking at all the versions and merging them into a new super-duper one to replace the current one.




And this one >> 

Well, it's just an idea and would need some tweaking to match the functionality of _ArrayUnique().

Unfortunately, Jon has declared that Assign working with any text at all, instead of a string that would be a valid variable name, is a bug in the function.

Trac ticket: #2478


Damn, that one is fast.


BTW, there's a bug in the function I posted; I was too quick with the copy/paste.

; This line
    If ($iColumn < 1) Or ($iNumColumns = 0 And ($iColumn - 1 > $iNumColumns)) Or ($iNumColumns > 0 And ($iColumn -1 > $iNumColumns)) Then Return SetError(3, 0, 0)
; needs to be changed to this
    If ($iColumn < 1) Or ($iNumColumns = 0 And ($iColumn - 1 > $iNumColumns)) Or ($iNumColumns > 0 And ($iColumn > $iNumColumns)) Then Return SetError(3, 0, 0)



The Assign restriction is not an issue at all. Just declare the binary of each string instead, and tweak the input for case insensitivity. The real issue with that technique is the one I mentioned in another post. If the technique were to be employed, I believe it would be wise to add a cautionary note about performance degradation.


A very good point you made in that post, which is probably why the original ArrayUnique slows to a crawl on large arrays with no duplicates.

I'm very surprised that the current version of _ArrayUnique ever made it into the Array UDF in its broken state. I'm not even talking about the slowness; I'm talking about the fact that none of the parameters are checked. The array is only checked to see if it's an array. It doesn't take much to make it error out.


You'd have to sort the array you're searching in, which in this case is the $T_Array, and you'd have to sort it after every insertion.


I believe turning all rows into delimited strings may be a viable option. I do this with the elements of a one-dimensional array to achieve significantly faster results. The code is not currently reliable because the delimiters need to be tested.

Alternatively it may (I don't know) still be possible to use the variable declaration technique on portions of the array and test all portions against each other to limit the number of declarations. This approach might still beat the speed of all other methods hands down. Not an easy function to create, but it might be possible. Such a function would require extremely rigorous testing. You could remove dupes as you go using helper functions. For example, parse the array 1000 rows at a time, starting with the last element, and use ReDim after each batch of 1000.

On reflection this idea probably won't work (I need to think about this a while longer). The length of each row could still cause problems, as discussed in the post I linked to earlier.

I also thought about combining two different approaches.


There are some issues with this implementation, but this is how I always interpreted _ArrayUnique to work. Am I right? (BrewManNH's MusicList is needed.)


#include <Array.au3>

_main()

Func _main()
  Local Const $array = _create_array()
  Local $timer, $unique_array
  ; Time the unique pass on each of the six columns (0-based).
  For $column = 0 To 5
    $timer = TimerInit()
    $unique_array = __ArrayUnique($array, $column)
    ConsoleWrite((TimerDiff($timer) / 1000) & @CRLF)
  Next
EndFunc

Func __ArrayUnique(Const $array, Const $column = 0, Const $zero_base = True)
  ; $zero_base is accepted but not used yet.
  If Not IsArray($array) Then Return SetError(1, 0, False)
  Local Const $array_size = UBound($array)
  Local $new_array = array_column_to_1dim($array, $column)
  ; Sorting puts equal terms next to each other, so one pass finds every run.
  _ArraySort($new_array)
  Local $unique_array[1] = ['']
  Local $k = 0
  Local $term = ''
  Local $last_term = ''
  For $i = 0 To $array_size - 1
    $term = $new_array[$i]
    ; Record a term only when it starts a new run of equal values.
    If $i = 0 Or $last_term <> $term Then
      ReDim $unique_array[$k + 1]
      $unique_array[$k] = $term
      $k += 1
      $last_term = $term
    EndIf
  Next
  Return $unique_array
EndFunc

Func array_column_to_1dim(Const $array, Const $column)
  ; Copy one column of the 2D array into a new 1D array.
  Local Const $array_size = UBound($array)
  Local $new_array[$array_size]
  For $i = 0 To $array_size - 1
    $new_array[$i] = $array[$i][$column]
  Next
  Return $new_array
EndFunc

Func _create_array() ; BrewManNH
  Local Const $hFile = FileOpen(@ScriptDir & "\MusicList")
  Local Const $TempText = FileReadLine($hFile)
  Local $aTempText = StringSplit($TempText, '-')
  Local Const $Count = Number($aTempText[2]) ; get the song count from the music list header info
  Local $Array[$Count][6]
  Local $TmpLine = ''
  Local $aTmpLine
  For $I = 0 To $Count - 1 ; stay inside the $Count rows that were allocated
    $TmpLine = FileReadLine($hFile)
    If @error Then ExitLoop
    $aTmpLine = StringSplit($TmpLine, '|')
    For $X = 1 To 6
      $Array[$I][$X - 1] = $aTmpLine[$X]
    Next
  Next
  FileClose($hFile)
  Return $Array
EndFunc

The last column takes the longest time on my machine at 12 seconds.


Other than it sorting the array, that method is pretty fast, although I think that wraithdu's scripting dictionary version is going to be the big winner in speed.

My timing tests using the musiclist posted in the first post, in order of speed.


Time taken = 0.0766695356048192 seconds - wraithdu's function

Time taken = 5.96837603512747 seconds - jabberwocky's function

Time taken = 30.0241667517623 seconds - My function

Time taken = 140.443000520299 seconds - Original function

I did notice a couple of things about your function, jabberwocky. The column number doesn't match the original function's parameter: the original assumes the first column is 1 and yours assumes 0. Also, the original returns the count in element 0 of the array, where yours doesn't. Both are minor and easily fixed though.

I tried it with an SQLite in-memory database, and I get a very consistent 1.3 seconds on every column, plus or minus a few milliseconds.
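For readers who haven't seen the scripting dictionary technique mentioned above, a minimal sketch of the general idea follows. It is not wraithdu's code. The name _ArrayUniqueDict is hypothetical, and I have assumed the conventions described for the original function: a 1-based column, and the count returned in element 0. Each value becomes a dictionary key, so the duplicate test is a keyed lookup instead of a scan through the values found so far.

```autoit
; Hypothetical sketch of the dictionary technique, not wraithdu's code
; and not part of Array.au3.
Func _ArrayUniqueDict(Const ByRef $aArray, $iColumn = 1)
    If Not IsArray($aArray) Then Return SetError(1, 0, 0)
    Local $iDims = UBound($aArray, 0) ; dimension 0 = number of dimensions
    If $iDims > 2 Then Return SetError(2, 0, 0)
    Local $iCols = 1
    If $iDims = 2 Then $iCols = UBound($aArray, 2)
    If $iColumn < 1 Or $iColumn > $iCols Then Return SetError(3, 0, 0)

    Local $oDict = ObjCreate("Scripting.Dictionary")
    If Not IsObj($oDict) Then Return SetError(4, 0, 0)
    $oDict.CompareMode = 1 ; 1 = text compare (case-insensitive keys)

    Local $vValue
    For $i = 0 To UBound($aArray) - 1
        If $iDims = 1 Then
            $vValue = $aArray[$i]
        Else
            $vValue = $aArray[$i][$iColumn - 1]
        EndIf
        ; Only the first occurrence of each value is added as a key.
        If Not $oDict.Exists($vValue) Then $oDict.Add($vValue, 0)
    Next

    ; Copy the keys out, leaving element 0 for the count.
    Local $aKeys = $oDict.Keys()
    Local $aResult[UBound($aKeys) + 1]
    $aResult[0] = UBound($aKeys)
    For $i = 0 To UBound($aKeys) - 1
        $aResult[$i + 1] = $aKeys[$i]
    Next
    Return $aResult
EndFunc

; Usage: three distinct values, ignoring case.
Local $aDemoList[5] = ["x", "y", "X", "z", "y"]
Local $aDemoUnique = _ArrayUniqueDict($aDemoList)
ConsoleWrite("Unique values: " & $aDemoUnique[0] & @CRLF)
```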


It is faster than the current UDF.

Is there a way it can return a 2-dimensional array?

Below is the array I have.

Local $aArray[10] = ["Alan", "James", "Alan", "Alan", "John", "John", "Alan", "James", "John", "John"]

When you use _ArrayUnique($aArray) or __ArrayUnique($aArray), no matter whether it is fast or slow,

the returned array will be:

$aArray[0] = "Alan"

$aArray[1] = "James"

$aArray[2] = "John"

Currently I am looking for an array unique function that can also return the number of occurrences of each unique value, in a 2D version like the one below:

Row | Col0  | Col1
0   | Alan  | 4
1   | James | 2
2   | John  | 4

Column 1 will show how many times each value is repeated in the array.

Can anyone help me with this?

Your help will be much appreciated. Thank you.


Code based on Scripting.Dictionary runs very quickly.

Using Assign, Eval and IsDeclared makes the code even faster, but it has a problem interpreting the character "[". If this function were implemented natively within AutoIt3, we would get a higher speed.
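The request above for a unique list with a count per entry might be handled with Scripting.Dictionary along these lines. This is only a sketch: the name _ArrayUniqueCount is hypothetical, it handles 1D arrays only, and it compares keys case-insensitively. Each value is a dictionary key and the item stored under it is the running count.

```autoit
; Hypothetical helper, not part of Array.au3.
; Returns a 2D array: column 0 = unique value, column 1 = number of occurrences.
Func _ArrayUniqueCount(Const ByRef $aArray)
    If Not IsArray($aArray) Or UBound($aArray, 0) <> 1 Or UBound($aArray) = 0 Then Return SetError(1, 0, 0)
    Local $oDict = ObjCreate("Scripting.Dictionary")
    If Not IsObj($oDict) Then Return SetError(2, 0, 0)
    $oDict.CompareMode = 1 ; 1 = text compare, so "alan" and "Alan" share one key

    For $i = 0 To UBound($aArray) - 1
        If $oDict.Exists($aArray[$i]) Then
            ; Seen before: bump the stored count.
            $oDict.Item($aArray[$i]) = $oDict.Item($aArray[$i]) + 1
        Else
            $oDict.Add($aArray[$i], 1)
        EndIf
    Next

    ; Copy keys and counts into the 2D result.
    Local $aKeys = $oDict.Keys()
    Local $aResult[UBound($aKeys)][2]
    For $i = 0 To UBound($aKeys) - 1
        $aResult[$i][0] = $aKeys[$i]
        $aResult[$i][1] = $oDict.Item($aKeys[$i])
    Next
    Return $aResult
EndFunc

; Usage with the names from the question.
Local $aNames[10] = ["Alan", "James", "Alan", "Alan", "John", "John", "Alan", "James", "John", "John"]
Local $aCounts = _ArrayUniqueCount($aNames)
For $i = 0 To UBound($aCounts) - 1
    ConsoleWrite($aCounts[$i][0] & " | " & $aCounts[$i][1] & @CRLF)
Next
```

For the ten names in the question this should give the three rows asked for: Alan with 4, James with 2 and John with 4.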

