SnArF Posted September 9, 2014 (edited)

I had to compare two files with more than one million lines each. I tested several examples, but all of them were too slow; most ran for several hours to compare 1 million lines.

I have written a script that compares 2 text files of 1 million lines in less than 5 minutes (once the files are loaded into arrays) and writes the missing lines to 2 text files. On my laptop it compares 10,000 lines in 1.8 sec, 100,000 lines in 21 sec and 1,000,000 lines in 250 sec.

The example script creates 2 arrays with 1,000,000 lines each and then removes some entries. At the end it writes 2 text files with the missing lines per array.

Please test it and leave comments.

```autoit
#include <Array.au3>
#include <Timers.au3>
#include <File.au3>

Local $NrOfRows = 1000000 ; Set number of rows to test
; Both delete lists start with index 0: element [0] is skipped by the sort and
; the search below (start index 1) and is removed together with the equal rows
Local $delString1 = 0
Local $delString2 = 0
Local $Array1[$NrOfRows]
Local $Array2[$NrOfRows]
Local $Index

Local $StartTime = _Timer_Init()
Local $Timer = _Timer_Init()

; Creating 2 arrays
For $i = 0 To $NrOfRows - 1
    $Array1[$i] = "Just some text to emulate data to compare " & $i
Next
$Array2 = $Array1
ConsoleWrite("Arrays created in " & Round(_Timer_Diff($Timer)) & " milliseconds" & @CRLF)
$Timer = _Timer_Init()

; Removing some entries from both arrays to show functionality
_ArrayDelete($Array1, "333;5555;7777")
_ArrayDelete($Array2, "222;4444;6666")
ConsoleWrite("Removed some values in " & Round(_Timer_Diff($Timer)) & " milliseconds" & @CRLF)
$Timer = _Timer_Init()

; You need to sort the array if you use binary search
_ArraySort($Array1, 0, 1, 0, 0, 1)
ConsoleWrite("Sorted array 1 in " & Round(_Timer_Diff($Timer)) & " milliseconds" & @CRLF)
$Timer = _Timer_Init()

; Comparing the 2 arrays
For $i = 0 To UBound($Array2) - 1
    $Index = _ArrayBinarySearch($Array1, $Array2[$i], 1)
    ; Add the indexes of equal rows to the delete lists
    If $Index <> -1 Then
        $delString1 &= ";" & $Index
        $delString2 &= ";" & $i
    EndIf
Next
ConsoleWrite("Arrays compared in " & Round(_Timer_Diff($Timer)) & " milliseconds" & @CRLF)
$Timer = _Timer_Init()

; Removing the equal rows from the arrays
_ArrayDelete($Array1, $delString1)
_ArrayDelete($Array2, $delString2)
ConsoleWrite("Removed equal rows in " & Round(_Timer_Diff($Timer)) & " milliseconds" & @CRLF)
$Timer = _Timer_Init()

; Writing the result to files
_FileWriteFromArray("missing in array 1.txt", $Array2)
_FileWriteFromArray("missing in array 2.txt", $Array1)
ConsoleWrite("Wrote missing values to file in " & Round(_Timer_Diff($Timer)) & " milliseconds" & @CRLF)

ConsoleWrite("Compare complete in " & Round(_Timer_Diff($StartTime)) & " milliseconds" & @CRLF)
```
jguinch Posted September 9, 2014

For big files, I suggest using an external tool. DiffUtils contains the program "diff.exe", which will be very fast for this task.
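A minimal sketch of how diff.exe might be called from an AutoIt script, assuming diff.exe from GNU DiffUtils can be found on the PATH; the three file names are hypothetical placeholders. GNU diff exits with 0 when the files are identical, 1 when they differ and 2 when something went wrong, so the exit code returned by RunWait can be used to report the outcome.

```autoit
; Assumption: diff.exe from GNU DiffUtils is on the PATH.
; The file names below are hypothetical placeholders.
Local $sFile1 = @ScriptDir & "\list1.txt"
Local $sFile2 = @ScriptDir & "\list2.txt"
Local $sOut = @ScriptDir & "\differences.txt"

; Run diff through the command interpreter so its output can be redirected to a file
Local $sCmd = @ComSpec & ' /c diff.exe "' & $sFile1 & '" "' & $sFile2 & '" > "' & $sOut & '"'
Local $iExit = RunWait($sCmd, @ScriptDir, @SW_HIDE)

Switch $iExit
    Case 0
        ConsoleWrite("The files are identical" & @CRLF)
    Case 1
        ConsoleWrite("Differences written to " & $sOut & @CRLF)
    Case Else
        ConsoleWrite("diff.exe reported a problem (exit code " & $iExit & ")" & @CRLF)
EndSwitch
```

In diff's default output format, lines prefixed with `<` appear only in the first file and lines prefixed with `>` only in the second. The sketch does not check that diff.exe was actually found before interpreting the exit code.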
UEZ Posted September 9, 2014 (edited)

Do you only need to know whether the 2 files are different, or also what is different (the content)?

Br,
UEZ
SnArF (author) Posted September 9, 2014

@UEZ, the script shows what is different (the content).

I have a script that makes an index of 2 servers, about 1,500,000 files per server. The results are saved to 2 text files. Then the text files are compared, and only the files that differ are saved to text files.

The complete process, indexing 2 file servers with 1.5 million files each and comparing them, takes about 11 minutes. I think that's very fast.
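The loading step is not shown in the thread. A minimal sketch of how the two index files might be read into arrays with _FileReadToArray from File.au3 follows; the file names are hypothetical. With its default flags, _FileReadToArray stores the line count in element [0], which would fit the start index of 1 that the script in the first post passes to _ArraySort and _ArrayBinarySearch.

```autoit
#include <File.au3>

; Hypothetical file names for the two server indexes
Local $sFile1 = @ScriptDir & "\index_server1.txt"
Local $sFile2 = @ScriptDir & "\index_server2.txt"
Local $aLines1, $aLines2

; _FileReadToArray returns 1 on success and 0 on failure
If Not _FileReadToArray($sFile1, $aLines1) Then
    ConsoleWrite("Could not read " & $sFile1 & @CRLF)
    Exit 1
EndIf
If Not _FileReadToArray($sFile2, $aLines2) Then
    ConsoleWrite("Could not read " & $sFile2 & @CRLF)
    Exit 1
EndIf

; With the default flags, element [0] holds the number of lines read
ConsoleWrite("File 1: " & $aLines1[0] & " lines" & @CRLF)
ConsoleWrite("File 2: " & $aLines2[0] & " lines" & @CRLF)
```

The arrays $aLines1 and $aLines2 could then take the place of $Array1 and $Array2 in the comparison script, under the assumption that the data lines start at index 1.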