Quinch Posted January 23, 2012 Share Posted January 23, 2012 (edited) I've got a small problem I could use some help with. I've put together a script that downloads a lot of small files, then compares them against a list to see which ones to keep. The list is simple enough, a twodimensional array consisting of a MD5 hash and filenames that match that hash. The script basically goes through an for-to loop comparing each value, but since the array will grow into tens or hundreds of thousands of entries, the one-by-one kind of approach is likely to turn into something that runs at a snail's pace. With that in mind, are there ways, preferably simple, to speed it up? Would loading the entire index as a single string variable and using stringinstr and similar commands to check for and edit values be faster? Edited January 23, 2012 by Quinch Link to comment Share on other sites More sharing options...
water Posted January 23, 2012 Share Posted January 23, 2012 Are the lists sorted by name? My UDFs and Tutorials: Spoiler UDFs: Active Directory (NEW 2024-07-28 - Version 1.6.3.0) - Download - General Help & Support - Example Scripts - Wiki ExcelChart (2017-07-21 - Version 0.4.0.1) - Download - General Help & Support - Example Scripts OutlookEX (2021-11-16 - Version 1.7.0.0) - Download - General Help & Support - Example Scripts - Wiki OutlookEX_GUI (2021-04-13 - Version 1.4.0.0) - Download Outlook Tools (2019-07-22 - Version 0.6.0.0) - Download - General Help & Support - Wiki PowerPoint (2021-08-31 - Version 1.5.0.0) - Download - General Help & Support - Example Scripts - Wiki Task Scheduler (2022-07-28 - Version 1.6.0.1) - Download - General Help & Support - Wiki Standard UDFs: Excel - Example Scripts - Wiki Word - Wiki Tutorials: ADO - Wiki WebDriver - Wiki Link to comment Share on other sites More sharing options...
Quinch Posted January 23, 2012 Author Share Posted January 23, 2012 Nope... although... if I sort the list by hash value, maybe I could go by a "by half" route - start looking at the middle, if the hash to be compared against is greater/smaller than the midpoint, split the remains, etc until the un-eliminated section to be searched is something reasonable. That still leaves me looking for filenames, though - each hash has one or more filenames attached to it, and if a match is found, return the first filename associated with that hash, so the sort-and-split approach wouldn't work there. I've attached both the script and an example of the list - the index is the result of a single page backup, and since the script is intended as a mid-to-long term solution, it's going to grow - a lot.archiver.au3MD5_index.txt Link to comment Share on other sites More sharing options...
DicatoroftheUSA Posted January 23, 2012 Share Posted January 23, 2012 (edited) I found filelisttoarray to be very slow, I prefer to use string based file searches such as FileFindFirstFile() etc... Edited January 23, 2012 by DicatoroftheUSA Statism is violence, Taxation is theft. Autoit Wiki Link to comment Share on other sites More sharing options...
Moderators Melba23 Posted January 23, 2012 Moderators Share Posted January 23, 2012 DicatoroftheUSA,Have you looked inside the _FilelistToArray function? Probably not or you would have seen: Func _FileListToArray($sPath, $sFilter = "*", $iFlag = 0) Local $hSearch, $sFile, $sFileList, $sDelim = "|" $sPath = StringRegExpReplace($sPath, "[/]+z", "") & "" ; ensure single trailing backslash If Not FileExists($sPath) Then Return SetError(1, 1, "") If StringRegExp($sFilter, "[/:><|]|(?s)As*z") Then Return SetError(2, 2, "") If Not ($iFlag = 0 Or $iFlag = 1 Or $iFlag = 2) Then Return SetError(3, 3, "") $hSearch = FileFindFirstFile($sPath & $sFilter) ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< If @error Then Return SetError(4, 4, "") While 1 $sFile = FileFindNextFile($hSearch) ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< If @error Then ExitLoop If ($iFlag + @extended = 2) Then ContinueLoop $sFileList &= $sDelim & $sFile WEnd FileClose($hSearch) If Not $sFileList Then Return SetError(4, 4, "") Return StringSplit(StringTrimLeft($sFileList, 1), "|") EndFunc ;==>_FileListToArrayHow else did you think it worked? M23 Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind Open spoiler to see my UDFs: Spoiler ArrayMultiColSort ---- Sort arrays on multiple columnsChooseFileFolder ---- Single and multiple selections from specified path treeview listingDate_Time_Convert -- Easily convert date/time formats, including the language usedExtMsgBox --------- A highly customisable replacement for MsgBoxGUIExtender -------- Extend and retract multiple sections within a GUIGUIFrame ---------- Subdivide GUIs into many adjustable framesGUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView itemsGUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeViewMarquee ----------- Scrolling tickertape GUIsNoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxesNotify ------------- Small notifications on the edge of the displayScrollbars ----------Automatically sized scrollbars with a single commandStringSize ---------- Automatically size controls to fit textToast -------------- Small GUIs which pop out of the notification area Link to comment Share on other sites More sharing options...
Mikeman27294 Posted January 23, 2012 Share Posted January 23, 2012 That works similarly to something I have been doing recently - I was using the pipe as a delimiter and splitting at that to make my arrays. I would suggest something like that. It doesn't really matter if it doesn't look nice in the archive, does it? You could just have "FileName|MD5|AnotherFileName|MD5" Then you could create the array based on stringsplitting by the pipe ( | - that is a pipe, if you dont know.) Link to comment Share on other sites More sharing options...
DicatoroftheUSA Posted January 23, 2012 Share Posted January 23, 2012 (edited) @Melba23Yes, but what I mean is I personally tend to avoid using arrays entirely.I am not the best at using Autoit, but whenever I use arrays, rather than strings only it seems rather slow. Especially if re-dim is required. Edited January 23, 2012 by DicatoroftheUSA Statism is violence, Taxation is theft. Autoit Wiki Link to comment Share on other sites More sharing options...
Spiff59 Posted January 23, 2012 Share Posted January 23, 2012 Even though the MD5's are not unique, your filenames are, so sorting by name, as Water mentioned, would allow you to do binary searches. Beyond that, you could split your huge list into parts, for instance, 27 arrays, A-Z + non-alpha, and search the smaller arrays depending on the first character of the filename. Or, you could always go with using a database. Come to think of it, you could probably also "cheat" and use Yashied's Assign/IsDeclared route which is lightning fast. Assign your filenames as the variable name (with some sort of appended prefix or suffix to avoid conflicts with "real" variables) and set the MD5 as the value of the dummy variable. Maybe something like: Global $array[5][2] = [[4,0],["Acrobat.exe", 3333333333],["Explorer.exe", 7777777777],["Windows.exe", 1234567890],["Winzip.exe", 99999999999]] Global $filename, $filehash For $x = 1 to $array[0][0] ; load array to variables Assign("__" & $array[$x][0], $array[$x][1]) Next $filename = "Explorer.exe" $filehash = 6666666666 MsgBox(0,"","Found = " & Search()) $filename = "Windows.exe" $filehash = 1234567890 MsgBox(0,"","Found = " & Search()) Func Search() If IsDeclared("__" & $filename) And Eval("__" & $filename) = $filehash Then Return 1 EndFunc Link to comment Share on other sites More sharing options...
jchd Posted January 23, 2012 Share Posted January 23, 2012 Quinch, Look around for ScriptingDictionary. It should do the job efficiently for your purpose. Of course, use the hash as key and filename as value. Another option would be to use SQLite for that but it may be a bit slower and less easy to use for such a simple application if it's used once. The advantage of an SQLite database is that you can grow your database as much as you need without (essentially) any performance penalty until you reach an undecent number of hash/filename couples. This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt) Link to comment Share on other sites More sharing options...
Moderators Melba23 Posted January 23, 2012 Moderators Share Posted January 23, 2012 DicatoroftheUSA,Especially if re-dim is requiredNow there you are quite correct! But arrays are very useful and there are ways to reduce the number of ReDims you need - even if you do not know how big your array will become. Take a look at the final example in the Recursion tutorial in the Wiki. M23 Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind Open spoiler to see my UDFs: Spoiler ArrayMultiColSort ---- Sort arrays on multiple columnsChooseFileFolder ---- Single and multiple selections from specified path treeview listingDate_Time_Convert -- Easily convert date/time formats, including the language usedExtMsgBox --------- A highly customisable replacement for MsgBoxGUIExtender -------- Extend and retract multiple sections within a GUIGUIFrame ---------- Subdivide GUIs into many adjustable framesGUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView itemsGUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeViewMarquee ----------- Scrolling tickertape GUIsNoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxesNotify ------------- Small notifications on the edge of the displayScrollbars ----------Automatically sized scrollbars with a single commandStringSize ---------- Automatically size controls to fit textToast -------------- Small GUIs which pop out of the notification area Link to comment Share on other sites More sharing options...
KaFu Posted January 23, 2012 Share Posted January 23, 2012 Look around for ScriptingDictionary. It should do the job efficiently for your purpose. Of course, use the hash as key and filename as value. OS: Win10-22H2 - 64bit - German, AutoIt Version: 3.3.16.1, AutoIt Editor: SciTE, Website: https://funk.eu AMT - Auto-Movie-Thumbnailer (2024-Oct-13) BIC - Batch-Image-Cropper (2023-Apr-01) COP - Color Picker (2009-May-21) DCS - Dynamic Cursor Selector (2024-Oct-13) HMW - Hide my Windows (2024-Oct-19) HRC - HotKey Resolution Changer (2012-May-16) ICU - Icon Configuration Utility (2018-Sep-16) SMF - Search my Files (2024-Oct-20) - THE file info and duplicates search tool SSD - Set Sound Device (2017-Sep-16) Link to comment Share on other sites More sharing options...
Quinch Posted January 23, 2012 Author Share Posted January 23, 2012 Well... crap. That's one step forward and two steps back. I've been looking at some of these filenames and it's pretty obvious that they can't be the one and the same file. You're right - MD5 hashes are't unique - or unique enough, at least - to reliably use as a check if two files are identical in content. Which brings me to a new question - short of comparing file contents outright, what would be? Would a layered approach work? For example, check for filesize matches, then check those for MD5 if a match is found? When dealing with thousands of files mostly under 10Kb, what would the reliability rate be between those two, offhand? Link to comment Share on other sites More sharing options...
BrewManNH Posted January 23, 2012 Share Posted January 23, 2012 If you're using a 2D array, you could sort the array on the "column" that you want to search in and use the _Array.au3 file in my signature, which includes a 2D version of _ArrayBinarySearch which would greatly speed up the searching process. If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag GudeHow to ask questions the smart way! I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from. Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays. - ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script. - Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label. - _FileGetProperty - Retrieve the properties of a file - SciTE Toolbar - A toolbar demo for use with the SciTE editor - GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI. - Latin Square password generator Link to comment Share on other sites More sharing options...
jchd Posted January 24, 2012 Share Posted January 24, 2012 I beg your pardon. Do any of you guys seriously expect MD5 collisions on random 10k files? I mean spurious MD5 collisions on non-cooked data/executable, real-world files? Can you exhibit one non_cooked example (just curious)? Hint: that would be a world record! MvGulik and KaFu 2 This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt) Link to comment Share on other sites More sharing options...
MvGulik Posted January 24, 2012 Share Posted January 24, 2012 Yes, but what I mean is I personally tend to avoid using arrays entirely.Learn when, and when not, to use them and they will serve you well, young ... erm. "Straight_and_Crooked_Thinking" : A "classic guide to ferreting out untruths, half-truths, and other distortions of facts in political and social discussions.""The Secrets of Quantum Physics" : New and excellent 2 part documentary on Quantum Physics by Jim Al-Khalili. (Dec 2014) "Believing what you know ain't so" ... Knock Knock ... Link to comment Share on other sites More sharing options...
Quinch Posted January 24, 2012 Author Share Posted January 24, 2012 I beg your pardon. Do any of you guys seriously expect MD5 collisions on random 10k files? I mean spurious MD5 collisions on non-cooked data/executable, real-world files?Not sure if sarcastic {of if I'm not awake enough to read this right} but check the MD5 index file attached a few posts up - there's a couple of MD5 entries that contain files in both gif, jog and png format, so... I guess yes? Link to comment Share on other sites More sharing options...
jchd Posted January 24, 2012 Share Posted January 24, 2012 (edited) No sarcastic at all. To be honest, I didn't look at your index file before.Now it's just plain impossible to collect as many _spurious_ MD5 collision on distinct _non-cooked_ binary files as your index file shows.What that suggests is that there must be something wrong in the MD5s you obtain.Can you post a couple of distinct binary files colliding (please no cheating here)?Please get me right: I'm not saying that MD5 collisions can't be computed, but that requires very (and I mean VERY) special conditions on both input files (and then you're not free to choose the resulting MD5).The odds of random (or so, like jpg, png) 10kb binary files colliding spuriously while being internally completely distinct are unchanged and still much, much less than the odds of having a neutrino interact with your PC and changing a single bit somewhere. And the latter odds are so small that you can bet you life safely on it.What is commonly known is that given an input file A with MD5 hash M[A], it only takes seconds to compute another distinct file B with M[A] = M[A].That's what makes MD5 unsecure for any cryptographic purposes: an opponent may _compute_ a distinct file and exhibit it, claiming it's the original. Turning the counterfeit file with same MD5 into a meaningful file is quite another task which hasn't yet be fully addressed unless we talk 5 to (say) 20 letters passwords or allow for unlimited special characters in text files, for instance. Short inputs like passwords are being more and more tabulated by rainbow tables.What most people don't get is that the odds of a "natural" collision between two _meaningful_ distinct, real-world files (say images) A and B is still extroardinary low. What I say implies we don't put ourselves in a context where we can expect determined opponents, which I understand isn't your case.Files in 10kb range produce MD5 hashes that noone on Earth can distinguish from 10Mb files or greater. In practice, your result is a random MD5 byte sequence. Cryptanalysis of MD5 estimate the practical number of possible MD5 to be somewhere in the 2120 to 2127 range. Let's be overly conservative and only estimate the possible number of hashes to be 2100.Comparing 100 billion of random files isn't likely to cause a single collision in any reasonable time framework. 2100 is nothing less than 1,267,650,600,228,229,401,496,703,205,376 which means we still have a very safe margin before the odds of a natural collision occuring tomorrow exceed the odds of the Sun going nova in the same time.Your situation (comparing contents of image files) boils down to the above, which is called a "birthday attack": accumulate enough real-world data hashes to find a match.Note that no hash standard is 100% immune to such "spurious collide" event, but the longer the hash is bitwise, the smaller the odds are. Anyway, the 128-bit MD5 hash is still perfectly useable for routine file compare with excellent safety, of course unless a malicious hand can get in the process.Edit: I fought a long time against unwanted bold type, sorry! Edited January 24, 2012 by jchd This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt) Link to comment Share on other sites More sharing options...
Quinch Posted January 24, 2012 Author Share Posted January 24, 2012 Well, that was definitely an educational post. Anyway, I'll check and send again when I get the chance - I'm at work right now, so it's not an option. If it's of consequence, I'm using the built-in _Crypt_HashData UFD for the hashing. Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now