
SMF - The fastest duplicate files finder... [Updated 2024-Oct-20]


KaFu

Guest KeeForm

Hi Kafu,

thanks, that fixed it.

Regarding speed: I have a proposal for how to make your program lightning fast. Instead of traversing through all the directories (very slowly), you could look up the file list directly in the MFT, speeding up the search operation many times over. For example, a search over a couple of TB volumes with many files would probably take less than 10-20 seconds.

A pause function would also be nice.

Cheers,

Dave


Hi Kafu,

thanks, that fixed it.

Perfect :), will change it in the next release.

Regarding speed: I have a proposal for how to make your program lightning fast. Instead of traversing through all the directories (very slowly), you could look up the file list directly in the MFT, speeding up the search operation many times over. For example, a search over a couple of TB volumes with many files would probably take less than 10-20 seconds.

Makes more than sense and sounds promising... but I don't have a clue how to access the MFT with AutoIt :). Have you got example code or a link to follow?

A pause function would also be nice.

That should be easy, will do.

Best Regards

Edited by KaFu

@Kafu

This is what's involved in reading MFT data (records):

Call the API function <DeviceIoControl> with the FSCTL_GET_RETRIEVAL_POINTERS control code to find out where the MFT is located on the volume (the MFT can be fragmented).

Create a volume handle with the API function <CreateFile>.

Set a pointer to the MFT clusters on the volume by calling the API function <SetFilePointerEx>; you can use the volume handle you created for this call.

FSCTL_GET_RETRIEVAL_POINTERS
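
A minimal AutoIt sketch of the DllCall pattern for the three API calls above, as an illustration only: the FSCTL constant value and the output-buffer layout are assumptions, error handling is omitted, raw volume access needs admin rights, and whether the ioctl should be issued against the volume handle or against a handle referring to the MFT itself should be checked against the Windows documentation.

; Sketch only: open a raw volume handle, call DeviceIoControl with
; FSCTL_GET_RETRIEVAL_POINTERS, then position the pointer with SetFilePointerEx.
Local $aFile = DllCall("kernel32.dll", "handle", "CreateFileW", _
        "wstr", "\\.\C:", _        ; raw volume handle
        "dword", 0x80000000, _     ; GENERIC_READ
        "dword", 0x3, _            ; FILE_SHARE_READ | FILE_SHARE_WRITE
        "ptr", 0, _
        "dword", 3, _              ; OPEN_EXISTING
        "dword", 0, "ptr", 0)
Local $hVolume = $aFile[0]

Local $tIn = DllStructCreate("int64 StartingVcn") ; STARTING_VCN_INPUT_BUFFER
Local $tOut = DllStructCreate("dword ExtentCount; int64 StartingVcn; int64 Extents[64]") ; RETRIEVAL_POINTERS_BUFFER (NextVcn/Lcn pairs)
DllCall("kernel32.dll", "bool", "DeviceIoControl", "handle", $hVolume, _
        "dword", 0x00090073, _     ; FSCTL_GET_RETRIEVAL_POINTERS (assumed value)
        "struct*", $tIn, "dword", DllStructGetSize($tIn), _
        "struct*", $tOut, "dword", DllStructGetSize($tOut), _
        "dword*", 0, "ptr", 0)

Local $iOffset = 0 ; placeholder: derive the byte offset from the returned Lcn values * cluster size
DllCall("kernel32.dll", "bool", "SetFilePointerEx", "handle", $hVolume, _
        "int64", $iOffset, "int64*", 0, "dword", 0) ; 0 = FILE_BEGIN

DllCall("kernel32.dll", "bool", "CloseHandle", "handle", $hVolume)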

Regards,

ptrex


ah thanks.. haven't read it ^_^

a really great program.. but a bit slow :)

it first checks how many files there are and then checks them for duplicates.. that's like double the time?

Wouldn't it be better to do both at the same time.. or maybe analyze the first 10000 files and then start checking them?

still great :)

Edited by aphesia

a really great program.. but a bit slow :)

Not really, have you compared the runtime to other programs using md5 file comparison? I doubt there's any faster than mine (due to the trick with the "fake" md5-short calculation, i.e. calculating the md5 only on parts of large files instead of the full file).
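
To illustrate the md5-short trick, here is a rough sketch (made-up helper name, not SMF's actual routine): hash only the first and last chunk of a big file as a cheap fingerprint, and treat matching fingerprints as candidates to be confirmed by a full hash or byte compare.

; Illustrative "md5-short": fingerprint a file from its first and last 64 KB only.
#include <Crypt.au3>

Func _MD5Short($sFile, $iChunk = 65536)
    Local $hFile = FileOpen($sFile, 16) ; 16 = binary read mode
    If $hFile = -1 Then Return SetError(1, 0, "")
    Local $sData = String(FileRead($hFile, $iChunk))    ; first chunk, as hex string
    If FileGetSize($sFile) > 2 * $iChunk Then
        FileSetPos($hFile, -$iChunk, 2)                 ; 2 = seek from end of file
        $sData &= String(FileRead($hFile, $iChunk))     ; append last chunk
    EndIf
    FileClose($hFile)
    Return String(_Crypt_HashData($sData, $CALG_MD5))   ; candidate fingerprint only
EndFunc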

it first checks how many files there are and then checks them for duplicates.. that's like double the time?

Wouldn't it be better to do both at the same time.. or maybe analyze the first 10000 files and then start checking them?

No!

It's fully intentional that it works the way it does. I first acquire the filesize to pre-select the files to calculate the md5 on. I do this because getting the filesize is quick and calculating the md5 is slow, BUT only files with the same filesize can have the same md5 by definition!

Let's assume the following scenario:

Files: 100.000
Dup files: 1.200
Files with same ByteSize: 14.000
Evaluate filesize per file: 5ms
Calculate md5-short per file: 20ms
Calculate md5-full per file: ~300ms (est. average; is proportional to filesize!)


a) My method (2-run)
100.000 * 5ms + 14.000 * 20ms = 780 seconds or 13 minutes

b) 1-run, md5-short
100.000 * 20ms = 2.000 seconds or 33,33 minutes

c) 1-run, md5-full
100.000 * 300ms = 30.000 seconds or 8,33 hours

Results should be the same for all three approaches.
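
A bare-bones sketch of that two-run idea (illustrative names, not SMF's code): collect filesizes first, then hash only the files whose size occurs more than once.

#include <Crypt.au3>
#include <File.au3>

Func _FindDupsBySizeThenMD5($sDir)
    Local $aFiles = _FileListToArrayRec($sDir, "*", 1, 1, 0, 2) ; files only, recursive, full paths
    If @error Then Return SetError(1, 0, 0)
    Local $iSize, $aGroup, $sHash

    Local $oBySize = ObjCreate("Scripting.Dictionary") ; filesize -> pipe-joined list of paths
    For $i = 1 To $aFiles[0]
        $iSize = FileGetSize($aFiles[$i])
        If $oBySize.Exists($iSize) Then
            $oBySize.Item($iSize) = $oBySize.Item($iSize) & "|" & $aFiles[$i]
        Else
            $oBySize.Add($iSize, $aFiles[$i])
        EndIf
    Next

    Local $oByHash = ObjCreate("Scripting.Dictionary") ; md5 -> first path seen with that hash
    For $iSize In $oBySize.Keys()
        $aGroup = StringSplit($oBySize.Item($iSize), "|")
        If $aGroup[0] < 2 Then ContinueLoop ; unique filesize -> cannot have a duplicate, no hashing needed
        For $j = 1 To $aGroup[0]
            $sHash = String(_Crypt_HashFile($aGroup[$j], $CALG_MD5))
            If $oByHash.Exists($sHash) Then
                ConsoleWrite("Duplicate: " & $aGroup[$j] & " == " & $oByHash.Item($sHash) & @CRLF)
            Else
                $oByHash.Add($sHash, $aGroup[$j])
            EndIf
        Next
    Next
EndFunc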

What I can do (and will do in the next release) is make the initial acquisition of the files to analyze optional. That info is only useful for manually comparing the search base. The last release was about functionality, the next is about speed :)...

still great ^_^

Thanks anyhow :huh2:

Edit: Tweaked scenario...

Edited by KaFu

so here is another "bug" ?

files to analyze: 194000

analyzing: "file path here"

potential duplicates: 164000

duplicates found: 0

runtime: 1:48:00

task: preparing output

It's been standing on this one file for like 30 mins or longer (wasn't at home), probably because of the screensaver?

so there are still 30k files to go but task = preparing output ?

but probably I'm understanding something wrong.

And I got an idea, but I'm not sure how many resources it would take and if it is really faster:

At the first scan, it saves every filesize + the number of files in every folder to a .txt.

So let's say you have a folder:

c:/programs/itunes/iphone/apps/

and /apps/ contains the only changes in the whole /itunes/ folder.

So with that saved .txt it would check the /itunes/ folder -> something changed (different size + more files).

It will check all the files in /itunes/ to see if they are new or already listed in the .txt.

It will check all the folders in /itunes/ and find out that every folder has the same size and number of files as saved in the .txt, except the /iphone/ folder.

Again it will check all files in /iphone/ and find out which sub-folder has changed. And then it only checks the sub-folder which really changed.

You know what I mean? It would only scan the files from a folder where something really changed.

Otherwise there might be a folder:

c:/programs/itunes/lala/ -> with 1 GB of files.. and it would check ALL the files.. which takes a lot of time. But nothing has changed in there anyway?

Or maybe you can also get the md5 of a folder and check if that has changed.

I also had a similar idea: creating a program which checks files for duplicates right after download. The program would check the whole computer and save all files, by extension, in different lists:

.mp3.txt

.exe.txt

.dll.txt

...

and if you download an .mp3 it will check the .mp3.txt to see if there is a file with the same md5, same size, same song name,... whatever.. just to check if this file is already on the PC.

Hope no one will steal my idea now :).. I just wrote it down to explain a bit better what I mean with the .txt-save idea for your program.

So after writing this.. nothing has changed @SMF.. still preparing output and on the same file.. and it's not the last file.. there is still one more folder it should check (if it goes in alphabetical order).

bb

Edited by aphesia

so here is another "bug" ?

task: preparing output

You're right about the delay in the "Preparing output" phase :), guess it's an error I introduced with v0.4.8.9.2. During that task SMF parses through the obtained results and groups files with the same md5 into duplicates. It shouldn't really take that long. I will check it this evening; maybe meanwhile try v0.4.8.6.4 (tweak the icons as for the other version).

Thanks for the bug report :) (so much code, so many possibilities to mess up).

You know what I mean? It would only scan the files from a folder where something really changed.

Otherwise there might be a folder:

c:/programs/itunes/lala/ -> with 1 GB of files.. and it would check ALL the files.. which takes a lot of time. But nothing has changed in there anyway?

Or maybe you can also get the md5 of a folder and check if that has changed.

Yeah, that's about what I planned for the still inactive "Verify" button: you perform a scan, save the result and then perform a quick delta-scan (only calculate the md5 if a file is either new OR its filesize and/or filedate changed). Will see what effort that takes ^_^...
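
To make the delta-scan idea concrete, a rough sketch (illustrative names and storage format, not SMF's real code): keep size + modification time from the last run and only re-hash files that are new or whose size/date changed.

#include <Crypt.au3>

; $oLastRun / $oHashes are Scripting.Dictionary objects persisted from the previous scan,
; keyed by full path; $oLastRun stores "size|YYYYMMDDHHMMSS" per file.
Func _NeedsRehash($sFile, $oLastRun)
    Local $sNow = FileGetSize($sFile) & "|" & FileGetTime($sFile, 0, 1) ; 0 = modified, 1 = string format
    If Not $oLastRun.Exists($sFile) Then Return True ; new file
    Return $oLastRun.Item($sFile) <> $sNow           ; filesize or filedate changed
EndFunc

Func _DeltaHash($sFile, $oLastRun, $oHashes)
    If _NeedsRehash($sFile, $oLastRun) Then
        $oHashes.Item($sFile) = String(_Crypt_HashFile($sFile, $CALG_MD5)) ; recalculate md5
        $oLastRun.Item($sFile) = FileGetSize($sFile) & "|" & FileGetTime($sFile, 0, 1)
    EndIf
EndFunc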

Cheers

Edited by KaFu

hmm the delay is about 2 hours now :)

runtime: 2:16:54

and yeah, I think that saved-results thingy is a great feature.. will tell you if I get some more ideas ^_^.. ideas are nice and I have a lot of them.. but my scripting skills are too weak :huh2:

but for now I think it's already really great :)

Edited by aphesia

Changelog v0.4.8.9.2 > v0.4.9.3.2

As promised this update is about speed...

Benchmark                                        v0.4.9.3.2 (new)    v0.4.8.9.2 (old)

Initial startup time (including FileInstall())   2 sec               7 sec
Subsequent startup                               1 sec               7 sec

Searching 30.791 pictures in 354 folders
...for min info (uncached by Windows)            34 sec              -
...for min info (cached by Windows)              16 sec              17 sec
...for duplicates                                1 min 2 sec         1 min 57 sec
(13.534 pot. dups, 2.507 dups found)

I did something stupid in the duplicate-parsing step (introduced back in version 4.8.5.0): I used the function _SQLite_GetTable2d(). As that one is really slow and only useful for small amounts of data, it slowed the whole process down immensely.

It wasn't even visible to the user what SMF was doing at that point. So I swapped in different code there, which improves performance dramatically, and added a visible enumeration counter.

I also assume that the larger the amount of data analyzed, the more _SQLite_GetTable2d() slowed SMF down.
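
For illustration only (assumed table and column names, not SMF's actual schema or code), the swap is basically from loading everything at once with _SQLite_GetTable2d() to stepping through the result set row by row with _SQLite_Query() / _SQLite_FetchData():

#include <SQLite.au3>

Func _ParseDuplicates($hDB)
    Local $hQuery, $aRow
    If _SQLite_Query($hDB, "SELECT path, md5 FROM files ORDER BY md5;", $hQuery) <> $SQLITE_OK Then Return SetError(1, 0, 0)
    While _SQLite_FetchData($hQuery, $aRow) = $SQLITE_OK
        ; $aRow[0] = path, $aRow[1] = md5 -> group consecutive rows that share the same md5 here
    WEnd
    _SQLite_QueryFinalize($hQuery)
EndFunc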

General

  • Fixed (major): Slow parsing of duplicates
  • Fixed (major): Context-menu creation in the report was bugged on the second call, because the context-menu of the first instance was not destroyed correctly when the report was closed
  • Fixed (minor): Desktop Icon bug for Vista fixed, changed to shell32.dll icon
  • Fixed (minor): GIFs are still embedded IE instances (tried GDI+, but SMF always crashed on exit), but they are now covered with transparent labels for drag&drop protection
  • Added: PAUSE button to pause the search
  • Added: Combobox for Filesize filter (Bytes / Kilobytes / Megabytes)
  • Added: Detailed input fields for Filetime filters
  • Changed: DirGetSize now optional (Pre-Fetch), only really useful for manual comparison / verification
Next ToDo

  • Switch FileSlider from using Report-Listview call to SQLite statement usage (> major speed improvement expected)
  • Add functionality to still deactivated buttons in report context-menu
Source and Executable are available at

http://www.funk.eu

Best Regards

Edited by KaFu

  • 1 month later...

I just felt like it's time for another update on SMF... so here it comes :o


Changelog v0.4.9.3.2 > v0.5.0.0.9

General

  • Added: Search with DOS Wildcard expressions
  • Added: Various Drive infos can now be obtained
  • Added: Option to identify the true file type (e.g. if the extension is wrong), utilizes TrIDLib.dll
  • Added: Option to extract extensive info from media files, utilizes MediaInfo.dll
  • Added: Sample GUI to select and test Media info to extract


  • Added: crc32 and sha1 calculation, which may also be used for the duplicate search (crc32 is now the default, for speed)
  • Added: Search for ADS (Alternate Data Streams)
  • Fixed: Errors in the filesize and filedate filters
  • Changed: Progress tab layout now shows much more detailed information


  • Started - unfinished: Implement CM - Continuous Monitoring functionality to update changes in search results in the background
  • Added: Temporary database can now be in memory OR on disk
  • Added: Last search is now dumped to disk and reloaded on restart of SMF
  • Changed: Added a record offset to the report, counts of total and shown records, and buttons to browse through the results


  • Added, fixed, changed: Lots and lots of minor details I forgot, check it out...

Next ToDos

  • Update Help-File
  • Clean Up GUI
  • Finalize CM - Continuous Monitoring functionality

    Can already be activated on the "Settings" tab to monitor changes (it starts the background app "SMF-Watcher.exe"), though the changes are not yet reflected in the result set

  • Finalize Reload of selected directories in TreeView after restart.
  • Deactivate Hotkeys if SMF does not have focus
  • Switch FileSlider from using Report-Listview call to SQLite statement usage (> major speed improvement expected)
  • Add functionality to still deactivated buttons in report context-menu
  • FIX BUGS YOU REPORT... including custom UDFs it's now clearly over 10.000 lines of code, lots of potential for bugs :D
Source and Executable are available at

http://www.funk.eu

Best Regards

Edited by KaFu

I get this error with new script(s):

SMF_v0_5_0_0_9.source\SMF__Main_v0_5_0_0_9.au3 (8492) : ==> Variable used without being declared.

Is that to be expected with au3?

Sorry, I can't reproduce your problem. When do you get the error? With the new scripts? Which new scripts?

Cheers


I too get an error with both the executable and the source.

The executable gets up to the point in the splash/loading screen where it says 'Populating Folder-Tree' and then pops up with an error: 'Variable used without being declared'.

Running the source code in Scite, I get:

SMF__Main_v0_5_0_0_9.au3 (8492) : ==> Variable used without being declared.:

$a_irfanview_version = StringSplit(FileGetVersion($s_irfanview_location), '.')

$a_irfanview_version = StringSplit(FileGetVersion(^ ERROR


SMF__Main_v0_5_0_0_9.au3 (8492) : ==> Variable used without being declared.:

$a_irfanview_version = StringSplit(FileGetVersion($s_irfanview_location), '.')

$a_irfanview_version = StringSplit(FileGetVersion(^ ERROR

We don't have IrfanView installed.

Damn :D, IrfanView is installed on all my machines by default.

All that is missing is a

Global $s_irfanview_location

declaration at the beginning. Updated the download files...
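
For anyone patching a local copy, the fix boils down to something like this near the top of the script (the FileExists guard is an extra safety net, and how SMF actually locates IrfanView is not shown here):

Global $s_irfanview_location ; the previously missing declaration
Global $a_irfanview_version

If $s_irfanview_location <> "" And FileExists($s_irfanview_location) Then ; only query the version if IrfanView was found
    $a_irfanview_version = StringSplit(FileGetVersion($s_irfanview_location), '.')
EndIf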

Thanks for the feedback m8's.

Best Regards

