[Gambas-user] MD5SUM collisions

Doriano Blengino doriano.blengino at ...1909...
Sat Oct 25 08:58:11 CEST 2008


Kari Laine wrote:
> Hi,
>
> referring to the discussion a few days back, I have now tested md5sum with
> 540388 files and got NO collisions - I think. The method I used was to
> calculate md5sum and sha512sum for all those files.
> Then I asked the database for the number of distinct values in each field,
> and they came up with the same number. In my thinking, if there had been
> collisions with md5sum, then the numbers should have differed.
> The problem is if md5sum and sha512sum were to have collisions on the same
> files - but I think that's quite unlikely - is it?
>   
Recursive problem:
  The chance that md5sum reports two files as identical when they are not
is very low.
  The chance that sha512sum reports two files as identical when they are
not is very low.
  The chance that both md5sum and sha512sum, applied to the same set of
files, fail on the same files... could be roughly the product of the two
chances above, assuming the two failures are independent. If there is a
good mathematician on this list, he/she could be more precise.
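
To put rough numbers on this, here is a minimal Python sketch of the usual
"birthday" upper bound. It assumes both checksums behave like independent,
uniformly random functions and that nobody crafted the files on purpose to
collide, so it only covers accidental collisions. The function name
collision_bound is made up for this illustration.

```python
def collision_bound(n_files: int, hash_bits: int) -> float:
    """Upper bound on the chance that at least one pair among n_files
    distinct files shares a checksum, for an ideal hash of hash_bits bits.

    Each pair collides with probability 2**-hash_bits; summing over all
    pairs (union bound) can only overestimate the true probability.
    """
    pairs = n_files * (n_files - 1) // 2
    return pairs / 2 ** hash_bits


n = 540388  # number of files in Kari's test

p_md5 = collision_bound(n, 128)      # md5sum is 128 bits wide
p_sha512 = collision_bound(n, 512)   # sha512sum is 512 bits wide

# Both checksums failing on the SAME pair of files: under the independence
# assumption this is like one ideal checksum of 128 + 512 bits.
p_both = collision_bound(n, 128 + 512)

print("md5sum alone:    ", p_md5)
print("sha512sum alone: ", p_sha512)
print("both, same pair: ", p_both)
```

Under these assumptions every one of the three figures is far below anything
worth worrying about for half a million files.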

> I will make my backup program identify a file by its sha512sum and size. I
> am not going to run a compare on all the files because that would take a
> month on my machine. Calculating the checksums took about a week because
> there were 1240333 files in total. So there are a lot of duplicates, which
> is exactly why I want to make the program copy each file only once.
>   
Just for the sake of discussion - you could make a list of the files
ordered by size, then scan the list from top to bottom and compare two
files only if they have the same size. This should not take too much time.
I wonder if a simple Unix tool can do that: who knows if /usr/bin/sort can
sort a text file of over a million lines? Or a database can be used, and
then "select ... order by size"...
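
The same idea can be sketched in Python instead of sort or a database:
group the paths by size, then do a byte-by-byte comparison only inside
groups that hold more than one file. This is only an illustration under the
assumption that the list of paths fits in memory; find_duplicates is a
hypothetical name, while os.path.getsize and filecmp.cmp come from the
standard library.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    # Step 1: bucket files by size; files of different size cannot match.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size, nothing to compare
        # Step 2: split the bucket into classes of identical content.
        classes = []
        for path in same_size:
            for cls in classes:
                # shallow=False forces a real content comparison.
                if filecmp.cmp(cls[0], path, shallow=False):
                    cls.append(path)
                    break
            else:
                classes.append([path])
        groups.extend(cls for cls in classes if len(cls) > 1)
    return groups
```

Each file is compared only against one representative per class, so a bucket
full of true duplicates costs one comparison per file. A bucket of many
same-sized but different files is the slow case.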

But this is just playing around - I think md5sum is enough. Perhaps you
know Rapidshare, or other file-hosting sites like Filemojo and many others.
They host an incredibly high number of files. If a user uploads illegal
material, they "ban" that file and remember it, so the user cannot
upload the same file under a different name. They do that using some
checksumming method, so I guess that method is good enough to distinguish
among many billions (or more?) of files. The same goes for Emule, Edonkey,
Bittorrent, and so on. What uniquely identifies a file is its checksum
(I don't remember the exact method). Maybe, if you go to
www.emule.com, you can find more information about this.

Happy data crunching - and let your CPU cool down a bit... :-)
Cheers,
Doriano





