[Gambas-user] MD5SUM collisions
Doriano Blengino
doriano.blengino at ...1909...
Sat Oct 25 08:58:11 CEST 2008
Kari Laine wrote:
> Hi,
>
> referring to the discussion a few days back, I have now tested md5sum on
> 540388 files and got NO collisions - I think. The method I used was to
> calculate md5sum and sha512sum for all those files.
> Then I asked the database for the number of distinct values in each of the
> two fields, and both came up with the same number. By my reasoning, if there
> had been collisions with md5sum, the numbers would have differed.
> The problem would be if md5sum and sha512sum both had a collision on the same
> files - but I think that's quite unlikely - is it?
>
The problem is recursive:
The chance that md5sum reports two files as identical when they are not
is very low.
The chance that sha512sum reports two files as identical when they are
not is very low.
The chance that both md5sum and sha512sum fail on the same pair of
files could be roughly the product of the two chances above, assuming
the two methods fail independently of each other. If there is a good
mathematician on this list, he/she could be more precise.
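To put rough numbers on this, here is a minimal sketch in Python. It
assumes each hash behaves like a uniformly random function and that nobody
crafted the files on purpose to collide (neither assumption comes from the
thread). Under those assumptions the chance of at least one collision
among n files is at most (number of pairs) / 2^bits, the usual "birthday"
bound. The function name birthday_bound is only illustrative.

```python
from fractions import Fraction


def birthday_bound(n_files: int, hash_bits: int) -> Fraction:
    """Upper bound on the chance that any two of n_files random inputs
    end up with the same digest of hash_bits bits."""
    pairs = n_files * (n_files - 1) // 2  # number of distinct file pairs
    return Fraction(pairs, 2 ** hash_bits)


n = 540388  # file count taken from Kari's test

p_md5 = birthday_bound(n, 128)     # MD5 digests are 128 bits long
p_sha512 = birthday_bound(n, 512)  # SHA-512 digests are 512 bits long
# Both hashes colliding on the same pair: per-pair chances multiply,
# which is the same as one combined 640-bit digest (independence assumed).
p_both = birthday_bound(n, 128 + 512)

print(f"md5 alone:     {float(p_md5):.3e}")
print(f"sha512 alone:  {float(p_sha512):.3e}")
print(f"both together: {float(p_both):.3e}")
```

Even for MD5 alone the bound is far below anything a backup program has
to worry about with ordinary files; deliberately constructed MD5
collisions are a separate matter that this estimate does not cover.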
> I will make my backup program identify a file by its sha512sum and size. I
> am not going to run a compare on all the files because that would take a
> month on my machine. Calculating the checksums took about a week because
> there were 1240333 files in total. So there are a lot of duplicates, which
> is exactly why I want to make the program copy each file only once.
>
Just as a thought - you could make a list of the files ordered by
size, then scan the list from top to bottom and compare two files only
if they have the same size. This should not take too much time. I wonder
if a simple Unix tool can do that: who knows if /usr/bin/sort can sort a
text file of over a million lines? Or a database can be used, and then
"select order by size"...
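The size-first idea can be sketched like this in Python. It is only an
illustration, not a tested backup tool: the function name find_duplicates
is made up, files are grouped by size with a dictionary instead of a
sorted list, and the byte-by-byte comparison uses the standard library's
filecmp.cmp with shallow=False.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(root):
    """Return groups of paths under root whose contents are identical."""
    # Step 1: bucket regular files by size (cheap, no file is read).
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Step 2: compare contents only inside buckets with 2+ files.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        clusters = []  # each cluster holds files with equal contents
        for path in paths:
            for cluster in clusters:
                # shallow=False forces a real byte-by-byte comparison
                if filecmp.cmp(cluster[0], path, shallow=False):
                    cluster.append(path)
                    break
            else:
                clusters.append([path])
        groups.extend(c for c in clusters if len(c) > 1)
    return groups
```

Calling find_duplicates("/some/dir") (the path is a placeholder) returns a
list of lists, one per set of identical files. Files with a unique size
are never opened, which is where the time saving would come from.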
But this is just for fun - I think md5sum is enough. Perhaps you know
Rapidshare, or other file-hosting sites like Filemojo and many others.
They host an incredibly high number of files. If a user uploads illegal
material, they "ban" that file and remember it, so the user cannot
upload the same file under a different name. They do that using some
checksumming method, so I guess that method is good enough to distinguish
among many billions (or more?) of files. The same goes for Emule, Edonkey,
Bittorrent, and so on: what uniquely identifies a file is its checksum
(I don't remember the exact method). Maybe, if you go to
www.emule.com, you can find more information about this.
Happy data crunching - and let your CPU cool down a bit... :-)
Cheers,
Doriano