[Gambas-user] Binary compare of files?

Kari Laine klaine8 at ...626...
Fri Oct 17 20:57:33 CEST 2008


On Fri, Oct 17, 2008 at 9:11 PM, Ron_1st <ronstk at ...239...> wrote:

> On Friday 17 October 2008, Kari Laine wrote:
> > On Fri, Oct 17, 2008 at 5:44 PM, Kari Laine <klaine8 at ...626...> wrote:
> >
> > > On Fri, Oct 17, 2008 at 12:28 PM, Stefano Palmeri <rospolosco at ...152...> wrote:
> > >
> > >> On Friday, 17 October 2008 at 10:28:28, Ron_1st wrote:
> > >> > On Friday 17 October 2008, Stefano Palmeri wrote:
> > >> > > If you only want to know if two files are identical,
> > >> > > you could use md5sum.
> > >> > >
> > >> > > Ciao,
> > >> > >
> > >> > > Stefano
> > >> >
> > >> > A little misunderstanding of MD5.
> > >> > You know for _*_sure_*_ that if the sums differ, the files are not
> > >> > equal.
> > >> >
> > >> > You may _*_assume_*_ they are the same if the sums are equal.
> > >> >
> > >> > As a first test, to know whether you need to investigate further
> > >> > whether the files really are 100% the same, it is helpful.
> > >> > The real test could be done with e.g. a shell call to the diff or
> > >> > cmp command.
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > Best regards
> > >> > Ron_1st
> > >> >
> > >>
> > >> Thanks Ron! I didn't know about this MD5 issue ("collision").
> > >>
> > >>
> > >>
> > >> http://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html
> > >>
> > >> I've always believed that md5sum was 100% safe.
> > >> There's always something to learn.
> > >>
> > >> Ciao,
> > >>
> > >> Stefano
> > >>
> > >
> > > Hi, thanks for the comments,
> > >
> > > I am writing a backup program. I have been collecting things (files)
> > > for 15 years now and have ended up with an uncontrolled pile of hard
> > > disks containing backups, backups of backups, and partly backups under
> > > different names, and so on. To clean up this mess, one feature of my
> > > backup program is that it should not back up a file twice, even if it
> > > exists under a different name. Therefore I turned to MD5. But this
> > > collision issue is not good. Are there better checksumming algorithms
> > > with a longer fingerprint? Any ideas how to do this?
> > >
> > > I cannot use a file compare, because when later backing up a particular
> > > disk, the colliding file may no longer be mounted on the machine, or
> > > might be mounted in a different directory. So I need a way to store
> > > enough information to make that decision from what is in the database,
> > > without access to the colliding file. The size might help. The name
> > > cannot be used, because I have ended up with a lot of duplicate files
> > > under different names.
> > >
> > > I am now going to test this collision thing. I will checksum 5 TB of
> > > files and see how many collisions I get. If it is fewer than 10 files,
> > > then I don't care.
> > >
> > >
> > > Best Regards
> > > Kari Laine
> > >
> > >
> > >
> > >
> > OK, I should have looked around myself. I found the command sha512sum,
> > which seems to calculate longer fingerprints. Is this better at avoiding
> > collisions?
> > How about calculating both an md5sum and a sha512sum for each file - do
> > collisions happen in different places with the different methods?
> >
> > Thankful for any advice.
> >
> >
> > Best Regards
> > Kari Laine
>
> If there is no problem with keeping both, I would use both.
> The reason is very simple.
>
> If a manual check on files is needed, then md5sum is almost always
> installed as standard, is easy to use for testing, and is relatively
> quick.
>
> If the MD5 sums are equal, you can check with SHA to verify that the
> files do not differ in any aspect, at the cost of more time for the test.
> For SHA you may have to install the sha package first.
>
> These methods never tell you the truth with 100% certainty, but they are
> a safe indication that the files are different and not the same.
> The only 100% certain test is a byte-for-byte compare, which is not nice
> to store in the database. :)
>
> I am curious what the test with 5 TB will show; however, it is not so
> much how many bytes, but rather how many files are involved.
>
>
>
>
>
> Best regards
> Ron_1st
>
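
Along the lines of what Ron suggests above, the two-stage check I have in
mind looks roughly like the Gambas sketch below. It is only a sketch:
ChecksumOf and ProbablySame are made-up names, it just shells out to the
md5sum and sha512sum tools, and the quoting of the path is naive (it breaks
on file names containing a ' character).

' Run an external checksum tool and keep only the hex digest.
' md5sum/sha512sum print "<digest>  <file name>" on stdout.
Public Function ChecksumOf(sTool As String, sPath As String) As String
  Dim sOutput As String
  Shell sTool & " '" & sPath & "'" To sOutput
  Return Split(Trim$(sOutput), " ")[0]
End

Public Function ProbablySame(sPath1 As String, sPath2 As String) As Boolean
  ' Cheapest test first: files of different size can never be identical.
  If Stat(sPath1).Size <> Stat(sPath2).Size Then Return False
  ' MD5 is quick; only when it matches do the slower SHA-512 run.
  If ChecksumOf("md5sum", sPath1) <> ChecksumOf("md5sum", sPath2) Then Return False
  Return ChecksumOf("sha512sum", sPath1) = ChecksumOf("sha512sum", sPath2)
End
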
The test is running ... :-)
It will take a couple of days. There are a few huge files, like VMware
virtual machine images and ISOs, but I took the liberty of including in the
test only files for which IF Stat(hak &/ filename).Size < 62549803 holds.
That means probably a few million files. I wonder how big the database is
going to be. Then I would have a good test case for measuring how much
indexes in MySQL help searching.
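
In case it helps, the scanning loop is roughly like the sketch below. Again
just a sketch with made-up names: ScanDir is hypothetical, ChecksumOf is the
helper from the earlier sketch, and the real program writes a row to MySQL
where this one only prints a line per file.

' Sketch only: walk a directory tree and emit "path;size;md5" for every
' file that passes the size filter.
Public Sub ScanDir(sDir As String)
  Dim sName As String
  Dim sPath As String

  For Each sName In Dir(sDir)
    sPath = sDir &/ sName
    If Stat(sPath).Type = gb.Directory Then
      ScanDir(sPath)  ' recurse into subdirectories
    Else If Stat(sPath).Size < 62549803 Then
      ' Skip the huge files (VMware images, ISOs) for this test run.
      Print sPath; ";"; Stat(sPath).Size; ";"; ChecksumOf("md5sum", sPath)
    Endif
  Next
End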

I don't know how many files it is going to be. I will post my findings to
this thread, provided there is no blackout, my program does not bomb out
after a week, and there is no hardware meltdown :-)
But I think that using three checksums combined with the file size would be
a sure enough way to judge files equal. This test is just out of curiosity
and will probably fail because of an error or some other reason. The
electricity is not very good here and I don't have a UPS (the one I had
failed a few weeks ago).
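
Very roughly, the idea is to store a key like the one below in the database
for every file, so that a later backup run can decide "I already have this
one" without needing to touch the original file. FileKey is again just a
made-up name reusing the ChecksumOf helper from the first sketch; a third
digest (sha1sum, say) could be appended in the same way.

' Sketch only: size plus two digests in one string. Two different files get
' the same key only if the size, the MD5 and the SHA-512 all collide at once.
Public Function FileKey(sPath As String) As String
  Dim sKey As String
  sKey = Str$(Stat(sPath).Size)
  sKey = sKey & "/" & ChecksumOf("md5sum", sPath)
  sKey = sKey & "/" & ChecksumOf("sha512sum", sPath)
  Return sKey
End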

As a side comment, this programming with Gambas is fun! If only a few areas
were documented better - but I trust I will figure them out later. And this
list is a great source.
I still have to do the user interface, and for that I am probably going to
need your help. If I ever get this program to production level, it is going
to be GPL.


Best Regards
Kari Laine


