Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am fine-tuning my old script and found the following command in it, used to get the md5sum of a file:

md5sum filename | cut -f1

I removed the cut and used Perl's split instead, which avoids creating thousands of extra child processes. (Am I right that each cut invocation creates one?)
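
For reference, the replacement looks roughly like this (just a sketch; $file stands for whatever path the script is handling, and the output is captured with backticks):

my $output = `md5sum $file`;          # still one md5sum process per file
my ($sum)  = split /\s+/, $output;    # first field is the checksum, no cut needed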

As I understand it, md5sum itself also creates a new process for every file it handles. On average this script processes thousands of files, so dropping cut has roughly halved the process count (from about 2000 to 1000). But do I really need to execute the md5sum command at all, or is there some other way to get the md5sum value?

Note:

1. I looked for an md5sum module, but I don't think one is available by default. (Please don't suggest anything other than the standard modules that come with Perl.)

2. Alternatively, can we run md5sum once (a single process that computes the md5sum for all files) and then use its output in my script? (See the sketch just below.)
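
To illustrate what I mean in note 2, something like the following (only a sketch; @files is just an illustrative name for the list of paths):

open my $md5_fh, '-|', 'md5sum', @files or die "Cannot run md5sum: $!";
my %md5sum_of;
while (<$md5_fh>) {
    chomp;
    # md5sum prints "<checksum>  <filename>" per line
    my ($sum, $name) = split /\s+/, $_, 2;
    $md5sum_of{$name} = $sum;
}
close $md5_fh;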

Re: md5sum for each files costly ?!
by Corion (Patriarch) on Aug 14, 2010 at 16:20 UTC
Re: md5sum for each files costly ?!
by repellent (Priest) on Aug 14, 2010 at 20:39 UTC
    Digest::file is a core module.
    use Digest::file qw(digest_file_hex);

    my $file = "/some/path/to/file";
    my $md5sum = digest_file_hex($file, "MD5");
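
    Applied to the original problem, that removes the external md5sum calls entirely; a minimal sketch, assuming @files holds the list of paths to hash:

    use Digest::file qw(digest_file_hex);

    my %md5sum_of;
    for my $file (@files) {
        # no child process per file -- the digest is computed in-process
        $md5sum_of{$file} = digest_file_hex($file, "MD5");
    }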

      ++ I didn't know about this.

Re: md5sum for each files costly ?!
by zentara (Cardinal) on Aug 14, 2010 at 17:22 UTC
    See md5 sum different on windows and unix for win.exe files !!? for how to use Perl's md5sum and avoid forking altogether. If I were writing a program that needed md5sums for thousands of files, I would probably try to set up a permanent worker thread that takes a file and returns its sum without forking; for example, set up the md5sum binary once in the thread with IPC, print file data to its STDIN using the '-' option, and collect the output back through IPC. That way you only start one child process per worker thread... and you could run multiple summing threads to speed it up.
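
    As a rough sketch of the no-fork approach (using the core Digest::MD5 module; $path here is just a placeholder for one of your files):

    use Digest::MD5;

    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;    # hash the raw bytes, like md5sum does
    my $md5sum = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;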

    If you wanted to loop through all files looking for duplicates, you could use File::Find to walk them and store the md5sums in a hash.

    #!/usr/bin/perl -w
    use File::Find;
    use Digest::MD5 qw(md5_hex);

    # group files by size first -- only same-sized files can be duplicates
    my %same_sized;
    find sub {
        return unless -f and my $size = -s _;
        push @{$same_sized{$size}}, $File::Find::name;
    }, @ARGV;

    for (values %same_sized) {
        next unless (@ARGV = @$_) > 1;   # reuse @ARGV so <> reads this group
        local $/;                        # slurp each file whole
        my %md5;
        while (<>) {
            push @{$md5{md5_hex($_)}}, $ARGV;
        }
        for (values %md5) {
            next unless (my @same = @$_) > 1;
            print join(" ", sort @same), "\n";
        }
    }
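
    Saved as, say, find_dupes.pl (the name is just an example), you would run it with one or more directories as arguments, e.g. perl find_dupes.pl /some/dir; each line of output is one group of files whose contents hash to the same MD5.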

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku