Find dup files and display the output of the script in /tmp dir

anthonyraj75 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Find dup files and display the output of the script in /tmp dir by biohisham (Priest) on Nov 29, 2009 at 17:35 UTC
First, Welcome to PerlMonks and wish you a nice Perl journey. Update your post and show us that code, and what you've been trying to do to reach your goal, the point is to get a hands-on practical learning experience by interacting with Monks in the Monastery and we all care for everyone who shows to possess some learning potential.. check File::Compare, File and Directory Functions... Best of Luck.... Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.	[reply] [d/l]
Re^2: Find dup files and display the output of the script in /tmp dir by anthonyraj75 (Initiate) on Nov 30, 2009 at 11:10 UTC
Thanks biohisham. Here is the below code: #!/usr/bin/perl -w # usesage: dupDisplay.pl fileMD5.txt [remove] # input file has the following form: # 8e773d2546655b84dd1fdd31c735113e 304048 /media/PICTURES-1/my +media/pictures/pics/20041004-kids-camera/im001020.jpg im001020.jpg # e01d4d804d454dd1fb6150fc74a0912d 296663 /media/PICTURES-1/my +media/pictures/pics/20041004-kids-camera/im001021.jpg im001021.jpg use strict; use warnings; my %seen; my $fileCNT = 0; my $origCNT = 0; my $delCNT = 0; my $failCNT = 0; my $remove = 'remove' if $ARGV[1]; $remove = '' if !$ARGV[1]; print "\n\n ... running in NON removal mode.\n\n" if !$remove; open IN,"< $ARGV[0]" or die ".. we don't see a file to read: $ARGV[0]" +; open OUT,"> $ARGV[0]_new.temp" or die ".. we can't write the file: $AR +GV[0]_new.temp"; open OUTdel,"> $ARGV[0]_deleted" or die ".. we can't write the file: $ +ARGV[0]_deleted"; open OUTfail,"> $ARGV[0]_failed" or die ".. we can't write the file: $ +ARGV[0]_failed"; print "\n ... starting to read find duplicats in: $ARGV[0]\n"; if(! -d './trash/'){mkdir './trash/' or die " !! couldn't make trash d +irectory.\n $! \n";} while(<IN>){ my $line = $_; chomp $line; $fileCNT++; my ($md5,$filesize,$pathfile,$file) = split /\t+/,$line,4; if(exists $seen{"$md5:$filesize"}){ my $timenow = time; my $trashFile = './trash/' . $file . "_" . $timenow; # moves dup +licate file to trash with timestamp extension. #if( ! unlink($pathfile){print OUTfail "$pathfile\n"; $failCNT+ ++;} if($remove){if( ! rename $pathfile,$trashFile){print OUTfail "$pa +thfile\n"; $failCNT++;}} $seen{"$md5:$filesize"} .= "\n $pathfile"; $delCNT++; print " files: $fileCNT originals: $origCNT files to delete: $d +elCNT failed: $failCNT \r"; }else{ $seen{"$md5:$filesize"} = "$pathfile"; printf OUT ("%32s\t%8d\t%s\t%s\n", $md5,$filesize,$pathfile,$file +); $origCNT++; print " files: $fileCNT originals: $origCNT files to delete: $d +elCNT failed: $failCNT \r"; } } foreach my $key (keys %seen){ print OUTdel " $seen{$key}\n"; } print " files: $fileCNT originals: $origCNT files to delete: $delCNT + failed: $failCNT \n\n"; [download] This code is to find all the dup files and move to ./trash directory, but I wanted: 1. The dup files to remain as it is and 2. Display the dup files with the path and filesize in ./tmp directory without moving/removing any files.	[reply] [d/l]
Re^3: Find dup files and display the output of the script in /tmp dir by gmargo (Hermit) on Nov 30, 2009 at 14:17 UTC
If you're going to copy code from another node (Re^2: Find duplicate files.) then at least use the download link, and not just copy/paste with the mouse. Then you won't have all those plus signs in there that show you have not even compiled ("perl -c file...") the downloaded code. If you want to start with that code, then all you'd have to do is to changed the `rename` statement so that it calls a non-damaging function instead. Then tweak to your heart's delight. Give It A Try.	[reply] [d/l]
Re^3: Find dup files and display the output of the script in /tmp dir by biohisham (Priest) on Nov 30, 2009 at 16:11 UTC
If you don't provide $ARGV[1] (i.e 'remove') on the command line then the code would not enforce removal of any duplicate files...I've changed a little bit in the design of the code to make it easier for me when reading from a fileMD5.txt instead of having to provide for it through $ARGV[0].. I could not see the reason why a data structure like "`$seen{"$md5:$filesize"}`" should be used so instead I made the hash key be the filename only, considering that a duplicate file is identified by its name... The test text file 'fileMD5.txt' contains the following data: 8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im +001020.jpg e01d4d804d454dd1fb 296663 pics/20041004kidscamera/im001021.jpg im +001021.jpg 8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im +001020.jpg 8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im +001020.jpg 8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im +001020.jpg 8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im +001020.jpg e01d4d804d454dd1fb 296663 pics/20041004kidscamera/im001021.jpg im +001021.jpg e01d4d804d454dd1fb 296663 pics/20041004kidscamera/im001021.jpg im +001021.jpg 8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im +001020.jpg 8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im +001020.jpg e01d4d804d454dd1fb 296663 pics/20041004kidscamera/im001021.jpg im +001021.jpg e01d4d804d454dd1fb 296663 pics/20041004kidscamera/im001021.jpg im +001021.jpg [download] Here is just an illustration of a concept that you can incorporate, it gives you a list of unique files on the STDOUT console and a list of those duplicates would be printed onto a temp file, my approach is open for criticism, modification and comments for I am a learner myself, best of luck: use strict; use warnings; my $fileCNT = 0; my $dupCNT = 0; my $origCNT = 0; my %seen; my %duplicates; open IN,"< fileMD5.txt" or die ".. we don't see a file to read\n"; print " ... starting to read find duplicats in fileMD5.txt.....\n\n\n" +; while(<IN>){ my $line = $_; chomp $line; $fileCNT++; my ($md5,$filesize,$pathfile,$file) = split /\s+/,$line; if(exists $seen{"$file"}){ $dupCNT++; push @{$duplicates{"$file"}}, ([$pathfile,$filesize]); }else{ push @{$seen{"$file"}}, ([$md5,$pathfile,$filesize]); $origCNT++; } } open (TEMP, '>>', "_Duplicate.temp") or die("can not create temporary +file\n"); print TEMP "FilePath",' ' x 40,"File Size\n\n"; for my $key(keys %duplicates){ my $lineNo=1; for my $info(@{$duplicates{$key}}){ print TEMP "@{[$lineNo++]}:@$info\n\n\n"; } } print "\U\t\toriginal files:\E\n\n"; print "FileName",' 'x20,"Location",' 'x 20,"size\n"; print "_" x 70,"\n"; for my $key(keys %seen){ my $lineNo = 1; for my $info(@{$seen{$key}}){ print "@$info\n"; } } print "\n\n\USummary\E:files: $fileCNT originals: $origCNT files dup +licated: $dupCNT.\E\n\n"; #use Data::Dumper; #print Dumper(\%duplicates); #print Dumper(\%seen); [download] Output to console: `... starting to read find duplicats in fileMD5.txt..... ORIGINAL FILES: FileName Location size ____________________________________________________________________ 8e773d2546655b84dd pics/20041004kidscamera/im001020.jpg 304048 e01d4d804d454dd1fb pics/20041004kidscamera/im001021.jpg 296663 SUMMARY:files: 12 originals: 2 files duplicated: 10.` [download] Output to temp file: `FilePath File Size 1:pics/20041004kidscamera/im001020.jpg 304048 2:pics/20041004kidscamera/im001020.jpg 304048 3:pics/20041004kidscamera/im001020.jpg 304048 4:pics/20041004kidscamera/im001020.jpg 304048 5:pics/20041004kidscamera/im001020.jpg 304048 6:pics/20041004kidscamera/im001020.jpg 304048 1:pics/20041004kidscamera/im001021.jpg 296663 2:pics/20041004kidscamera/im001021.jpg 296663 3:pics/20041004kidscamera/im001021.jpg 296663 4:pics/20041004kidscamera/im001021.jpg 296663` [download] Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.	[reply] [d/l] [select]
Re: Find dup files and display the output of the script in /tmp dir by holli (Abbot) on Nov 29, 2009 at 15:11 UTC
/me suggests you at least try to do you homework alone. holli You can lead your users to water, but alas, you cannot drown them.	[reply] [d/l]
Re^2: Find dup files and display the output of the script in /tmp dir by anthonyraj75 (Initiate) on Nov 29, 2009 at 16:50 UTC
if I knew I wouldn't posted a question for help - Thanks holli Can someone please suggest?	[reply]
Re^3: Find dup files and display the output of the script in /tmp dir by gmargo (Hermit) on Nov 29, 2009 at 17:23 UTC
I suggest you give it a try. Post code that you've written when you run into a problem. You can look around on Perlmonks and gather snippets to do the job and practically not have to think at all. This problem has been addressed by others many times, and as recently as two days ago, for instance in the thread file comparison using file open in binary mode.. There is no one perfect way. The rule in Perl is "There's More Than One Way To Do It". However, the only way to learn to write code is to write code. Give it Old College Try.	[reply]
Re: Find dup files and display the output of the script in /tmp dir by ambrus (Abbot) on Nov 30, 2009 at 16:21 UTC
What is this code "lemming's" had posted? Could you give a link? Alternately, just search perlmonks for code the searches duplicate files, there's bound to be plenty of it. Hmm, actually I did the search and there's not that many of them posted. Still, there are lots on the internet outside of the monastery too. Find duplicate files. (though you'll have to modify it to actually list the paths), Recursive search for duplicate files (searches for files with the same name)\ Replace duplicate files with hardlinks Removing Duplicate Files Just `find -exec md5sum -- {} + \| sort \| uniq -D` (Some of the above might be duplicates.)	[reply] [d/l]
Re^2: Find dup files and display the output of the script in /tmp dir by Anonymous Monk on Dec 06, 2009 at 08:58 UTC
lemming's code is posted at: http://perlmonks.org/?node_id=810007	[reply]