Re: Find dup files and display the output of the script in /tmp dir

Re^2: Find dup files and display the output of the script in /tmp dir by anthonyraj75 (Initiate) on Nov 30, 2009 at 11:10 UTC
Thanks biohisham. Here is the code:

    #!/usr/bin/perl -w
    # usesage: dupDisplay.pl fileMD5.txt [remove]
    # input file has the following form:
    # 8e773d2546655b84dd1fdd31c735113e 304048 /media/PICTURES-1/my
    +media/pictures/pics/20041004-kids-camera/im001020.jpg im001020.jpg
    # e01d4d804d454dd1fb6150fc74a0912d 296663 /media/PICTURES-1/my
    +media/pictures/pics/20041004-kids-camera/im001021.jpg im001021.jpg
    use strict;
    use warnings;
    my %seen;
    my $fileCNT = 0;
    my $origCNT = 0;
    my $delCNT = 0;
    my $failCNT = 0;
    my $remove = 'remove' if $ARGV[1];
    $remove = '' if !$ARGV[1];
    print "\n\n ... running in NON removal mode.\n\n" if !$remove;
    open IN,"< $ARGV[0]" or die ".. we don't see a file to read: $ARGV[0]"
    +;
    open OUT,"> $ARGV[0]_new.temp" or die ".. we can't write the file: $AR
    +GV[0]_new.temp";
    open OUTdel,"> $ARGV[0]_deleted" or die ".. we can't write the file: $
    +ARGV[0]_deleted";
    open OUTfail,"> $ARGV[0]_failed" or die ".. we can't write the file: $
    +ARGV[0]_failed";
    print "\n ... starting to read find duplicats in: $ARGV[0]\n";
    if(! -d './trash/'){mkdir './trash/' or die " !! couldn't make trash d
    +irectory.\n $! \n";}
    while(<IN>){
      my $line = $_;
      chomp $line;
      $fileCNT++;
      my ($md5,$filesize,$pathfile,$file) = split /\t+/,$line,4;
      if(exists $seen{"$md5:$filesize"}){
        my $timenow = time;
        my $trashFile = './trash/' . $file . "_" . $timenow; # moves dup
    +licate file to trash with timestamp extension.
        #if( ! unlink($pathfile){print OUTfail "$pathfile\n"; $failCNT+
    ++;}
        if($remove){if( ! rename $pathfile,$trashFile){print OUTfail "$pa
    +thfile\n"; $failCNT++;}}
        $seen{"$md5:$filesize"} .= "\n $pathfile";
        $delCNT++;
        print " files: $fileCNT originals: $origCNT files to delete: $d
    +elCNT failed: $failCNT \r";
      }else{
        $seen{"$md5:$filesize"} = "$pathfile";
        printf OUT ("%32s\t%8d\t%s\t%s\n", $md5,$filesize,$pathfile,$file
    +);
        $origCNT++;
        print " files: $fileCNT originals: $origCNT files to delete: $d
    +elCNT failed: $failCNT \r";
      }
    }
    foreach my $key (keys %seen){
      print OUTdel " $seen{$key}\n";
    }
    print " files: $fileCNT originals: $origCNT files to delete: $delCNT
    + failed: $failCNT \n\n";

This code finds all the duplicate files and moves them to the ./trash directory, but what I want is:

1. The duplicate files to remain where they are, and
2. The duplicate files to be listed, with path and file size, in the ./tmp directory, without moving or removing any files.
Re^3: Find dup files and display the output of the script in /tmp dir by gmargo (Hermit) on Nov 30, 2009 at 14:17 UTC
If you're going to copy code from another node (Re^2: Find duplicate files.) then at least use the download link rather than copying and pasting with the mouse. Then you won't have all those plus signs in there, which show that you have not even compiled ("perl -c file...") the downloaded code. If you want to start with that code, all you'd have to do is change the `rename` statement so that it calls a non-damaging function instead. Then tweak to your heart's delight. Give It A Try.
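As a rough illustration of the change described above (a sketch, not gmargo's code), the `rename` call could be swapped for a helper that only writes the path and size to a report filehandle. The helper name `report_duplicate`, the tab-separated report format and the sample values are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical replacement for the rename call: it only records the
# duplicate in a report and never touches the file itself.
sub report_duplicate {
    my ($fh, $pathfile, $filesize) = @_;
    print {$fh} "$pathfile\t$filesize\n";
    return 1;
}

# Demo with an in-memory filehandle; a real run would open a report
# file in the desired directory instead.
my $report = '';
open my $out, '>', \$report or die "cannot open in-memory report: $!";
report_duplicate($out, 'pics/20041004kidscamera/im001020.jpg', 304048);
close $out;
print $report;
```

In the posted script the call would sit where `rename $pathfile,$trashFile` is now, with the report filehandle opened once before the `while` loop.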
Re^3: Find dup files and display the output of the script in /tmp dir by biohisham (Priest) on Nov 30, 2009 at 16:11 UTC
If you don't provide $ARGV[1] (i.e. 'remove') on the command line, the code will not remove any duplicate files. I've changed the design a little so that it reads fileMD5.txt directly instead of taking the file name through $ARGV[0]. I could not see why a key like `$seen{"$md5:$filesize"}` should be used, so I made the hash key the file name only, on the assumption that a duplicate file is identified by its name.

The test text file 'fileMD5.txt' contains the following data:

    8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im001020.jpg
    e01d4d804d454dd1fb 296663 pics/20041004kidscamera/im001021.jpg im001021.jpg
    8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im001020.jpg
    8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im001020.jpg
    8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im001020.jpg
    8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im001020.jpg
    e01d4d804d454dd1fb 296663 pics/20041004kidscamera/im001021.jpg im001021.jpg
    e01d4d804d454dd1fb 296663 pics/20041004kidscamera/im001021.jpg im001021.jpg
    8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im001020.jpg
    8e773d2546655b84dd 304048 pics/20041004kidscamera/im001020.jpg im001020.jpg
    e01d4d804d454dd1fb 296663 pics/20041004kidscamera/im001021.jpg im001021.jpg
    e01d4d804d454dd1fb 296663 pics/20041004kidscamera/im001021.jpg im001021.jpg

Here is just an illustration of a concept that you can incorporate: it prints the list of unique files to STDOUT and writes the list of duplicates to a temp file. My approach is open to criticism, modification and comments, for I am a learner myself. Best of luck:

    use strict;
    use warnings;
    my $fileCNT = 0;
    my $dupCNT = 0;
    my $origCNT = 0;
    my %seen;
    my %duplicates;
    open IN,"< fileMD5.txt" or die ".. we don't see a file to read\n";
    print " ... starting to read find duplicats in fileMD5.txt.....\n\n\n";
    while(<IN>){
      my $line = $_;
      chomp $line;
      $fileCNT++;
      my ($md5,$filesize,$pathfile,$file) = split /\s+/,$line;
      if(exists $seen{"$file"}){
        $dupCNT++;
        push @{$duplicates{"$file"}}, ([$pathfile,$filesize]);
      }else{
        push @{$seen{"$file"}}, ([$md5,$pathfile,$filesize]);
        $origCNT++;
      }
    }
    open (TEMP, '>>', "_Duplicate.temp") or die("can not create temporary file\n");
    print TEMP "FilePath",' ' x 40,"File Size\n\n";
    for my $key(keys %duplicates){
      my $lineNo=1;
      for my $info(@{$duplicates{$key}}){
        print TEMP "@{[$lineNo++]}:@$info\n\n\n";
      }
    }
    print "\U\t\toriginal files:\E\n\n";
    print "FileName",' 'x20,"Location",' 'x 20,"size\n";
    print "_" x 70,"\n";
    for my $key(keys %seen){
      my $lineNo = 1;
      for my $info(@{$seen{$key}}){
        print "@$info\n";
      }
    }
    print "\n\n\USummary\E:files: $fileCNT originals: $origCNT files duplicated: $dupCNT.\E\n\n";
    #use Data::Dumper;
    #print Dumper(\%duplicates);
    #print Dumper(\%seen);

Output to console:

    ... starting to read find duplicats in fileMD5.txt.....

    ORIGINAL FILES:

    FileName Location size
    ____________________________________________________________________
    8e773d2546655b84dd pics/20041004kidscamera/im001020.jpg 304048
    e01d4d804d454dd1fb pics/20041004kidscamera/im001021.jpg 296663

    SUMMARY:files: 12 originals: 2 files duplicated: 10.

Output to temp file:

    FilePath File Size

    1:pics/20041004kidscamera/im001020.jpg 304048
    2:pics/20041004kidscamera/im001020.jpg 304048
    3:pics/20041004kidscamera/im001020.jpg 304048
    4:pics/20041004kidscamera/im001020.jpg 304048
    5:pics/20041004kidscamera/im001020.jpg 304048
    6:pics/20041004kidscamera/im001020.jpg 304048
    1:pics/20041004kidscamera/im001021.jpg 296663
    2:pics/20041004kidscamera/im001021.jpg 296663
    3:pics/20041004kidscamera/im001021.jpg 296663
    4:pics/20041004kidscamera/im001021.jpg 296663

Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
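On the question of the hash key: one likely reason for keying on checksum and size is that two files with the same name in different directories need not have the same content, while identical content can be stored under different names. The sketch below groups paths under a `"$md5:$filesize"` key and reports only the groups with more than one member. The sample records, the short placeholder checksums and the names `%group` and `@report` are made up for illustration.

```perl
use strict;
use warnings;

# Illustrative records in the thread's layout: md5, size, path, file name.
# The checksums are placeholders, not real MD5 values.
my @records = (
    "aaa\t100\tdir1/a.jpg\ta.jpg",
    "aaa\t100\tdir2/copy_of_a.jpg\tcopy_of_a.jpg",   # same content, different name
    "bbb\t200\tdir3/a.jpg\ta.jpg",                   # same name, different content
);

# Group every path under a key built from checksum and size.
my %group;
for my $line (@records) {
    my ($md5, $filesize, $pathfile, $file) = split /\t/, $line, 4;
    push @{ $group{"$md5:$filesize"} }, $pathfile;
}

# A group holding more than one path is a set of duplicates.
my @report;
for my $key (sort keys %group) {
    my @paths = @{ $group{$key} };
    next if @paths < 2;
    my (undef, $filesize) = split /:/, $key;
    push @report, "$_\t$filesize" for @paths;
}
print "$_\n" for @report;
```

With these records, keying on the file name alone would pair dir1/a.jpg with dir3/a.jpg even though their checksums differ, and would miss dir2/copy_of_a.jpg.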