Duplicates in Directories

by kel (Sexton)
on Oct 09, 2017 at 08:02 UTC ( [id://1200969] )

kel has asked for the wisdom of the Perl Monks concerning the following question:

To begin with, I have directories with thousands of files, such as etexts. Many of them are duplicates, but IN DIFFERENT FILE FORMATS.

For example, a directory /FOO with 10 thousand files might have:
baz.txt
baz.epub
baz.doc
baz.pdf
bar.epub
boo.epub
boo.txt

Now, my first priority is, say, epub.

I have written a script that:
1. Parses all the files in the current directory (or any specified one) and dumps their file.ext names into an array, say @allfiles.
2. I also parse individual extensions into their own arrays, say @txt, @doc, @pdf, @epub.
3. Then I parse an extension array, say @pdf, against @allfiles, select only the pdf filenames that have a corresponding epub, and move them to a subdir.

The script works, but raises some questions:

a. It is glacially slow. It takes a few hours to parse 10k+ files. Is there a more efficient method? Or a module that can handle this type of operation? I have to humbly admit, I do not even know what this type of process is called. I am probably overlooking something painfully obvious here.

b. In the extension parsing process I filter out extraneous characters and spaces so that $a exactly equals $b. if ($a eq $b) - works, as expected.
But (if $a =~ /$b/ ) does not. Is there something I am overlooking here? I prefer matching to equality, as some 'dups' might have minor variations in characters.

Many thanks in advance :)

K

Replies are listed 'Best First'.
Re: Duplicates in Directories
by swl (Parson) on Oct 09, 2017 at 10:21 UTC

    If I'm reading your description correctly, you are passing over the list of files two or three times (possibly doing steps 1 & 2 in one pass). Step 3 also does a linear search over the file names, which is going to go quadratic in terms of processing.

    Maybe try to get all the info in one pass, store the details in a hash structure, and then iterate over that hash. That will take care of the need for exact matches and avoid linear searches.

    use strict;
    use warnings;

    # I assume the actual code will obtain this list using
    # glob or similar
    my @allfiles = qw /
        baz.txt
        baz.epub
        baz.doc
        baz.pdf
        bar.epub
        boo.epub
        boo.txt
    /;

    my %by_ext;
    # %by_fbase is actually redundant below,
    # but is maybe useful for other things
    # so I have left it in
    my %by_fbase;

    foreach my $file (@allfiles) {
        # should use a proper filename parser here
        # like File::Basename, but a split will serve
        # for the purposes of an example.
        my ($name, $ext) = split /\./, $file;
        $by_ext{$ext}{$name}   = $file;
        $by_fbase{$name}{$ext} = $file;
    }

    foreach my $name (keys %by_fbase) {
        #print "$name\n";
        no autovivification;
        # could use exists in this check if you want to avoid autoviv,
        # but file names should evaluate to true if they have
        # an extension, even if the name part evaluates to false
        if ($by_fbase{$name}{pdf} && $by_fbase{$name}{epub}) {
            # do stuff
            print "$name has epub and pdf extensions: "
                . "$by_fbase{$name}{epub} $by_fbase{$name}{pdf}\n";
            # now do stuff like moving files since you can iterate over
            # the values of the relevant subhash
            foreach my $file (values %{$by_fbase{$name}}) {
                print "now do something to $file\n";
            }
        }
    }
    That code prints:
    baz has epub and pdf extensions: baz.epub baz.pdf
    now do something to baz.epub
    now do something to baz.doc
    now do something to baz.txt
    now do something to baz.pdf

    Update: Edited incomplete comment starting with "now do stuff"

      First, thank you all for your suggestions. The problem has been one of algorithm: I am iterating over a @selectfiles array inside an @allfiles loop, and parsing for equality conditions.

      As the actual code is over 300 lines, I have included an edited snippet. This code is derived from an earlier script where I needed to parse for regexes in files, not necessarily exact matches, and not necessarily at the beginning. Parsing @selectexpr against @allfiles made sense there.

      Hashes are an excellent idea. With them I can parse foo-bar-baz.doc as a hash directly against all foo keys, with proper splitting and filtering, of course. This would allow me to scale up more efficiently.

      I would however prefer, if possible, to keep the matching to a regexp rather than an equality operator.


      Please ignore syntax errors in the code below; it has been abbreviated.

      if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Does NOT work
      if ($mymobi eq $myepub)      {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Works

      For an author-title pair, the matching would be done in the title (value) portion rather than the key, which would be expected to be identical (though there might be exceptions).

      I need to hit the books on hashes here, as I haven't really dealt much with them outside of a 20,000+ listing database with about two dozen hash fields.

      opendir(DIR, $dir2 ) or die $!;
      while ( $file = readdir(DIR)) {
          if (-f $file) {                 # read only files
              chomp($file);
              $file =~ s/^\s+|\s+$//g;
              $filenam = "" ;
              push ( @srcarray, $file) ;
              if ($file =~ m/\.mobi$/ig ) { &typefiles($file, "mobifile"); }
              if ($file =~ m/\.azw3$/ig ) { &typefiles($file, "azw3file"); }

      sub typefiles( $tfile , $filetype ) {
          ($tfile, $filetype ) = @_ ;
          if ($filetype eq "mobifile" ) {
              push ( @mobiarray, $file) ;
          }
      # End mobifiles

      # Main body - parsing directory listing and performing actions
      foreach $authf (@srcarray){
          if ($authf =~ m/\.pl$/) { next; }
          if ($authf =~ m/\.epub/ig ) {
              our $authf2 = $authf ;
              foreach my $myfilt (@mobiarray){
                  my $mymobi = $myfilt;
                  my $myepub = $authf2;
                  $mymobi = &extfilter($mymobi);
                  $myepub = &extfilter($myepub);

      sub extfilter($line) {
          ($line) = @_;
          $line =~ s/\.mobi//ig ;
          $line =~ s/\.epub//ig ;
          $line =~ s/^\s+|\s+$//g;
          $line = lc $line;
          return $line;
      }
        if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Does NOT work
        if ($mymobi eq $myepub)      {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Works

        No sample data means no solution. Here's the SSCCE you could have provided:

        use strict;
        use warnings;
        use Test::More tests => 2;

        my $mymobi = 'Hello World!';
        my $myepub = 'Hello World!';

        ok ($mymobi =~ m/($myepub)/);
        ok ($mymobi eq $myepub);

        See how both the string equality and the regular expression matches are true? So they both "work". Your task is now to provide the values for $mymobi and $myepub for which one or other doesn't match. At that point it should become clear to you what the difference between an exact string match and a regular expression match is (and why one or the other is preferable in different situations - because they serve different purposes).

        You're welcome. However, it is unclear to me why, given you want to use regexp matching, your regexp match apparently does not work and exact equality does:

        if ($mymobi =~ m/($myepub)/) {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Does NOT work
        if ($mymobi eq $myepub)      {print "DUPLICATE FOUND !\n" ; &movetodir($myfilt,$dupdir ); } #Works

        The regexp will match anything containing your title, so for a title like "blert" you will be matching all of "blert", "blertblartblort", "foobarblertbaz" etc. Perhaps you need to filter the file names for possible partial matches when you read them? Or if you know there are spelling errors then have a look at Text::Fuzzy and similar. Even then you would perhaps be best to flag them somewhere for cleanup or modification before automated processing.
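        As a small illustration of that point (just a sketch reusing the variable names from the quoted snippet; anchoring does not give fuzzy matching, it merely makes the regexp behave like eq):

        # A bare interpolated pattern is a substring match:
        print "matches\n"   if 'foobarblertbaz' =~ /blert/;      # true

        # Anchoring and escaping the interpolated title makes the match
        # exact and safe against regexp metacharacters in the title:
        print "duplicate\n" if $mymobi =~ /\A\Q$myepub\E\z/;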

        Some other points are:


        There is no need to call your subroutines using the &foo() notation unless your perl is very old. foo() will work fine in your case.


        You seem not to really be using subroutine signatures, so

        sub typefiles( $tfile , $filetype ) {
            ($tfile, $filetype ) = @_ ;
            #etc...
        }

        can simply be

        sub typefiles {
            ($tfile, $filetype ) = @_ ;
            # etc...
        }
Re: Duplicates in Directories
by hippo (Bishop) on Oct 09, 2017 at 08:16 UTC
    It is glacially slow. It takes a few hours to parse 10k+ files.

    There is nothing in your description of your algorithm which suggests it should take anything more than a few minutes at the most to run and move 10k files, unless where you are moving them to is on a different filesystem or there's some other important detail which has been omitted. Did you profile your script to find where the bottleneck is?
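    For instance, running the script as perl -d:NYTProf yourscript.pl and then running nytprofhtml on the resulting nytprof.out would give an HTML report showing exactly where the time goes (Devel::NYTProf is just one option; the script name is a placeholder).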

    if ($a eq $b) - works, as expected. But (if $a =~ /$b/ ) does not.

    Define "does not". One is a string equality while the other is a regular expression match. They are designed to perform different tasks. Have a read through How to ask better questions using Test::More and sample data which might help us to help you. Clearly, as you have written them here, the second expression has dodgy parenthetical syntax, but that's probably just a typo in your post.

Re: Duplicates in Directories
by haukex (Archbishop) on Oct 09, 2017 at 08:17 UTC
    Is there a more efficient method?

    Probably, but without seeing any code we'd just be guessing. Perhaps you've got some nested loops that could be implemented more efficiently, like with a hash lookup. If you could post a Short, Self-Contained, Correct Example, we could make suggestions for optimizing it.
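    As a rough sketch of the hash-lookup idea (the @epub and @pdf arrays here are only placeholders for whatever per-extension lists your script builds):

    use strict;
    use warnings;
    use File::Basename qw(fileparse);

    my @epub = qw( baz.epub bar.epub boo.epub );   # placeholder lists --
    my @pdf  = qw( baz.pdf  quux.pdf );            # the real script builds these

    # index the base names that have an .epub, then a single grep
    # over @pdf replaces the inner loop
    my %has_epub = map  { lc( (fileparse($_, qr/\.epub/i))[0] ) => 1 } @epub;
    my @dup_pdfs = grep { $has_epub{ lc( (fileparse($_, qr/\.pdf/i))[0] ) } } @pdf;

    print "@dup_pdfs\n";   # baz.pdf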

    I prefer matching to equality, as some 'dups' might have minor variations in characters.

    It sounds like that is not something you will be able to do with regexes, but there are Perl modules available to help you, see for example Edit distance between two strings for some suggestions.
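    For example, with Text::Levenshtein (one of several modules that compute edit distance; the threshold of 2 edits here is arbitrary):

    use strict;
    use warnings;
    use Text::Levenshtein qw(distance);

    # titles within a couple of edits of each other are probable dups
    my $d = distance('the old man and the sea', 'the old man and teh sea');
    print "probable duplicate (edit distance $d)\n" if $d <= 2;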

Re: Duplicates in Directories
by Marshall (Canon) on Oct 09, 2017 at 21:26 UTC
    I don't know why this is taking so long.

    Even on Windows XP, a directory with 10K or 20K files is no big deal. With the Windows NTFS file system there are very good reasons not to put a lot of files in the "root C:\" directory, but you aren't doing that. A sub-directory can have 50K files with no problem.

    I would read the "target directory" and then code what I call an "execution plan". Moving files is a "destructive operation" because it modifies the input data. Copying files is not destructive, but takes longer.

    Anyway, I would code the basic algorithm and leave the actual file moving or copying to a final step. I often code a constant like use constant ENABLE_MOVE => 0; I run the code to make sure that it is going to do what I want before I turn that variable "on".

    Your code should just take some seconds to decide what to do. Take the actual move or copy out of the equation until you have an efficient algorithm. Below I just print an intention of what would happen. Get that working efficiently then "turn on" the actual file operation(s).

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dump qw(pp);

    $|=1;   # turn off buffering to stdout for debugging

    my %HoH;   # {extension}{name}
    while (my $full_name = <DATA>) {
        next if $full_name =~ /^\./;   # skip names beginning with dot
        my ($name, $ext) = $full_name =~ /([\w.]+)\.(\w+)$/;
        next unless defined $ext;      # skip bare names without .extension
        $HoH{$ext}{$name} = 1;
    }
    pp \%HoH;

    foreach my $pdf_file (keys %{$HoH{pdf}}) {
        if (exists $HoH{epub}{$pdf_file}) {
            print "do something with $pdf_file.pdf and $pdf_file.epub\n";
        }
    }

    =prints
    {
      doc  => { baz => 1 },
      epub => { bar => 1, baz => 1, boo => 1 },
      pdf  => { baz => 1 },
      txt  => { "baz" => 1, "boo" => 1, "some.long.name" => 1 },
    }
    do something with baz.pdf and baz.epub
    =cut

    __DATA__
    .
    ..
    some.long.name.txt
    baz.txt
    baz.epub
    baz.doc
    baz.pdf
    bar.epub
    boo.epub
    boo.txt
    barefile
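    A minimal sketch of the final "turn it on" step, reusing $pdf_file from the loop above (File::Copy's move is assumed for the actual operation, and $dupdir is a hypothetical destination directory):

    use constant ENABLE_MOVE => 0;   # flip to 1 once the dry run looks right
    use File::Copy qw(move);

    if (ENABLE_MOVE) {
        move("$pdf_file.pdf", "$dupdir/$pdf_file.pdf")
            or warn "could not move $pdf_file.pdf: $!";
    }
    else {
        print "would move $pdf_file.pdf to $dupdir/\n";
    }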
Re: Duplicates in Directories
by tybalt89 (Monsignor) on Oct 11, 2017 at 16:20 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1200969
    use strict;
    use warnings;

    # item 3. for each NAME.epub move the corresponding NAME.pdf to subdir/
    s/epub$/pdf/ and rename $_, "subdir/$_" for <*.epub>;
Re: Duplicates in Directories
by Anonymous Monk on Oct 09, 2017 at 08:15 UTC
    Directories don't like to have 10k files; that's why it's slow.
      Modern filesystems can handle big directories just fine. But hey, maybe the OP is running Windows 3.1.

        Modern filesystems can handle big directories just fine. But hey, maybe the OP is running Windows 3.1.

        What is modern? What is just fine?

Re: Duplicates in Directories
by Anonymous Monk on Oct 09, 2017 at 14:17 UTC

    I think that you are overlooking an obvious optimization. If the file-names come to you in alphabetical order, as they most-commonly do, then all occurrences of baz.anything will necessarily be consecutive. Simply split each filename by "." into two pieces (filename, extension), and notice if the name is different from the previous filename you encountered (or if it is the very first one). In this case, test the list of extensions that you had been accumulating to see if both .epub and .pdf are present in that list, then reset the list. You never have to "search" for anything, nor do you ever need to store more than two names: "this" one, and the "immediately previous" one.
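    A rough sketch of that single pass, assuming the directory listing is cheap to sort (names and the exact grouping logic are illustrative only):

    use strict;
    use warnings;

    my @allfiles = glob '*';   # or readdir -- however the list is obtained

    my $prev = '';
    my %exts;
    for my $file (sort @allfiles) {
        my ($name, $ext) = $file =~ /\A(.+)\.([^.]+)\z/ or next;
        if ($name ne $prev) {
            print "$prev has both epub and pdf\n" if $exts{epub} && $exts{pdf};
            %exts = ();
            $prev = $name;
        }
        $exts{lc $ext} = 1;
    }
    # the same check once more for the final group (see the follow-up below)
    print "$prev has both epub and pdf\n" if $exts{epub} && $exts{pdf};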

      If the file-names come to you in alphabetical order, as they most-commonly do, then all occurrences of baz.anything will necessarily be consecutive.

      This is not something one should rely on, it varies wildly depending on OS and API used to list the files. And even if one sorted the list, it would not help for the OP's second requirement, "some 'dups' might have minor variations in characters".

      Oops, didn't see Corion's node before posting.

      Whatever you mean by "most-commonly" it is certainly not something that a program should rely on.

      Luckily for you and the OP, Perl comes with a built-in facility to sort a list of filenames. Using this facility makes it easy to ensure that a list is sorted.
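      For instance (a trivial sketch; the lc keeps the ordering case-insensitive, which matters for the grouping idea above):

      my @sorted = sort { lc($a) cmp lc($b) } @allfiles;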

        Well, for a given operating system (and possibly file system), either they do come in alphabetical order, or they don't. If they do for the OP's system, then presumably the OP can rely on that feature (although there can be some issues with the upper or lower case of the file names). And, BTW, I've just checked on the three different systems available to me (*nix, VMS and Windows): glob returned the names of the files in the directory in alphabetical order for all three of them, so, yes, it is a rather common feature.

        Then, of course, as you rightly said, if they don't come in alphabetical order, or if there is any doubt, it is just as easy to use the Perl sort facility, and it will only take a split second with 10,000 files.

        The idea of sorting data to get better performance (avoiding lookups) is sometimes very efficient. I'm doing it quite commonly in a slightly different context, to compare pairs of very large files that would not fit in a hash: sorting both files on the comparison key (using the *nix sort utility), and then reading both files in parallel in my Perl program to detect records missing from either file or differences in attributes of records having the same comparison key.

      Auto-logged-out again. Oh well. Just one more thing: remember to do this once more when you reach the end of the file, because that is of course what marks the end of the final group of names.

        Oh yeah, and one more thing: this strategy assumes a case-sensitive file system or that the case of the file-names does not actually vary. Or that you have the easy means to obtain a case-insensitive sorted filename list. But the essential algorithm remains the same. The filesystem's doing all the hard work for you.
