cardozo has asked for the wisdom of the Perl Monks concerning the following question:

Oh great keepers of Perl Wisdom. I come before you humbly today with a question of strategy rather than tactics.

I have a very large number of files (15,000+) in several directories. The name of each file contains information that I need to process, like:

name-country-language-date.pdf

Currently I end up going through the whole list many times, and it's taking forever.

First I go through and put all of the name entries into a hash. But then for each entry I have to look again to see which files go together with it (files go together by name+country+language, and differ by date).

I do make sure I don't go back over files that have already been looked at, but that doesn't speed things up much at all.

The end result of all this should be a hash with the identifying elements of the file as the key, and as the value an array of all the files that fall under that key, in date order.
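In other words, something like this (the filenames here are hypothetical, for illustration only):

    # target structure, with hypothetical filenames:
    %files = (
        'smith-us-en' => [                  # name-country-language key
            'smith-us-en-20030106.pdf',     # newest first
            'smith-us-en-20021215.pdf',
        ],
        # ...
    );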

Here's how it goes: I go through the output of readdir and match on:
/([a-z0-9]*?-[a-z]{2}-[a-z]{2,3})-(\d{8})(-eol)?\.(pdf|html)$/
Then I open the directory again and look for files that match $1-$2-$3.
Then I reverse-sort those by the date in the filename, build an array from them, and put it into a hash, with $1-$2-$3 as the key and the array as the value.
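
For concreteness, that process amounts to something like this (a sketch; the directory name 'docs' and the handle names are assumptions):

    use strict;
    use warnings;

    my %groups;
    opendir DIR, 'docs' or die "Can't open docs: $!";
    while ( my $file = readdir DIR ) {
        next unless $file =~ /([a-z0-9]*?-[a-z]{2}-[a-z]{2,3})-(\d{8})(-eol)?\.(pdf|html)$/;
        my $key = $1;
        next if exists $groups{$key};   # skip keys already grouped

        # second pass: reopen the directory to collect this key's siblings
        opendir SCAN, 'docs' or die "Can't open docs: $!";
        my @matches = grep { /^\Q$key\E-\d{8}/ } readdir SCAN;
        closedir SCAN;

        # reverse sort puts the newest date first, since the names
        # within a group differ only in the date portion
        $groups{$key} = [ sort { $b cmp $a } @matches ];
    }
    closedir DIR;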

I know this is inefficient, but I'm at a loss as to what to do better. Any ideas?

Obviously after reading this tale, you'll know that I'm unworthy to receive your assistance, but I beg to receive it.

Re: Need help with efficient processing
by jdporter (Paladin) on Jan 06, 2003 at 20:12 UTC
    my %hash;
    for ( readdir DIR )   # or wherever your filenames come from
    {
        if ( /^([a-z0-9]*?-[a-z]{2}-[a-z]{2,3})-(\d{8})(-eol)?\.(pdf|html)$/ )
        {
            push @{ $hash{$1} }, $_;
        }
    }
    And there you have it.

    Now, if you want to access the keys and filenames in sorted order, you could do --
    for my $key ( sort keys %hash )
    {
        print "Files for $key:\n";
        for my $file ( sort @{ $hash{$key} } )
        {
            print "\t$file\n";
        }
    }

    jdporter
    The 6th Rule of Perl Club is -- There is no Rule #6.

Re: Need help with efficient processing
by talexb (Chancellor) on Jan 06, 2003 at 20:19 UTC

    I don't know if this is feasible in your situation, but you could do something to make it easier for the OS by rearranging your files.

    The simplest version is to put each file in a sub-directory named after the first letter of the file's name. So A* gets moved to ./A, and so forth. If you want to go to a second level, you can do that too.

    Rewrite your script to use this structure and you will see a performance improvement.
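
    A minimal sketch of the one-time rearrangement, assuming a flat directory named 'docs' and one level of sharding by the first character of the filename:

    use strict;
    use warnings;
    use File::Copy 'move';

    my $dir = 'docs';
    opendir my $dh, $dir or die "Can't open $dir: $!";
    for my $file (readdir $dh) {
        next unless -f "$dir/$file";             # skip . and .. and subdirs
        my ($first) = $file =~ /^(\w)/ or next;  # shard by first character
        mkdir "$dir/$first" unless -d "$dir/$first";
        move("$dir/$file", "$dir/$first/$file")
            or warn "Couldn't move $file: $!";
    }
    closedir $dh;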

    CPAN's directory structure follows this strategy, as does Yahoo, in some parts of its service.

    --t. alex
    Life is short: get busy!
Re: Need help with efficient processing
by gjb (Vicar) on Jan 06, 2003 at 20:19 UTC

    I may be missing the point here, but why not simply build the hash as you go over the directory once?

    foreach (readdir(DIR)) {
        if (/^([a-z0-9]*?-[a-z]{2}-[a-z]{2,3})-(\d{8})(-eol)?\.(pdf|html)$/) {
            push(@{$files{$1}}, $2);
        }
    }
    This can then be sorted in a second pass:
    foreach (keys %files) {
        $files{$_} = [sort @{$files{$_}}];
    }

    Hope this helps, -gjb-

Re: Need help with efficient processing
by pfaut (Priest) on Jan 06, 2003 at 20:19 UTC

    You should just push the file name into the array referenced by the hash key when you first find the file. There's no need to go back and read the directories a second time. You can then sort the entries in the array when you process the contents of the hash. If you need the arrays to be sorted in the hash, then just go through your hash keys after directory processing and sort each array.

    You might also consider using File::Find or File::Find::Rule instead of readdir.
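
    For instance, a single File::Find pass over all the directories could build and sort the whole structure in one go (a sketch; the @dirs list and sorting by the date embedded in the filename are assumptions):

    use strict;
    use warnings;
    use File::Find;

    my @dirs = ('dir1', 'dir2');    # the directories to scan
    my %files;
    find( sub {
        return unless /^([a-z0-9]*?-[a-z]{2}-[a-z]{2,3})-(\d{8})(-eol)?\.(pdf|html)$/;
        push @{ $files{$1} }, $File::Find::name;    # full path to the file
    }, @dirs );

    # one sort per group, newest date first
    for my $key (keys %files) {
        $files{$key} = [ sort {
            my ($da) = $a =~ /(\d{8})(?:-eol)?\.(?:pdf|html)$/;
            my ($db) = $b =~ /(\d{8})(?:-eol)?\.(?:pdf|html)$/;
            $db <=> $da;
        } @{ $files{$key} } ];
    }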

    --- print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';
Re: Need help with efficient processing
by blokhead (Monsignor) on Jan 06, 2003 at 20:26 UTC
    Why reopen the directory again to look for files matching $1-$2-$3? You are going to see those filenames eventually. Just populate the hash as you go, and let things fall into place. Use a 2nd layer of hashing with the date as the key, then you can sort on the date later as you need. You may want to consider doing a deep hash, with each part of the filename acting as a key at some level. You can still get sorted retrievals given the first 3 components as a key.
    my %files;
    foreach (readdir(MYDIR)) {
        if (/([^-]+)-([^-]+)-([^-]+)-([^-]+)\.(pdf|html)/) {
            # using temp variables for illustrative purposes
            my ($name, $country, $language, $date) = ($1, $2, $3, $4);
            $files{$name}{$country}{$language}{$date} = $_;
            # alternatively, for a less nested hash, you could do
            # $files{ join '-', $name, $country, $language }{$date} = $_;
        }
    }

    #### later

    sub get_files_by_key {
        my ($name, $country, $language) = @_;
        my $hash_ref = $files{$name}{$country}{$language};
        # alternatively
        # my $hash_ref = $files{ join '-', $name, $country, $language };
        # in any case, sorting them here is easy
        return map { $hash_ref->{$_} } sort keys %$hash_ref;
    }

    my @relevant_files = get_files_by_key('foo', 'us', 'en');
    "Obviously after reading this tale, you'll know that I'm unworthy to receive your assistance, but I beg to receive it."
    Hey, this isn't the Internet Oracle. No need to supplicate!

    ZOT,

    blokhead