Zoop has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

There are two directories (e.g. alpha and gamma), each containing numerous XML files. Each XML file in alpha corresponds to a file in the gamma directory. The file names in the two directories follow a pattern, e.g.

File in alpha -> abc_12342_tick.xml.alphaprod

File in gamma -> abc_12342_tick.xml.gammatest

What I am trying to achieve is a hash with these file names paired. Here is what I did. @alphafiles and @gammafiles contain the file lists from the alpha and gamma directories.

@hash{@alphafiles} = ();
foreach (keys %hash) {
    my $patt = substr($_, 0, index($_, "xml"));
    @matched = grep(/^$patt/, @gammafiles);
    $hash{$_} = $matched[0];
}
This code works fine for a small set of files, but for a large set the number of iterations grows and it becomes extremely slow: it takes about 90 minutes to pair up 30000 files in each directory. I also tried a simpler approach, sorting the @alphafiles and @gammafiles lists first and then pairing them by position, like below

@sortalpha = sort @alphafiles;
@sortgamma = sort @gammafiles;
$i = 0;
while ($i <= $#sortalpha) {
    $hash{$sortalpha[$i]} = $sortgamma[$i];
    $i++;
}
This method takes hardly a couple of minutes, but I doubt the reliability of this approach. I need to make sure I pair the correct files from the two directories. I seek your wisdom. Is there a way to apply a regex while sorting the arrays, so that the files end up at the correct positions before I pair them up?

Thanks in advance

/zoop

'When You starve With A Tiger, The Tiger always starves last'

Replies are listed 'Best First'.
Re: File pairing from different directories
by Corion (Patriarch) on Sep 07, 2009 at 11:56 UTC

    Your first approach is almost there, except that you're looking through the wrong list, repeatedly, which is what drives up your runtime. Basically, what you want is called the "symmetric difference" between two lists, and perlfaq4 (or perldoc -q difference) tells you the common approach.

    All that's left for you to do is to canonicalize the names to a common form. For example in your case, abc_12342_tick could be a good common form from which both forms can be derived.
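
    A minimal sketch of such a canonicalization, assuming every name has the form <base>.xml.<environment> and that the environment suffix contains no dots. The helper name canonical_name is hypothetical and does not come from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: strip the trailing ".xml.<environment>" part,
# leaving only the portion that both directories share.
sub canonical_name {
    my ($file) = @_;
    (my $key = $file) =~ s/\.xml\.[^.]+\z//;
    return $key;
}

print canonical_name('abc_12342_tick.xml.alphaprod'), "\n";   # abc_12342_tick
print canonical_name('abc_12342_tick.xml.gammatest'), "\n";   # abc_12342_tick
```

    Any transformation works as long as it maps both variants of a name onto the same string.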

    The code in perlfaq4 is:

    @union = @intersection = @difference = ();
    %count = ();
    foreach $element (@array1, @array2) { $count{$element}++ }
    foreach $element (keys %count) {
        push @union, $element;
        push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
    }

    Personally, I like to know whether an element is only in the left or only in the right side of the comparison, so I usually use code like this:

    my (%left, %right, %common);

    # Mark every element of the first list as "left only" to begin with.
    @left{ @array1 } = (1) x @array1;

    foreach my $element (@array2) {
        if (exists $left{ $element }) {
            # Seen on both sides: move it from %left to %common.
            $common{ $element } = delete $left{ $element };
        } else {
            $right{ $element } = 1;
        }
    }

    print "Keys only on the left side:\n";
    print "$_\n" for keys %left;

    print "Keys only on the right side:\n";
    print "$_\n" for keys %right;

    print "Keys found on both sides:\n";
    print "$_\n" for keys %common;

    This only works if you don't have duplicates. The code for handling multiple keys isn't difficult, but it is longer and detracts from the main logic.
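
    One possible way to cope with duplicates, sketched here as an assumption rather than as the code alluded to above, is to count occurrences per side instead of storing a plain flag. The sample lists are made up for illustration:

```perl
use strict;
use warnings;

# Made-up sample data containing duplicates on both sides.
my @array1 = qw(a b b c);
my @array2 = qw(b c c d);

# Count how often each element occurs on each side.
my (%left, %right);
$left{$_}++  for @array1;
$right{$_}++ for @array2;

# Walk the union of keys and report both counts.
my %seen = (%left, %right);
for my $key (sort keys %seen) {
    my $l = $left{$key}  || 0;
    my $r = $right{$key} || 0;
    printf "%s: %d on the left, %d on the right\n", $key, $l, $r;
}
```

    An element with a count above 1 on either side is a duplicate that needs a decision about which copy to pair.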

      Humble Mönch,

      Thanks for the detailed explanation and the useful pointers. To be honest, I never realized that so much is already covered in perlfaq, while I keep wasting my time googling around.

      I have a question. When you say "canonicalize the names to a common form", do you mean determining a pattern that is then used to identify the file names in the array? If so, I am already doing that:

      my $patt=substr($_,0,index($_,"xml"));
      and then I grep for this pattern in the other array. From the code in perlfaq I get the difference, the union, and the intersection of the two arrays. But if I have to pair up the values from alpha and gamma, like hash{abc_12342_tick.xml.alphaprod} = abc_12342_tick.xml.gammaprod, I will still have to grep for the pattern in one of the arrays.

      Hope I am not badly missing something here

        By "canonicalize the names to a common form", I mean to convert the names into a common form, throwing away all the parts that identify the environment the files came from. In your case, that means throwing away at least .alphaprod and .gammaprod. Your substring approach does that (and a bit more) - it will throw away xml.alphaprod and xml.gammaprod, which should be OK for your purpose.

        You don't need to use grep, which will look at every element in the other list. By putting each list into a separate hash keyed by the canonical name (I usually call them %left and %right, with %left holding the elements from the left side of the comparison), you let Perl check directly whether an element exists instead of scanning all elements.
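
        Applied to the file lists from the question, the idea could look roughly like the sketch below. The sample file names, the canonical_name helper, and the %gamma_by_key and %pair hashes are illustrative assumptions; the canonicalization reuses the substr/index approach from the original post and so assumes every name contains "xml" exactly where the shared part ends:

```perl
use strict;
use warnings;

# Sample data standing in for the real directory listings.
my @alphafiles = qw(abc_12342_tick.xml.alphaprod abc_99_tick.xml.alphaprod);
my @gammafiles = qw(abc_12342_tick.xml.gammatest xyz_1_tick.xml.gammatest);

# Same canonicalization as in the original post: everything before "xml".
sub canonical_name {
    my ($file) = @_;
    return substr($file, 0, index($file, 'xml'));
}

# Index the gamma files by canonical name, once.
my %gamma_by_key;
$gamma_by_key{ canonical_name($_) } = $_ for @gammafiles;

# One hash lookup per alpha file instead of one grep over @gammafiles.
my %pair;
for my $alpha (@alphafiles) {
    my $key = canonical_name($alpha);
    if (exists $gamma_by_key{$key}) {
        $pair{$alpha} = $gamma_by_key{$key};
    } else {
        warn "No gamma file for $alpha\n";
    }
}

print "$_ => $pair{$_}\n" for sort keys %pair;
```

        Each list is walked once, so the work grows with the number of files rather than with the product of the two list sizes, and unmatched alpha files are reported instead of being silently paired with a neighbour as in the sort-and-zip approach.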