in reply to File pairing from different directories

Your first approach is almost there, except that you're looking through the wrong list, repeatedly, which is what drives up your runtime. Basically, what you want is called the "symmetric difference" between two lists, and perlfaq4 (or perldoc -q difference) tells you the common approach.

All that's left for you to do is to canonicalize the names to a common form. For example, in your case abc_12342_tick could be a good common form, because the file names from both directories can be reduced to it.
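
A minimal sketch of that canonicalization step, assuming names of the form abc_12342_tick.xml.alphaprod and abc_12342_tick.xml.gammaprod. The helper name canonical_name and the regex are illustrative assumptions, not something taken from your code:

```perl
use strict;
use warnings;

# Hypothetical helper: strip the ".xml.<environment>" suffix so that
# names from both directories reduce to the same key.
sub canonical_name {
    my ($name) = @_;
    (my $key = $name) =~ s/\.xml\.\w+\z//;
    return $key;
}

# Both calls print abc_12342_tick
print canonical_name('abc_12342_tick.xml.alphaprod'), "\n";
print canonical_name('abc_12342_tick.xml.gammaprod'), "\n";
```

Names that do not end in .xml.<something> pass through unchanged, so you may want to warn about those separately.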

The code in perlfaq4 is:

    @union = @intersection = @difference = ();
    %count = ();
    foreach $element (@array1, @array2) { $count{$element}++ }
    foreach $element (keys %count) {
        push @union, $element;
        push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
    }

Personally, I like to know whether an element is only in the left or only in the right side of the comparison, so I usually use code like this:

    my (%left, %right, %common);

    # Mark every element of @array1 as "seen on the left".
    @left{ @array1 } = (1) x @array1;

    foreach my $element (@array2) {
        if (exists $left{ $element }) {
            # Seen on both sides: move it from %left to %common.
            $common{ $element } = delete $left{ $element };
        } else {
            $right{ $element } = 1;
        }
    }

    print "Keys only on the left side:\n";
    print "$_\n" for keys %left;

    print "Keys only on the right side:\n";
    print "$_\n" for keys %right;

    print "Keys found on both sides:\n";
    print "$_\n" for keys %common;

This only works if neither list contains duplicates. The code for handling duplicate keys isn't difficult, but it is longer and would distract from the main logic.
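
For completeness, here is one way a duplicate-tolerant variant might look. It is a sketch rather than a drop-in replacement: it counts how often each element occurs on each side instead of storing a plain flag. The sample data and all variable names are made up for illustration:

```perl
use strict;
use warnings;

# Made-up sample data; 'b' is duplicated on the left, 'c' on the right.
my @array1 = qw(a b b c);
my @array2 = qw(b c c d);

# Count occurrences per side instead of storing a plain flag.
my (%left_count, %right_count);
$left_count{$_}++  for @array1;
$right_count{$_}++ for @array2;

# Classify each distinct element; every test is a single hash lookup.
my (@left_only, @right_only, @common);
for my $element (sort keys %left_count) {
    if (exists $right_count{$element}) {
        push @common, $element;
    } else {
        push @left_only, $element;
    }
}
for my $element (sort keys %right_count) {
    push @right_only, $element unless exists $left_count{$element};
}

print "Only left:  @left_only\n";
print "Only right: @right_only\n";
print "Both:       @common\n";
```

The counts stay available in %left_count and %right_count, so you can also report elements that occur a different number of times on each side.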

Re^2: File pairing from different directories
by Zoop (Acolyte) on Sep 08, 2009 at 13:30 UTC
    Humble Mönch,

    Thanks for the detailed explanation and the useful pointers. To be honest, I never realized that so much of what I keep wasting time googling for is already in perlfaq.

    I have a question. When you say "canonicalize the names to a common form", does that mean determining a pattern that is used to identify the file names in the array? If so, I am already doing that:

    my $patt=substr($_,0,index($_,"xml"));
    and then I grep for this pattern in the other array. The code in perlfaq gives me @difference, the union of both arrays and their intersection. But if I have to pair up the values from alpha and gamma, like $hash{'abc_12342_tick.xml.alphaprod'} = 'abc_12342_tick.xml.gammaprod', I will still have to grep for the pattern in one of the arrays.

    I hope I am not badly missing something here.

      By "canonicalize the names to a common form", I mean to convert the names into a common form, throwing away all the parts that identify the environment the files came from. In your case, that means throwing away at least .alphaprod and .gammaprod. Your substring approach does that (and a bit more) - it will throw away xml.alphaprod and xml.gammaprod, which should be OK for your purpose.

      You don't need grep, which looks at every element of the other list each time you call it. If you put each list into its own hash, keyed on the canonical name (I usually call them %left and %right, with %left holding the elements on the left side of the comparison), Perl can check directly whether an element exists instead of scanning the whole list.
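
      To make that concrete, here is a rough sketch of the pairing with a hash lookup and no grep. The file names, the canonical_name helper and the %pair hash are assumptions for illustration; adapt the regex to your real names:

      ```perl
      use strict;
      use warnings;

      # Made-up directory listings.
      my @alpha = qw(abc_12342_tick.xml.alphaprod xyz_1_tick.xml.alphaprod);
      my @gamma = qw(abc_12342_tick.xml.gammaprod def_7_tick.xml.gammaprod);

      # Hypothetical helper: drop ".xml.<environment>" to get the common form.
      sub canonical_name {
          my ($name) = @_;
          (my $key = $name) =~ s/\.xml\.\w+\z//;
          return $key;
      }

      # Index the gamma files by canonical name (one pass over @gamma).
      my %right;
      $right{ canonical_name($_) } = $_ for @gamma;

      # One pass over @alpha; each lookup is a single hash access.
      my %pair;
      for my $file (@alpha) {
          my $key = canonical_name($file);
          $pair{$file} = $right{$key} if exists $right{$key};
      }

      print "$_ => $pair{$_}\n" for sort keys %pair;
      ```

      With this sample data only abc_12342_tick has a partner on both sides, so %pair ends up with a single entry. Files without a partner can be collected the same way as the left-only and right-only keys.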