Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

This goes beyond what diff3 can do. diff3 relies on a "common ancestor" model in order to simplify what it does.

There is no unique solution to this problem. You can go about finding the commonality in too many different ways that could result in different output (as your hinted at).

But one way to approach this problem is to select one list and 'diff' each of the other lists against it. Then you merge (as in, use the merge algorithm) these N-1 'diff's (using position in the 'selected' list to control the merging), pushing out common elements when you merge a matched element and collecting new sublists when merging unmatched chunks. When you find a match M, before pushing it out, any collected sublists need to be diffed and merged and all but trailing unmatched elements pushed out, except that even trailing unmatched elements need to be pushed out for lists that participate in the match M.

For example, start with these lists:

X: a b c j k g e h i Y: a b d e f h i Z: a z c d f g h i

Then diff X/Y and X/Z:

X: a b [c j k g e] h i Y: a b [d e f] h i x: a [b] c [j k] g [e] h i Z: a [z] c [d f] g [ ] h i

Now go down the elements of X, merging as you go. First, 'a' matches so push it out in the places where it matches:

X: a | b [c j k g e] h i Y: a | b [d e f] h i x: | [b] c [j k] g [e] h i Z: a | [z] c [d f] g [ ] h i

Next, 'b' matches, so push it out similarly. But this time we have reached an unmatched chunk in Z so it gets collected and will need to be dealt with when we get forced to:

X: a b | [c j k g e] h i Y: a b | [d e f] h i x: | c [j k] g [e] h i Z: a [z] | c [d f] g [ ] h i

Next comes 'c'. It matches so needs to be pushed out, but before we can do that we have to collect more unmatched chunks (from Y this time):

X: a b | c | [j k g e] h i Y: a b [d e f] | | h i x: | | [j k] g [e] h i Z: a [z] | c | [d f] g [ ] h i

Diffing Y/Z finds no matches. So we don't push out the trailing unmatched items in Y's collected sublist. But we have to push out Z's trailing unmatched item since Z participates in pending match of 'c'. Then we can push out 'c':

X: a b c | [j k g e] h i Y: a b [d e f] | h i x: | [j k] g [e] h i Z: a z c | [d f] g [ ] h i

Next comes 'j' and 'k' which don't match so push them out as unique to X and collect more sublist items for Z:

X: a b c j k | g [e] h i Y: a b [d e f] | h i x: | [e] h i Z: a z c [d f] | g [ ] h i

Next comes 'g', which matches so we want to push it out but we first must collect sublists and diff/merge them:

X: a b c j k | g | [e] h i Y: a b [d e f] | | h i x: | | [e] h i Z: a z c [d f] | g | [ ] h i

The Y/Z diff gives us matches so we do a merge on that, pushing out 'd':

X: a b c j k g | [e] h i Y: a b d [[e] f] | h i x: | [e] h i Z: a z c d [ f] g | [ ] h i

Then we have 'e' unmatched in our selected list so we push it out. Then we push out 'f'. So we pop back and push out 'g' (and there were no trailing unmatched items so our collected lists are all empty now):

X: a b c j k g | [e] h i Y: a b d e f | h i x: | [e] h i Z: a z c d f g | [ ] h i

Next is 'e' which is unmatched so we push it out.

X: a b c j k g e | h i Y: a b d e f | h i x: | h i Z: a z c d f g | h i

Then push out 'h' and 'i':

X: a b c j k g e h i Y: a b d e f h i x: h i Z: a z c d f g h i

To implement this, I'd always 'select' the last list so that $diff[0] is the diff between @{$seq[0]} and @{$seq[-1]}. And I'd make an object that lets me move down a diff one element (of the second list within the diff) at a time and knows how to collect the sublist. The top-level code might look something like this (to simplify the code I've transposed the output matrix from how you expected it):

my @seq= GetSequences(); # ( \@seq0, \@seq1, \@seq2, ... ); my @out= DiffMerge( 1, @seq ); sub DiffMerge { my( $finish, @seq )= @_; my @diff= map { DiffToMerge->new( $_, $seq[-1] ) } @seq[ 0 .. $#seq-1 ]; my @out; for my $i ( 0 .. $#{ $seq[-1] } ) { my @row; my $same= 0; my %sublists; my @flush; for my $d ( 0 .. $#diff ) { for( $diff[$d]->SubList() ) { $sublists{$d}= $_ if $_; } if( $diff[$d]->Same($i) ) { $same++; push @flush, $d if $sublists{$d}; push @row, $diff[$d]->Shift(); } else { push @row, undef; } } if( @flush ) { for my $row ( DiffMerge( 0, values %sublists ) ) { my @subrow; for my $d ( keys %sublist ) { $subrow[$d]= $diff[$d]->SublistOffset() + shift @$ +row; } push @out, \@subrow; } FlushSublist( \@out, $_, $diff[$_] ) for @flush; } if( $same ) { push @row, $i; push @out, \@row; } else { my @r; $r[@diff]= $i; push @out, \@r; } } if( $finish ) { FlushSublist( \@out, $_, $diff[$_] ) for 0 .. $#diff; } return @out; }

(Above code updated.)

- tye        


In reply to Re: Reconciling multiple lists (similar to "merge" in CVS?) (diffN) by tye
in thread Reconciling multiple lists (similar to "merge" in CVS?) by japhy

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (9)
As of 2024-03-28 09:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found