Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re: Reconciling multiple lists (similar to "merge" in CVS?) (diffN)

by tye (Sage)
on Jan 06, 2007 at 07:22 UTC ( [id://593272]=note: print w/replies, xml ) Need Help??


in reply to Reconciling multiple lists (similar to "merge" in CVS?)

This goes beyond what diff3 can do. diff3 relies on a "common ancestor" model in order to simplify what it does.

There is no unique solution to this problem. You can go about finding the commonality in too many different ways that could result in different output (as your hinted at).

But one way to approach this problem is to select one list and 'diff' each of the other lists against it. Then you merge (as in, use the merge algorithm) these N-1 'diff's (using position in the 'selected' list to control the merging), pushing out common elements when you merge a matched element and collecting new sublists when merging unmatched chunks. When you find a match M, before pushing it out, any collected sublists need to be diffed and merged and all but trailing unmatched elements pushed out, except that even trailing unmatched elements need to be pushed out for lists that participate in the match M.

For example, start with these lists:

X: a b c j k g e h i Y: a b d e f h i Z: a z c d f g h i

Then diff X/Y and X/Z:

X: a b [c j k g e] h i Y: a b [d e f] h i x: a [b] c [j k] g [e] h i Z: a [z] c [d f] g [ ] h i

Now go down the elements of X, merging as you go. First, 'a' matches so push it out in the places where it matches:

X: a | b [c j k g e] h i Y: a | b [d e f] h i x: | [b] c [j k] g [e] h i Z: a | [z] c [d f] g [ ] h i

Next, 'b' matches, so push it out similarly. But this time we have reached an unmatched chunk in Z so it gets collected and will need to be dealt with when we get forced to:

X: a b | [c j k g e] h i Y: a b | [d e f] h i x: | c [j k] g [e] h i Z: a [z] | c [d f] g [ ] h i

Next comes 'c'. It matches so needs to be pushed out, but before we can do that we have to collect more unmatched chunks (from Y this time):

X: a b | c | [j k g e] h i Y: a b [d e f] | | h i x: | | [j k] g [e] h i Z: a [z] | c | [d f] g [ ] h i

Diffing Y/Z finds no matches. So we don't push out the trailing unmatched items in Y's collected sublist. But we have to push out Z's trailing unmatched item since Z participates in pending match of 'c'. Then we can push out 'c':

X: a b c | [j k g e] h i Y: a b [d e f] | h i x: | [j k] g [e] h i Z: a z c | [d f] g [ ] h i

Next comes 'j' and 'k' which don't match so push them out as unique to X and collect more sublist items for Z:

X: a b c j k | g [e] h i Y: a b [d e f] | h i x: | [e] h i Z: a z c [d f] | g [ ] h i

Next comes 'g', which matches so we want to push it out but we first must collect sublists and diff/merge them:

X: a b c j k | g | [e] h i Y: a b [d e f] | | h i x: | | [e] h i Z: a z c [d f] | g | [ ] h i

The Y/Z diff gives us matches so we do a merge on that, pushing out 'd':

X: a b c j k g | [e] h i Y: a b d [[e] f] | h i x: | [e] h i Z: a z c d [ f] g | [ ] h i

Then we have 'e' unmatched in our selected list so we push it out. Then we push out 'f'. So we pop back and push out 'g' (and there were no trailing unmatched items so our collected lists are all empty now):

X: a b c j k g | [e] h i Y: a b d e f | h i x: | [e] h i Z: a z c d f g | [ ] h i

Next is 'e' which is unmatched so we push it out.

X: a b c j k g e | h i Y: a b d e f | h i x: | h i Z: a z c d f g | h i

Then push out 'h' and 'i':

X: a b c j k g e h i Y: a b d e f h i x: h i Z: a z c d f g h i

To implement this, I'd always 'select' the last list so that $diff[0] is the diff between @{$seq[0]} and @{$seq[-1]}. And I'd make an object that lets me move down a diff one element (of the second list within the diff) at a time and knows how to collect the sublist. The top-level code might look something like this (to simplify the code I've transposed the output matrix from how you expected it):

my @seq= GetSequences(); # ( \@seq0, \@seq1, \@seq2, ... ); my @out= DiffMerge( 1, @seq ); sub DiffMerge { my( $finish, @seq )= @_; my @diff= map { DiffToMerge->new( $_, $seq[-1] ) } @seq[ 0 .. $#seq-1 ]; my @out; for my $i ( 0 .. $#{ $seq[-1] } ) { my @row; my $same= 0; my %sublists; my @flush; for my $d ( 0 .. $#diff ) { for( $diff[$d]->SubList() ) { $sublists{$d}= $_ if $_; } if( $diff[$d]->Same($i) ) { $same++; push @flush, $d if $sublists{$d}; push @row, $diff[$d]->Shift(); } else { push @row, undef; } } if( @flush ) { for my $row ( DiffMerge( 0, values %sublists ) ) { my @subrow; for my $d ( keys %sublist ) { $subrow[$d]= $diff[$d]->SublistOffset() + shift @$ +row; } push @out, \@subrow; } FlushSublist( \@out, $_, $diff[$_] ) for @flush; } if( $same ) { push @row, $i; push @out, \@row; } else { my @r; $r[@diff]= $i; push @out, \@r; } } if( $finish ) { FlushSublist( \@out, $_, $diff[$_] ) for 0 .. $#diff; } return @out; }

(Above code updated.)

- tye        

Replies are listed 'Best First'.
Re^2: Reconciling multiple lists (similar to "merge" in CVS?) (diffN)
by tye (Sage) on Jan 06, 2007 at 09:44 UTC

    And here is rather ugly but working code:

    #!/usr/bin/perl -w use strict; use Algorithm::Diff 1.19 (); Main( @ARGV ); exit( 0 ); { package DiffToMerge; sub new { my $class= shift @_; my $diff= Algorithm::Diff->new( @_ ); $diff->Next(); # Start out at first chunk my $me= bless { diff => $diff, offset => 0, }, $class; if( $diff->Diff() ) { $me->{sublist}= [ $diff->Items( 1 ) ]; $me->{suboff}= $diff->Min( 1 ); } return $me; } sub Same { my( $me, $off )= @_; for( $me->{offset} ) { die if $off < $_; die if $_+1 < $off; $_= $off; } my $diff= $me->{diff}; if( $diff->Max(2) < $off ) { die if ! $diff->Next(); if( $diff->Diff() ) { $me->{sublist}= [ $diff->Items( 1 ) ]; $me->{suboff}= $diff->Min( 1 ); } if( ! $diff->Range(2) ) { die if $off <= $diff->Max(2); die if ! $diff->Next(); } die if $off < $diff->Min(2); die if $diff->Max(2) < $off; } return $diff->Same(); } sub Shift { my( $me )= @_; my $diff= $me->{diff}; die if ! $diff->Same(); return $diff->Min(1) + $me->{offset}++ - $diff->Min(2); } sub Sublist { my( $me )= @_; my $sublist= $me->{sublist}; return if ! $sublist || ! @$sublist; return $sublist; } sub SublistOffset { my( $me, $offset )= @_; return undef if ! defined $offset; shift @{ $me->{sublist} }; return $me->{suboff}++; } sub SublistIdxs { my( $me )= @_; my $beg= $me->{suboff}; $me->{suboff} += @{ $me->{sublist} || [] }; $me->{sublist}= []; return $beg .. $me->{suboff}-1; } } sub FlushSublist { my( $avOut, $d, $diff )= @_; for( $diff->SublistIdxs() ) { my @row; $row[$d]= $_; push @$avOut, \@row; } } sub DiffMerge { my( $finish, @seq )= @_; my @diff= map { DiffToMerge->new( $_, $seq[-1] ) } @seq[ 0 .. $#seq-1 ]; my @out; for my $i ( 0 .. $#{ $seq[-1] } ) { my @row; my %sublists; my @flush; for my $d ( 0 .. $#diff ) { my $same= $diff[$d]->Same($i); for( $diff[$d]->Sublist() ) { $sublists{$d}= $_ if $_; } if( $same ) { push @flush, $d if $sublists{$d}; $row[$d]= $diff[$d]->Shift(); } } if( @flush ) { for my $row ( DiffMerge( 0, values %sublists ) ) { my @subrow; for my $d ( keys %sublists ) { $subrow[$d]= $diff[$d]->SublistOffset( shift @$row + ); } push @out, \@subrow; } FlushSublist( \@out, $_, $diff[$_] ) for @flush; } $row[@diff]= $i; push @out, \@row; } if( $finish ) { for( 0 .. $#diff ) { FlushSublist( \@out, $_, $diff[$_] ) if $diff[$_]->Sublist(); } } return @out; } sub Main { my @seq= @_; @seq= qw( abdefhi azcdfghi abcjkgnhi ) if ! @seq; @seq= map [ /./gs ], @seq; my @out= DiffMerge( 1, @seq ); for( @seq ) { push @$_, '-'; } for( @out ) { for( @$_ ) { $_= '-' if ! defined $_; } print "( @$_ )\n"; } print $/; for( @seq ) { my @l= map shift @$_, @out; for( @l ) { $_= -1 if ! defined $_ || '-' eq $_; } print "( @$_[@l] )\n"; } }

    can be used as:

    % perl diffN.pl ( 0 0 0 ) ( 1 - 1 ) ( 2 - ) ( 3 - ) ( 4 - ) ( - 1 ) ( - 2 2 ) ( - - 3 ) ( - - 4 ) ( - 3 ) ( - 4 ) ( - 5 5 ) ( - - 6 ) ( 5 6 7 ) ( 6 7 8 ) ( a b d e f - - - - - - - - h i ) ( a - - - - z c - - d f g - h i ) ( a b - - - - c j k - - g n h i ) % perl diffN.pl tye says japhy asked ( - - 0 ) ( - 0 ) ( - 1 1 0 ) ( - - 2 ) ( - - 3 ) ( 0 - - ) ( 1 2 4 ) ( - 3 - 1 ) ( - - - 2 ) ( 2 - - 3 ) ( - - - 4 ) ( - - - - - t y - - e - ) ( - s a - - - y s - - - ) ( j - a p h - y - - - - ) ( - - a - - - - s k e d )

    Update: And here is a case with a bug to be fixed:

    % perl diffN.pl encyclopedia cyclops enclosure wikipedia cyclone wick +ed lonely ( - - - 0 - 0 ) ( - - - 1 - 1 ) ( 0 - 0 - - - ) ( 1 - 1 - - - ) ( 2 0 2 - 0 2 ) ( - - - 2 - 3 ) ( 3 ) ( 4 ) ( - 1 ) ( - 2 ) ( - - - - 1 ) ( - - - - 2 ) ( 5 3 3 - 3 - 0 ) ( 6 4 4 - 4 - 1 ) ( - - - - 5 - 2 ) ( - - - 3 ) ( 7 5 - 4 ) ( - 6 5 - ) ( - - 6 - ) ( - - 7 - ) ( 8 - 8 5 6 4 3 ) ( - - - - - - 4 ) ( - - - - - - 5 ) ( 9 ) ( 10 ) ( 11 ) ( - - - 6 ) ( - - - 7 ) ( - - - 8 ) ( - - - - - 5 ) ( - - e n c - y c - - - - l o - - p - - - e - - d i a - - - - ) ( - - - - c - - - y c - - l o - - p s - - - - - - - - - - - - ) ( - - e n c - - - - - - - l o - - - s u r e - - - - - - - - - ) ( w i - - - k - - - - - - - - - i p - - - e - - - - - d i a - ) ( - - - - c - - - - - y c l o n - - - - - e - - - - - - - - - ) ( w i - - c k - - - - - - - - - - - - - - e - - - - - - - - d ) ( - - - - - - - - - - - - l o n - - - - - e l y - - - - - - - ) # ^ ^ ^ ^ ^ ^ < should merge > ^ ^ ^ ^ ^ ^ ^

    - tye        

      Playing around with some code for this, there are often multiple ways in which differences could be merged. For example, from your test above:

      encyclo pedia enc losur e #or encyclop edia enc lo sure

      To me, these are both equally legitimate, but is there any criteria upon which the one could be determined to be preferrable to the other?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://593272]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (2)
As of 2024-04-24 04:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found