in reply to Complicated pattern match

After tinkering for a while trying to make a pattern match produce sensible results, I stepped back and tried to think about the problem at its core. What you really want to find is a "longest common subsequence" between the strings. There's a very good module on CPAN which does just this: Algorithm::Diff.
#!/usr/bin/perl -w
use strict;
use Algorithm::Diff qw(traverse_sequences);

# construct arrayrefs containing an array of single chars
my ($A, $B) = map [/(.)/sg], qw(
    ATGGAGTCGACGAATTTGAAGAAT
    xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT
);

my $prev = '';
my @seq;

traverse_sequences(
    $A, $B,
    {
        MATCH => sub {
            my ($aidx, $bidx) = @_;
            if('=' ne $prev) { push @seq, ''; $prev = '='; }
            $seq[-1] .= $A->[$aidx];
        },
        DISCARD_A => sub { die "Sequence in A is not fully contained in B" },
        DISCARD_B => sub {
            my ($aidx, $bidx) = @_;
            if('!' ne $prev) { push @seq, ''; $prev = '!'; }
            $seq[-1] .= $B->[$bidx];
        },
    },
);

print "@seq\n";

__END__
xxxxxx ATGGAG yxxx TCGA zxxxx CGAATTTGAA xxw GAAT
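Even this falls short on "actual" data, though. Working against GATAGCATGGAGGCCATCGATAACGCGAATTTGAATTTGAAT it produces G AT A G CAT G G AG GCCA TCGA TAA CG CG AATTTGAA TTT GAAT. Every character of the short sequence is still matched, but ATGGAG, for instance, comes back in pieces as AT, G, G and AG.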
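
A bare-bones LCS makes it easier to see what is actually being optimized. The sketch below is the textbook dynamic-programming version, not what Algorithm::Diff does internally, and lcs_string is a name made up for this example. It only needs core Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Textbook dynamic-programming LCS; returns one longest common subsequence.
# lcs_string is a hypothetical helper, not part of Algorithm::Diff.
sub lcs_string {
    my ($x, $y) = @_;
    my @x = split //, $x;
    my @y = split //, $y;

    # $len[$i][$j] = LCS length of the suffixes starting at $x[$i] and $y[$j]
    my @len = map { [ (0) x (@y + 1) ] } 0 .. @x;
    for my $i (reverse 0 .. $#x) {
        for my $j (reverse 0 .. $#y) {
            if ($x[$i] eq $y[$j]) {
                $len[$i][$j] = $len[$i + 1][$j + 1] + 1;
            }
            else {
                my ($down, $right) = ($len[$i + 1][$j], $len[$i][$j + 1]);
                $len[$i][$j] = $down >= $right ? $down : $right;
            }
        }
    }

    # Walk the table from the front to recover one LCS.
    my ($i, $j, $out) = (0, 0, '');
    while ($i < @x && $j < @y) {
        if ($x[$i] eq $y[$j])                         { $out .= $x[$i]; $i++; $j++ }
        elsif ($len[$i + 1][$j] >= $len[$i][$j + 1])  { $i++ }
        else                                          { $j++ }
    }
    return $out;
}

my $A = 'ATGGAGTCGACGAATTTGAAGAAT';
my $B = 'GATAGCATGGAGGCCATCGATAACGCGAATTTGAATTTGAAT';
my $lcs = lcs_string($A, $B);
print length($lcs), " of ", length($A), " characters matched\n";
```

All 24 characters of the short string are matched, so as far as the LCS is concerned the job is done perfectly. Nothing in the algorithm rewards keeping the matched characters next to each other, which is why the fragments come back chopped up.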

Makeshifts last the longest.