But doesn't work. Where did I go wrong?

It isn't as simple as that: my original approach relied on treating each ambiguous [ACGT] in the source string as a regexp, but to do that it needs to know a) which is the fixed string and which the regexp, and b) that they are not both regexps.

The simplest extension is below, but it is slower still, and will probably be too slow if you're dealing with strings a few thousand base pairs long:

sub mismatches {
  my($source, $target) = @_;
  my @sparts = ($source =~ /(\[.*?\]|.)/g);
  my @tparts = ($target =~ /(\[.*?\]|.)/g);
  scalar grep { 
    my($s, $t) = ($sparts[$_], $tparts[$_]);
    $s !~ /\[/ ? ($s !~ /$t/)
    : $t !~ /\[/ ? ($t !~ /$s/)
    : !intersect($s, $t) 
  } 0 .. $#sparts;
}

sub intersect {
  my($s, $t) = @_;
  my %seen = map +($_ => 1), $s =~ /[^\[\]]/g;
  scalar grep $seen{$_}, $t =~ /[^\[\]]/g;
}

This says: if source is not ambiguous, treat the corresponding fragment of the target as a regexp; else if the target is not ambiguous, treat the source fragment as a regexp; else check a full intersection of the two.
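To make the three cases concrete, here is a minimal tracing sketch that walks one pair of strings position by position and reports which rule fired. The helper name compare_position and the sample strings are only for illustration; the logic is meant to mirror the three branches described above, not to replace the code.

```perl
use strict;
use warnings;

# Hypothetical tracing helper: for one position, return [rule used, mismatch flag].
sub compare_position {
    my ($s, $t) = @_;
    if ($s !~ /\[/) {
        # Source is a single base: use the target fragment as the regexp.
        return ['target as regexp', $s !~ /$t/ ? 1 : 0];
    }
    if ($t !~ /\[/) {
        # Target is a single base: use the source fragment as the regexp.
        return ['source as regexp', $t !~ /$s/ ? 1 : 0];
    }
    # Both are ambiguous: they agree if they share at least one base.
    my %seen = map { $_ => 1 } $s =~ /[^\[\]]/g;
    my $shared = grep { $seen{$_} } $t =~ /[^\[\]]/g;
    return ['intersection', $shared ? 0 : 1];
}

# Split each string into per-position fragments: a [..] class or a single base.
my @sparts = ('[TCG]GGGG[AT]' =~ /(\[.*?\]|.)/g);
my @tparts = ('AGGGG[CT]'     =~ /(\[.*?\]|.)/g);

my $total = 0;
for my $i (0 .. $#sparts) {
    my ($rule, $miss) = @{ compare_position($sparts[$i], $tparts[$i]) };
    $total += $miss;
    printf "%d: %-5s vs %-4s  %-16s %s\n",
        $i, $sparts[$i], $tparts[$i], $rule, $miss ? 'mismatch' : 'ok';
}
print "total mismatches: $total\n";
```

With these inputs only position 0 should be reported as a mismatch: A is not in [TCG], while [AT] and [CT] share T.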

If your strings only include ACGT, a more efficient approach would be to transform each string into a bit vector that sets a bit for each base that may be present:

my %bits = ('A' => 1, 'C' => 2, 'G' => 4, 'T' => 8);
my $source1 = bitwise('[TCG]GGGG[AT]');
my $target1 = bitwise('AGGGG[CT]');
print mismatches($source1, $target1), "\n";

sub mismatches {
  my($source, $target) = @_;
  # String-wise AND: a NUL character means the two positions share no base.
  ($source & $target) =~ tr/\0//;
}

sub bitwise {
  my $string = shift;
  join '', map {
    my $char = 0;
    $char |= $bits{$_} for /[ACGT]/g;
    chr($char)
  } $string =~ /(\[.*?\]|.)/g;
}

Once the strings are transformed into this bitwise representation, checking for mismatches is very fast even with long strings.
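Since the thread asks about mismatch positions, the same representation can also report where the mismatches are rather than just how many. This is a rough sketch under the same ACGT-only assumption; mismatch_positions is a name made up for this example, and positions are 0-based.

```perl
use strict;
use warnings;

my %bits = (A => 1, C => 2, G => 4, T => 8);

# Same transformation as above: one character per position, one bit per allowed base.
sub bitwise {
    my $string = shift;
    join '', map {
        my $char = 0;
        $char |= $bits{$_} for /[ACGT]/g;
        chr($char)
    } $string =~ /(\[.*?\]|.)/g;
}

# Hypothetical extension: list the 0-based positions where the AND is NUL.
sub mismatch_positions {
    my ($source, $target) = @_;
    my $and = $source & $target;          # string-wise AND of the two vectors
    my @pos;
    push @pos, $-[0] while $and =~ /\0/g; # $-[0] is the offset of each NUL found
    return @pos;
}

my @pos = mismatch_positions(bitwise('[TCG]GGGG[AT]'), bitwise('AGGGG[CT]'));
print "mismatch at: @pos\n";
```

Note that this relies on the traditional string behaviour of &. If the bitwise feature is enabled (for example through use v5.28 or later), the string form is spelled &. instead.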

Hugo

In reply to Re^3: Mismatch Positions of Ambiguous String by hv
in thread Mismatch Positions of Ambiguous String by monkfan
