in reply to Re: Find number of short words in long word
in thread Find number of short words in long word

I don't know if it is important for the OP, but your code doesn't find overlapping instances:

my $count =()= 'abrabrabrabra' =~ /brabra/g;
print "Found $count instances\n";   # prints "Found 2 instances" (not 3)

Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

Re^3: Find number of short words in long word
by moritz (Cardinal) on Jul 14, 2009 at 20:50 UTC
    ... and if it's something you need to fix, you can do so easily by packing everything but the first character into a look-ahead:
    my $count =()= 'abrabrabrabra' =~ /b(?=rabra)/g;

    Of course, in this case you know that the first possible overlap starts at the second b, so even /bra(?=bra)/ works fine.
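
A quick way to check this is to run the plain pattern and both look-ahead variants against the same string. This is only a minimal sketch; the helper name count_matches is made up for illustration and does not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count the /g matches of a
# compiled pattern by assigning the match list to an empty list.
sub count_matches {
    my ($string, $re) = @_;
    my $count = () = $string =~ /$re/g;
    return $count;
}

my $string = 'abrabrabrabra';

# Plain literal: each match consumes six characters, so overlaps are skipped.
my $plain  = count_matches($string, qr/brabra/);

# Only the leading "b" is consumed; the rest sits in the look-ahead.
my $first  = count_matches($string, qr/b(?=rabra)/);

# "bra" is consumed, which stops just before the earliest possible overlap.
my $second = count_matches($string, qr/bra(?=bra)/);

print "plain: $plain, b(?=rabra): $first, bra(?=bra): $second\n";
# prints: plain: 2, b(?=rabra): 3, bra(?=bra): 3
```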

    (I haven't benchmarked it, but I suppose that literals outside a look-around are a bit faster, due to optimizations regarding the match length.)

      No need to split the string-to-be-searched-for up into first character/rest of the characters if the capture group is wrapped in a look-ahead.
      >perl -wMstrict -le
      "my $string = 'aBrabRabrAbra';
       my $pattern = qr{ brabra }xmsi;
       my $count =()= $string =~ m{ (?= ($pattern)) }xmsg;
       print $count;
       my @matches = $string =~ m{ (?= ($pattern)) }xmsg;
       print qq{@matches};
      "
      3
      BrabRa bRabrA brAbra
Re^3: Find number of short words in long word
by sedm1000 (Initiate) on Jul 14, 2009 at 20:54 UTC
    Thanks greatly for the start on this... All *possible* combinations are important; across more than 20K sequences they will likely all appear at least a couple of times. Overlapping instances are important too, so "AGCTGT" would need to be scored:
              AGCT GCTG CTGT TGTA etc.
    AGCTGT       1    1    1    0
    and so on...
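
One way such a score row could be produced is sketched below, using the look-ahead approach from earlier in the thread. The helper name score_words is an assumption made for illustration, and the word list is just the four sample words from the post rather than every possible combination.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count overlapping occurrences
# of each short word in one long sequence.
sub score_words {
    my ($sequence, @words) = @_;
    my %score;
    for my $word (@words) {
        # \Q...\E makes the word match literally. The look-ahead consumes
        # nothing, so overlapping occurrences of the same word all count.
        my $count = () = $sequence =~ /(?=\Q$word\E)/g;
        $score{$word} = $count;
    }
    return %score;
}

my @words = qw(AGCT GCTG CTGT TGTA);
my %score = score_words('AGCTGT', @words);

# Print a header row of words, then the scores in the same order.
print join(' ', @words), "\n";
print join(' ', map { sprintf '%4d', $score{$_} } @words), "\n";
```

For the sample sequence this gives scores of 1, 1, 1 and 0, matching the row in the post. Each word is searched for separately, so words that overlap each other need no special handling.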

      Overlapping instances of different patterns would match fine, because you'd be searching for them separately.

      The problem is when the last few characters of a search pattern are the same as the first few characters, and two matches of the same pattern could overlap... it is those cases where you need the lookaheads.


      'ACTACTA' for example; when searching for 'ACTA', should that score two matches or just one?

      If you want it to be two, you need the lookaheads. If you want it to be just one match, then the regex pattern is simply 'ACTA', but it sounds like you want the lookaheads.
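
The difference between the two choices can be seen directly on the 'ACTACTA' example. This is a small sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $sequence = 'ACTACTA';

# Plain pattern: the first match consumes "ACTA", so the search resumes
# at offset 4 and only "CTA" is left to look at.
my $plain = () = $sequence =~ /ACTA/g;

# Look-ahead: nothing is consumed, so the occurrence starting at
# offset 3 is found as well.
my $overlapping = () = $sequence =~ /(?=ACTA)/g;

print "plain: $plain, with look-ahead: $overlapping\n";
# prints: plain: 1, with look-ahead: 2
```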