sjossie has asked for the wisdom of the Perl Monks concerning the following question:

I am a complete beginner to perl so this may or may not be easy.

I have what is effectively a 31,367-word string called $sequences, with each "word" separated by a normal space, although these could be changed to #'s or any other delimiter if that would make life easier. Some words are up to 26,000 characters long.

What I would like to do is look for the occurrence, within any single "word", of two strings $quart1 and $quart2, which may or may not overlap. For example, with $quart1 = dftg and $quart2 = ftgh, the word cedftghyjhg would still be a hit.

Is there some way of doing a pattern match across the whole string $sequences, rather than dividing it up into single "words"?

i.e. of the form ...

if ($sequences =~ m/ \s .* (quart1 quart2)allowing for overlap .* \s / ) {

Does this make any sense?

Many thanks for any tips.

Replies are listed 'Best First'.
Re: Pattern matching problem
by Perl Mouse (Chaplain) on Oct 12, 2005 at 15:26 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $quart1 = "dftg";
    my $quart2 = "ftgh";

    while (<DATA>) {
        print if /(?:^|\s)(?=\S*$quart1)(?=\S*$quart2)/;
    }

    __DATA__
    cedftghyjhg
    foo cedftghyjhg bar
    dftg ftgh
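The two lookaheads in that pattern each start at the beginning of a word and scan only non-space characters, so both strings must occur inside the same word, in either order and with any overlap. A minimal sketch of how the same idea might be run once over the whole $sequences string, as the question asks, follows. The sample value of $sequences and the @hits array are assumptions made for illustration, and \Q...\E has been added so the search strings are treated literally.

```perl
use strict;
use warnings;

# Illustrative data only; the real $sequences holds about 31,000 words.
my $sequences = "abc cedftghyjhg xyz dftg ftgh";
my ($quart1, $quart2) = ("dftg", "ftgh");

# At each word start, both lookaheads must succeed within the same run of
# non-space characters; the word itself is then captured.
my @hits = $sequences =~ /(?:^|\s)(?=\S*\Q$quart1\E)(?=\S*\Q$quart2\E)(\S+)/g;

print "Found both in $_\n" for @hits;
# prints: Found both in cedftghyjhg
```

The last two words of the sample contain one string each, so they are not reported.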
    Perl --((8:>*
Re: Pattern matching problem
by Roy Johnson (Monsignor) on Oct 12, 2005 at 15:32 UTC
    No clear way of finding two matches within a word within a series of words occurs to me (although Perl Mouse came up with one). I think the straightforward approach of splitting into words and then doing two checks per word is the way to go.
    for (split ' ', $sequences) {
        if (/\Q$quart1/ and /\Q$quart2/) {
            print "Found both in $_\n";
        }
    }
    The one issue to be concerned about is that the split could fill up memory. An alternative way to extract the words one at a time would be:
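    # A scalar-context //g match advances one word per iteration; matching in
    # list context here would restart from the beginning every time.
    while ($sequences =~ /(\S+)/g) {
        my $word = $1;
        if (index($word, $quart1) >= 0 and index($word, $quart2) >= 0) {
            print "Found both in $word\n";
        }
    }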

    Caution: Contents may have been coded under pressure.
Re: Pattern matching problem
by Zaxo (Archbishop) on Oct 12, 2005 at 15:47 UTC

    For literal strings, I like index,

    my ($loc, %location) = -1;
    for ($quart1, $quart2) {    # or better, @quart
        # The parentheses are required: != binds more tightly than =
        while (-1 != ($loc = index $sequences, $_, $loc + 1)) {
            push @{$location{$_}}, $loc;
        }
    }
    There is some tricky fenceposting there with $loc. The %location hash will wind up containing all the locations in the string of each $quart which is found at least once. You will be able to test if a string is present at all with exists $location{$str}.

    This code doesn't care at all about overlap, and will easily find the presence of both in those cases.
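To make the fenceposting concrete, here is a small worked sketch of the index loop on a short sample string. The sample $sequences value and the extra search string zzzz are assumptions added for illustration. Each search starts one character past the previous hit, and $loc is back at -1 when a search runs out of matches, so the next string starts again from offset 0.

```perl
use strict;
use warnings;

# Illustrative data only.
my $sequences = "cedftghyjhg foo dftg";
my @quart     = ("dftg", "ftgh", "zzzz");

my %location;
for my $str (@quart) {
    my $loc = -1;    # so the first search starts at offset 0
    while (-1 != ($loc = index($sequences, $str, $loc + 1))) {
        push @{ $location{$str} }, $loc;
    }
}

for my $str (@quart) {
    if (exists $location{$str}) {
        print "$str: @{ $location{$str} }\n";
    }
    else {
        print "$str: not found\n";
    }
}
# prints:
# dftg: 2 16
# ftgh: 3
# zzzz: not found
```

The offsets are positions in the whole string, so on their own they do not say whether two hits fall inside the same word; that would need a further check against the positions of the separators.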

    After Compline,
    Zaxo