Pattern matching problem

sjossie has asked for the wisdom of the Perl Monks concerning the following question:

I am a complete beginner to Perl, so this may or may not be easy.

I have what is effectively a 31,367-word string called $sequences, with each "word" separated by a normal space, although these could be changed to #'s or any other delimiter if that would make life easier. Some words are up to 26,000 characters long.

What I would like to do is look for the occurrence, within any one of the separate "words", of two strings $quart1 and $quart2, which may or may not overlap. E.g. in the word cedftghyjhg, $quart1 = dftg and $quart2 = ftgh would still be a hit.

Is there some way of doing a pattern match across the whole string $sequences, rather than dividing it up into single "words"?

i.e. something of the form ...

if ($sequences =~ m/ \s .* (quart1 quart2, allowing for overlap) .* \s / ) {

Does this make any sense?

Many thanks for any tips.

Re: Pattern matching problem by Perl Mouse (Chaplain) on Oct 12, 2005 at 15:26 UTC
```perl
#!/usr/bin/perl
use strict;
use warnings;

my $quart1 = "dftg";
my $quart2 = "ftgh";

while (<DATA>) {
    # Anchor at the start of a word, then require both strings somewhere
    # later in that same word (\S* cannot cross whitespace).
    print if /(?:^|\s)(?=\S*$quart1)(?=\S*$quart2)/;
}

__DATA__
cedftghyjhg
foo cedftghyjhg bar
dftg ftgh
```

Perl --((8:>*
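The lookahead idea in the reply above can also be pointed at the whole $sequences string at once, which is what the original question asked for. The following is only a minimal sketch under some assumptions: `words_with_both` is a hypothetical helper name, the sample string is made up, and `\Q...\E` is added so that the two search strings are treated as literal text rather than as regex syntax.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every whitespace-delimited word of $text
# that contains both $first and $second, whether or not they overlap.
sub words_with_both {
    my ($text, $first, $second) = @_;
    my @hits;
    # (?<!\S) pins the match to the start of a word; each lookahead then
    # scans forward with \S*, which cannot step past the end of that word.
    while ($text =~ /(?<!\S)(?=\S*\Q$first\E)(?=\S*\Q$second\E)(\S+)/g) {
        push @hits, $1;
    }
    return @hits;
}

# Made-up sample data for illustration only.
my $sequences = 'cedftghyjhg foo dftg ftgh xxftghdftgxx';
print "$_\n" for words_with_both($sequences, 'dftg', 'ftgh');
```

On this sample it should print cedftghyjhg and xxftghdftgxx; the words "dftg" and "ftgh" are skipped because each holds only one of the two strings.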
Re: Pattern matching problem by Roy Johnson (Monsignor) on Oct 12, 2005 at 15:32 UTC
No clear way of finding two matches within a word within a series of words occurs to me (although Perl Mouse came up with one). I think the straightforward approach of splitting into words and then doing two checks per word is the way to go.

```perl
for (split ' ', $sequences) {
    if (/\Q$quart1/ and /\Q$quart2/) {
        print "Found both in $_\n";
    }
}
```

The one issue to be concerned about is that the split could fill up memory. An alternative way to extract the words one at a time would be:

```perl
# A scalar-context //g match steps through the words one at a time.
while ($sequences =~ /(\S+)/g) {
    my $word = $1;
    if (index($word, $quart1) >= 0 and index($word, $quart2) >= 0) {
        print "Found both in $word\n";
    }
}
```

Caution: Contents may have been coded under pressure.
Re: Pattern matching problem by Zaxo (Archbishop) on Oct 12, 2005 at 15:47 UTC
For literal strings, I like index:

```perl
my ($loc, %location) = -1;
for ($quart1, $quart2) {    # or better, @quart
    # The assignment needs its own parentheses, because != binds more tightly than =.
    while (-1 != ($loc = index $sequences, $_, $loc + 1)) {
        push @{$location{$_}}, $loc;
    }
}
```

There is some tricky fenceposting there with $loc. The %location hash will wind up containing all the locations in the string of each $quart which is found at least once. You will be able to test if a string is present at all with `exists $location{$str}`. This code doesn't care at all about overlap, and will easily find the presence of both in those cases.

After Compline, Zaxo
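To make the shape of such a location hash concrete, here is a small self-contained sketch of the same index loop. The sub name `find_locations`, the sample string, and the extra search string 'zzzz' are assumptions made for illustration; they are not part of the reply above.

```perl
use strict;
use warnings;

# Hypothetical helper: maps each search string that occurs in $text to
# the list of offsets where it starts.
sub find_locations {
    my ($text, @needles) = @_;
    my %location;
    for my $needle (@needles) {
        my $loc = -1;    # so the first search starts at offset 0
        while (-1 != ($loc = index($text, $needle, $loc + 1))) {
            push @{ $location{$needle} }, $loc;
        }
    }
    return \%location;
}

# Made-up sample data for illustration only.
my $location = find_locations('cedftghyjhg foo dftg', 'dftg', 'ftgh', 'zzzz');
for my $needle (sort keys %$location) {
    print "$needle: @{ $location->{$needle} }\n";
}
```

On this sample it should report dftg at offsets 2 and 16 and ftgh at offset 3, with no entry at all for 'zzzz'. These are offsets into the whole string, so deciding whether two hits fall inside the same "word" would need an extra step, such as comparing them against the positions of the delimiters.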