Regex match at the beginning or end of string

cyber-guard has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regex match at the beginning or end of string by BrowserUk (Patriarch) on Feb 19, 2011 at 00:37 UTC
Try:

    [0] Perl> print "$_ : ", /(?=^.*fred)(?=^.*bill)/ ? 'matched' : 'no match'
        for qw[ fred&bill bill&fred bill&john fred&john john&bill john&fred sucker! ];;
    fred&bill : matched
    bill&fred : matched
    bill&john : no match
    fred&john : no match
    john&bill : no match
    john&fred : no match
    sucker! : no match

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Regex match at the beginning or end of string by AnomalousMonk (Archbishop) on Feb 19, 2011 at 03:07 UTC
"... match either pattern1pattern2 or pattern2pattern1 ..."

What this (rather than the title) implies to me is "match either pattern1 immediately followed by pattern2, or else pattern2 immediately followed by pattern1". This is not quite what BrowserUk's regex (or `/fred/ && /bill/` for that matter) matches.

    >perl -wMstrict -le
    "my @strings = qw(nowknow knownow know now no);
     ;;
     my $pattern1 = qr{ now }xms;
     my $pattern2 = qr{ know }xms;
     ;;
     my $regex1 = qr{ (?= \A .*? $pattern1) (?= \A .*? $pattern2) }xms;
     my $regex2 = qr{ $pattern1 $pattern2 | $pattern2 $pattern1 }xms;
     ;;
     for my $regex ($regex1, $regex2) {
         print qq{for regex $regex};
         for my $string (@strings) {
             print qq{ '$string' has },
                 $string =~ $regex ? 'a' : 'NO', ' match';
         }
     }
    "
    for regex (?msx-i: (?= \A .*? (?msx-i: now )) (?= \A .*? (?msx-i: know )) )
     'nowknow' has a match
     'knownow' has a match
     'know' has a match
     'now' has NO match
     'no' has NO match
    for regex (?msx-i: (?msx-i: now ) (?msx-i: know ) | (?msx-i: know ) (?msx-i: now ) )
     'nowknow' has a match
     'knownow' has a match
     'know' has NO match
     'now' has NO match
     'no' has NO match

Update: In general, the key consideration is not the 'length' of the regex. It's not hard to write a regex of a couple hundred characters that will run for the rest of your life, even if you live to be as old as Methuselah. Conversely, a regex of several thousand characters can run quickly (depending on your definition of 'quick').

One important speed consideration is to reduce the number of starting points in a string from which a match may be attempted. That's what the `^` and `\A` anchors do: a match may only occur at the start of the string. Another approach is to minimize backtracking by making a regex or regex sub-patterns 'atomic' with the `(?>pattern)` construct. See Extended Patterns in perlre. See also perlretut and perlrequick.
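The two ideas in the update (anchoring and atomic grouping) can be seen in a small sketch. The sub names below are hypothetical and chosen only for illustration. Note that an atomic group can change what a pattern matches, not only how fast it runs, so it is not a drop-in speed-up.

```perl
use strict;
use warnings;

# Plain quantifier: a+ can give back an 'a' so that the trailing "ab" fits.
sub matches_with_backtracking {
    my ($s) = @_;
    return $s =~ /a+ab/ ? 1 : 0;
}

# Atomic group: (?>a+) keeps every 'a' it consumed, so the 'a' of "ab"
# has nothing left to match and the pattern fails.
sub matches_with_atomic_group {
    my ($s) = @_;
    return $s =~ /(?>a+)ab/ ? 1 : 0;
}

# \A restricts the match attempt to the start of the string.
sub matches_anchored {
    my ($s) = @_;
    return $s =~ /\Aknow/ ? 1 : 0;
}

for my $s (qw( aaab know nowknow )) {
    printf "%-8s backtracking=%d atomic=%d anchored=%d\n",
        $s,
        matches_with_backtracking($s),
        matches_with_atomic_group($s),
        matches_anchored($s);
}
```

On 'aaab' the plain pattern succeeds by backtracking while the atomic version fails, which is why atomic groups are best applied where giving characters back could never help the match anyway.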
Re^3: Regex match at the beginning or end of string by BrowserUk (Patriarch) on Feb 19, 2011 at 09:17 UTC
"This is not quite what BrowserUk's regex (or /fred/ && /bill/ for that matter) matches."

The main advantage of the lookaheads over multiple regexes is that the approach extends linearly rather than compounding.

    #! perl -slw
    use strict;
    use List::Util qw[ shuffle ];

    my @terms = qw[ the quick brown fox jumps over the lazy dog ];

    my $re = join'', map "(?=^.*$_)", @terms;
    $re = qr/$re/;

    for( 1 .. 10 ) {
        my $input = join ' ', shuffle @terms;
        $input =~ $re and print "$input matched";
    }
    __END__
    C:\test>junk48
    quick brown the dog over fox lazy jumps the matched
    the dog over jumps the quick lazy brown fox matched
    jumps brown fox lazy quick the over dog the matched
    jumps brown dog over the fox lazy the quick matched
    over dog jumps fox the brown the lazy quick matched
    dog fox lazy the the quick over brown jumps matched
    lazy over brown dog quick the fox jumps the matched
    jumps fox quick brown the over lazy dog the matched
    over dog the lazy jumps quick brown fox the matched
    dog over lazy quick the the jumps brown fox matched

That is considerably easier than constructing and trying all 350,000+ regexes.

Another advantage is that it only takes a small tweak to deal with the situation where not just the ordering is uncertain, but some terms may also be omitted. A nice side effect is that you can use capturing to find out what was matched, because the captures are returned in a consistent order:

    #! perl -slw
    use strict;
    use List::Util qw[ shuffle ];

    my @terms = qw[ the quick brown fox jumps over the lazy dog ];

    my $re = join'', map "(?=^.*($_))?", @terms;
    $re = qr/$re/;

    for( 1 .. 10 ) {
        my $input = join ' ', (shuffle @terms)[ 1 .. 5 ];
        my @found = $input =~ $re;
        $_ //= 'n/a' for @found;
        print "Found [ @found ]\nin:'$input'";
    }
    __END__
    C:\test>junk48
    Found [ the n/a n/a fox n/a over the lazy dog ]
    in:'fox over lazy the dog'
    Found [ n/a n/a brown fox n/a over n/a lazy dog ]
    in:'fox lazy dog brown over'
    Found [ the quick brown fox n/a n/a the n/a dog ]
    in:'the quick dog fox brown'
    Found [ n/a n/a brown fox jumps over n/a lazy n/a ]
    in:'brown lazy over jumps fox'
    Found [ the quick brown n/a jumps n/a the lazy n/a ]
    in:'lazy the quick brown jumps'
    Found [ the quick n/a n/a jumps n/a the n/a dog ]
    in:'dog quick the the jumps'
    Found [ n/a quick brown fox jumps n/a n/a n/a dog ]
    in:'fox jumps quick brown dog'
    Found [ the n/a brown n/a n/a over the lazy dog ]
    in:'over lazy the dog brown'
    Found [ the n/a brown fox jumps n/a the n/a dog ]
    in:'the brown dog fox jumps'
    Found [ the quick n/a fox n/a over the lazy n/a ]
    in:'fox lazy over quick the'

The double matching of 'the' can be a good or bad thing depending upon your purpose.
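The terms in the scripts above are plain words. If the terms might contain regex metacharacters, a variation along the following lines may help. This is only a sketch: `all_terms_regex` is a hypothetical helper name, and the use of `quotemeta`, a single `\A` anchor, and the `/s` flag are assumptions layered on top of the approach in the post, not part of it.

```perl
use strict;
use warnings;

# Hypothetical helper: build one anchored regex that requires every term
# to appear somewhere in the string, in any order.
sub all_terms_regex {
    my @terms = @_;
    # quotemeta escapes regex metacharacters so each term is matched literally.
    my $lookaheads = join '', map { '(?=.*' . quotemeta($_) . ')' } @terms;
    # /s lets '.' cross newlines; \A pins every lookahead to the string start.
    return qr/\A$lookaheads/s;
}

my $re = all_terms_regex(qw( quick brown fox ));
for my $input ('fox brown quick', 'fox brown') {
    print "'$input': ", ($input =~ $re ? 'matched' : 'no match'), "\n";
}
```

Adding a term adds one lookahead, so the pattern still grows linearly with the number of terms.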
Re^4: Regex match at the beginning or end of string by JavaFan (Canon) on Feb 19, 2011 at 23:18 UTC
Re^5: Regex match at the beginning or end of string by BrowserUk (Patriarch) on Feb 20, 2011 at 01:03 UTC
Some notes below your chosen depth have not been shown here
Re^2: Regex match at the beginning or end of string by cyber-guard (Acolyte) on Feb 19, 2011 at 00:44 UTC
Thanks for the answer. Could you explain how the regex works? I can't quite get my head around it.
Re^3: Regex match at the beginning or end of string by toolic (Bishop) on Feb 19, 2011 at 01:48 UTC
That's a job for YAPE::Regex::Explain!

    The regular expression:

    (?-imsx:(?=^.*fred)(?=.*bill))

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      (?=                      look ahead to see if there is:
    ----------------------------------------------------------------------
        ^                        the beginning of the string
    ----------------------------------------------------------------------
        .*                       any character except \n (0 or more times
                                 (matching the most amount possible))
    ----------------------------------------------------------------------
        fred                     'fred'
    ----------------------------------------------------------------------
      )                        end of look-ahead
    ----------------------------------------------------------------------
      (?=                      look ahead to see if there is:
    ----------------------------------------------------------------------
        .*                       any character except \n (0 or more times
                                 (matching the most amount possible))
    ----------------------------------------------------------------------
        bill                     'bill'
    ----------------------------------------------------------------------
      )                        end of look-ahead
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------
Re^3: Regex match at the beginning or end of string by wind (Priest) on Feb 19, 2011 at 01:31 UTC
He used positive look-ahead assertions to test whether the two patterns 'fred' and 'bill' are both in the string being matched. Essentially, it's the same thing as saying /fred/ && /bill/. Also, it's anchored to increase performance, since if it doesn't match at the beginning of the string, it won't match at all.

perldoc - perlre: Just search for "Look-Around Assertions"
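The comparison with /fred/ && /bill/ can be checked with a short sketch. The sub names are hypothetical. One caveat worth hedging: because `.` does not match a newline by default, the two forms can disagree on multi-line strings unless the `/s` flag is added to the look-ahead version.

```perl
use strict;
use warnings;

# Both words required, in either order, using anchored look-aheads.
sub both_by_lookahead {
    my ($s) = @_;
    return $s =~ /(?=^.*fred)(?=^.*bill)/ ? 1 : 0;
}

# The same question asked with two separate matches.
sub both_by_and {
    my ($s) = @_;
    return ($s =~ /fred/ && $s =~ /bill/) ? 1 : 0;
}

for my $s ('fred&bill', 'bill&fred', 'bill&john', "fred\nbill") {
    (my $shown = $s) =~ s/\n/\\n/g;    # make the newline visible when printing
    printf "%-12s lookahead=%d and=%d\n",
        $shown, both_by_lookahead($s), both_by_and($s);
}
```

For single-line input the two subs agree; the last test string shows where they part ways.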
Re: Regex match at the beginning or end of string by wind (Priest) on Feb 19, 2011 at 00:37 UTC
Use qr to cache your independent regexes.

    my $pattern1 = qr{abc:$?.?$?\s};
    my $pattern2 = qr{${variable}.{0,5}\s};

    if ($str =~ /$pattern1(.*?)$pattern2/) {
        print "matched $1";
    }
    if ($str =~ /$pattern2(.*?)$pattern1/) {
        print "matched $1";
    }

Then the length of the regexes won't be an issue.