holdyourhorses has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks.

I am parsing a large chunk of text, extracting the lines where a regex matches within a given portion of each row. That is, the regex should match only if the pattern is found inside a predefined range of positions in the row.

For example, I have this (simplified) code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # text should only match within these positions
    my ($start_boundary, $end_boundary) = (11, 30);
    my $regex = qr/hello world/i;

    while (<DATA>) {
        # first method
        my $valid_data = substr($_, $start_boundary,
                                $end_boundary - $start_boundary + 1);
        if ( $valid_data =~ $regex ) {
            printf "1) %d %d '%s'\n", $-[0], $+[0],
                substr($_, $-[0] + $start_boundary, $+[0] - $-[0]);
        }

        # second method
        if ( m/$regex/ ) {
            if ( $-[0] > $start_boundary && $+[0] < $end_boundary ) {
                printf "2) %d %d '%s'\n", $-[0], $+[0],
                    substr($_, $-[0], $+[0] - $-[0]);
            }
        }
    }
    __DATA__
    some meaningful text containing "hello world" and more
    hello world should not match here
    in this row Hello World could match
    also here my HELLO world has a chance
    here hello world should be skipped

I want to match the regex (in the real case, it is much more complex than this) only between positions 11 and 30 in the source string.

Both methods that I have found work, i.e. they find the right text. However, the first method needs too much calculation, while the second method will apply the regex to all the lines, and only a subsequent filter will find out if it was a right match.

So, the questions are:

TIA

Update: Code in first method fixed. Thanks to Roger.

Code in second method is wrong, as ikegami and japhy noted.

Re: Applying a regex to part of a string
by davidrw (Prior) on Aug 18, 2005 at 14:04 UTC
    Roger's solution is clearly better than what I'm about to suggest, but you can also limit the positions within the regex itself. In this case, something like /^.{10,20}hello world/si would do: for 'hello world' to fall within the first 30 characters, it must have no more than 20 characters preceding it, and there must be at least 10 before it (I might have miscounted by one).

      I like it.

      This one seems to have the least side effects, i.e. I don't have to extract anything before, and the regex will take care of failing when the pattern is not in the wanted position.

      So simple, but I did not think about it!

      Thanks.
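As a hedged sketch of the quantifier approach above, the prefix bounds can be derived from the window and the pattern length instead of being counted by hand. It assumes the window is the one from the original question (offsets 11 to 30, 0-based and inclusive, the same span the substr method extracts) and that the pattern has a fixed length, which the real, more complex regex may not have. Under those assumptions the lower bound works out to 11 rather than 10. The names `$literal`, `$min_prefix`, `$max_prefix`, `$windowed` and `in_window` are hypothetical, not from the thread.

```perl
use strict;
use warnings;

# Window from the original question: offsets 11..30, 0-based, inclusive.
my ($start_boundary, $end_boundary) = (11, 30);

# Assumption: a fixed-length literal. A variable-length pattern would need
# a different upper bound or an extra end-position check.
my $literal = 'hello world';

# Hypothetical helper variables (not from the thread).
my $min_prefix = $start_boundary;                        # match may start at 11
my $max_prefix = $end_boundary - length($literal) + 1;   # last start that still ends by 30
my $windowed   = qr/^.{$min_prefix,$max_prefix}\Q$literal\E/si;

# Returns 1 if the literal occurs entirely inside the window, else 0.
sub in_window {
    my ($line) = @_;
    return $line =~ $windowed ? 1 : 0;
}

for my $line (
    'in this row Hello World could match',
    'here hello world should be skipped',
    'hello world ....hello world',
) {
    printf "%-40s %s\n", $line, in_window($line) ? 'match' : 'no match';
}
```

Because the quantifier can backtrack over any prefix length in its range, an earlier out-of-window occurrence does not hide a later in-window one.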

Re: Applying a regex to part of a string
by japhy (Canon) on Aug 18, 2005 at 16:09 UTC
    Your second way doesn't work if there's a bad match found before a good match can be found. You could embed some logic into the regex, but I'd really probably just use that first method of yours. Here's the steroid-enhanced way:
    my ($START, $END) = (11, 30);
    if ($string =~ /^\C{$START,$END}?($pattern)(?(?{ $+[0] < $END })|(?!))/) {
        # it's ok
    }
    It matches (in the test case) between 11 and 30 characters at the beginning of the string, and then tries matching the pattern. If, after it matches the pattern, $+[0] is still less than $END, then it succeeds; otherwise, it backtracks.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

      This is something I need to study more thoroughly. At the moment, I am not sure I fully understand the details of what you propose, although I can see how it works in principle.

      Thanks

Re: Applying a regex to part of a string
by ikegami (Patriarch) on Aug 18, 2005 at 14:24 UTC

    Since substr returns an lvalue when fewer than four arguments are provided, you can do substitutions on portions of the string.

    my $str = 'abracadabra';
    substr($str, 3, 5) =~ s/a/@/g;
    print("$str\n"); # abr@c@d@bra

    I realize this is off-topic, but it's a neat trick that's somewhat related.

Re: Applying a regex to part of a string
by Roger (Parson) on Aug 18, 2005 at 13:56 UTC
    substr($str,$start,$end-$start+1) =~ m/$regex/

      That's already what I am doing in my first method. I was using a variable to make the example code look better. I should have mentioned that in real world cases I apply "substr" to the original string on the fly.

      Thanks anyway.

        Your substr is missing the +1 bit. It would only match between character positions 11 and 29, not 30.
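The off-by-one is easy to check with a string whose characters encode their own offsets. This is only an illustrative sketch: `$row`, `$without` and `$with` are made-up names, and the boundaries are the ones from the question.

```perl
use strict;
use warnings;

# Each character is its own offset modulo 10, so offsets are easy to read off.
my $row = join '', map { $_ % 10 } 0 .. 39;
my ($start, $end) = (11, 30);

# The third argument of substr is a length, not an end position.
my $without = substr($row, $start, $end - $start);       # covers offsets 11..29
my $with    = substr($row, $start, $end - $start + 1);   # covers offsets 11..30

printf "%2d chars: %s\n", length $without, $without;
printf "%2d chars: %s\n", length $with,    $with;
```

The second call returns one more character, the one at offset 30, which is what an inclusive end boundary requires.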

Re: Applying a regex to part of a string
by ikegami (Patriarch) on Aug 18, 2005 at 14:28 UTC
    Your second match doesn't work for the string "hello world ....hello world". It'll match the first "hello world" and decide it's out of bounds, without seeing the second "hello world" within the bounds.

      I knew that my second method looked fishy, and now I know why: it's wrong! Good catch. Thanks.
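One possible repair of the second method, offered as a sketch and not as anything proposed in the thread, is to start the search at the window's first offset with pos() and walk the matches with /g until one ends inside the window. The function name `match_in_window` is hypothetical, and the boundaries are assumed to be 0-based and inclusive, as in the substr method.

```perl
use strict;
use warnings;

# Returns ($match_start, $match_end) for the first match lying entirely
# inside offsets $start..$end (0-based, inclusive), or an empty list.
# $match_end is the offset just past the match, like $+[0].
sub match_in_window {
    my ($line, $regex, $start, $end) = @_;
    return if length($line) < $start;    # window starts past the end of the line

    pos($line) = $start;                 # the /g search begins here
    while ($line =~ /$regex/g) {
        last if $-[0] > $end;            # already past the window
        return ($-[0], $+[0]) if $+[0] <= $end + 1;
    }
    return;
}

my $regex = qr/hello world/i;
for my $line (
    'hello world ....hello world',
    'in this row Hello World could match',
    'here hello world should be skipped',
) {
    my @span = match_in_window($line, $regex, 11, 30);
    printf "%-40s %s\n", $line, @span ? "match at @span" : 'no match';
}
```

A caveat: /g does not report overlapping matches, so with a variable-length pattern a long match that spills past the window could still hide a shorter one that would have fit.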