
Re: New regex trick...

by erikharrison (Deacon)
on Jul 22, 2002 at 19:33 UTC ( [id://184195]=note )


in reply to New regex trick...

I understand (I think) why this is fast. What I don't understand is how it works. I'll admit that I am not much of a regex hacker, but how does the variable-length lookbehind know how far to look in the case of quantifiers, especially greedy ones? It seems that the goal of the anchor is to prevent backtracking (hence the speed gains), but how does it know when to stop? If the \K anchor tells the regex engine to watch for the oncoming regex and stop matching there, what's the difference between \K and a minimal match or a negated character class? If the engine doesn't stop at the first instance of, say, '.' (as in your example), how do you keep it from backtracking (which is where I'm presuming the speed gains come from)? Or is \K an optimization using sexegers?

Now, japhy, I have absolute faith in your pattern-foo, so I take it on faith that this works. What I'm curious about is how. Am I missing something in thinking that the performance gains here are related to backtracking?

Cheers,
Erik

Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchett

Re: Re: New regex trick...
by japhy (Canon) on Jul 22, 2002 at 19:56 UTC
    You're thinking too hard. I cheated. My patch just fools the regex engine into thinking it hasn't actually started matching yet. Here's a drawn-out example.
    $str = "abc.def.ghi.jkl";
    $str =~ s{
      .*     # match as much as you can
      \K     # and then pretend HERE is where we start
      \. .*  # then a . and anything else
    }{}x;    # replace with nothing

    __END__
    abc.def.ghi.jkl        $&
    AAAAAAAAAAA      .*    "abc.def.ghi"
                     \K    ""
               B     \.    "."
               BCCC  .*    ".jkl"
    Does that help you see what I do? My patch consists of a couple lines of support, but this is the beef:
    case KEEP:
        PL_regstartp[0] = locinput - PL_bostr;
        break;
    That's what happens when the regex engine encounters the \K. The rest of the patch is just creating the "KEEP" node, and telling toke.c that "\K" is a valid escape sequence.
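A minimal sketch of the drawn-out example above, assuming a Perl where \K is available without a patch (it has been built in since 5.10). The variable names `$matched`, `$start`, and `$trimmed` are illustrative and not from the thread. It inspects `$&` and `$-[0]` to show that only the reported start of the match moves:

```perl
use strict;
use warnings;

my $str = "abc.def.ghi.jkl";

# Match only, to inspect what the engine reports as "the match".
$str =~ / .* \K \. .* /x or die "no match";
my $matched = $&;       # text from the \K position onward
my $start   = $-[0];    # offset where the reported match begins

# Same pattern as a substitution: only the part after \K is replaced.
(my $trimmed = $str) =~ s{ .* \K \. .* }{}x;

print "matched: [$matched] at offset $start\n";
print "trimmed: [$trimmed]\n";
```

Everything the first .* consumed is still consumed by the engine; it is simply left out of `$&`, so the substitution never touches it.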

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      Okay, I know it's because I'm dumb, but I still don't get it. Please don't yell at me, but why does the \K anchor keep .* from matching .jkl? And if it backtracks like normal, then where does the speed come from? I think that may be the essence of my confusion - why is this faster?

      If I don't get it this time, I'll give up and just trust it ;-).

      Cheers,
      Erik

      Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchett

        Oh, \K doesn't stop .* from matching the entire string. Perl is smart enough to back off to the last "." when the \. node comes up.

        What \K is doing is faking WHERE in the string (and the pattern) the regex started to match. Compare:

        $str = "Match 9 the 1 last 6 digit 2 blah";
        $str =~ /.*\d/;
        print "[$`] [$&] [$']\n";
        $str =~ /.*\K\d/;
        print "[$`] [$&] [$']\n";
        __END__
        [] [Match 9 the 1 last 6 digit 2] [ blah]
        [Match 9 the 1 last 6 digit ] [2] [ blah]
        See, \K tells $& that THIS is where it begins. This is useful in substitutions:
        # you go from this:
        s/(saveme)deleteme/$1/;
        # to this:
        s/saveme\Kdeleteme//;
        And you save the time otherwise spent replacing "saveme" with itself.
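As a quick check that the two substitution forms are interchangeable, here is a small sketch, again assuming a Perl with \K built in (5.10 or later). The sample string and the names `$input`, `$with_capture`, and `$with_keep` are made up for illustration:

```perl
use strict;
use warnings;

my $input = "xx savemedeleteme yy";

# Capture form: "saveme" is matched, captured, and written back.
(my $with_capture = $input) =~ s/(saveme)deleteme/$1/;

# \K form: "saveme" is matched but left out of the replaced span.
(my $with_keep = $input) =~ s/saveme\Kdeleteme//;

print "capture: [$with_capture]\n";
print "keep:    [$with_keep]\n";
```

Both forms should leave the same string behind. To measure any speed difference on your own data, the core Benchmark module's `cmpthese` can time the two; no timings are claimed here.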

        _____________________________________________________
        Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        Erik, the bit about .* not matching .jkl is just plain old regex engine rules. That is to say, the match wouldn't succeed unless the last literal period (\.) is followed by 0 or more things (.*), eh? I'm assuming you're talking about the first .*, not the second one.
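To see the "plain old regex engine rules" separately from \K, this sketch (assuming Perl 5.10 or later for the \K line; variable names are illustrative) compares a greedy match, a minimal match, and the \K form on the same string:

```perl
use strict;
use warnings;

my $str = "abc.def.ghi.jkl";

my ($greedy) = $str =~ /(.*)\./;     # greedy: backs off to the LAST period
my ($lazy)   = $str =~ /(.*?)\./;    # minimal: stops at the FIRST period

# With \K the engine backtracks exactly as in the greedy case;
# only the reported start of the match moves.
$str =~ /.*\K\./ or die "no match";
my $keep_start = $-[0];              # offset of the period that matched
my $keep_text  = $&;                 # just the period itself

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
print "keep:   [$keep_text] at offset $keep_start\n";
```

The \K form lands on the same period as the greedy match, which suggests that \K changes what is reported as matched, not how the engine searches.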
        Paul

        When there is no wind, row.
