in reply to Re: Parsing out URLs with regex
in thread Parsing out URLs with regex

Agreed. When working on code for heavy use, don't reinvent the wheel.

For learning purposes, though, you don't want (\w*), which matches the longest run of consecutive word characters. You want (.*?), which matches the fewest characters that can be followed by the closing quote.
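
A minimal sketch of the difference, assuming a made-up tag (the sample string and the variable names are mine, not from the original question):

```perl
use strict;
use warnings;

# Hypothetical sample input; the URL is made up for illustration.
my $tag = '<a href="http://example.com/a-b.html">';

# (\w*) stops at the first non-word character (the ':' here), so the
# closing quote can never follow it and the whole match fails.
my ($word) = $tag =~ /href="(\w*)"/;

# (.*?) takes as few characters as possible before the closing quote.
my ($lazy) = $tag =~ /href="(.*?)"/;

print defined $word ? "word: $word\n" : "word: no match\n";
print "lazy: $lazy\n";
```

With this input the \w* version should find nothing at all, while the .*? version captures the whole URL.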

--
[ e d @ h a l l e y . c c ]

Re^3: Parsing out URLs with regex (diedotstar)
by tye (Sage) on May 14, 2003 at 19:46 UTC

    Actually, this is a good example of when .*? is not the best choice. [^"]* is a much better idea. You don't want to run into this problem:

    $page= '<a href="foo">...' . '<a href="bar" title="baz"><b>Click Here';
    $page =~ /<a href="(.*?)" title="(.*?)"><b>Click Here/i;
    where $1 will contain 'foo">...<a href="bar'.
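
    A sketch of the same match with [^"]* swapped in, next to the .*? version for comparison (the variable names other than $page are only for illustration):

    ```perl
    use strict;
    use warnings;

    # Same input as in the example above.
    my $page = '<a href="foo">...' . '<a href="bar" title="baz"><b>Click Here';

    # .*? happily crosses the first tag's closing quote to reach ' title='.
    my ($lazy_href) =
        $page =~ /<a href="(.*?)" title="(.*?)"><b>Click Here/i;

    # [^"]* cannot cross a double quote, so each capture stays inside
    # a single attribute value.
    my ($href, $title) =
        $page =~ /<a href="([^"]*)" title="([^"]*)"><b>Click Here/i;

    print "lazy href: $lazy_href\n";
    print "href=$href title=$title\n";
    ```

    Here the negated character class should force the engine to give up on the first <a> and retry at the second one, which is the link that was wanted.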

                    - tye

      Whups, didn't see the mandatory title="" in the match. Jumped the gun.

      --
      [ e d @ h a l l e y . c c ]

        Note that the mandatory title isn't really the problem. Putting nearly anything before or after the .*? in a regex can cause you problems. Even just

        /<a href="(.*?)">/i
        will match way too much by matching way too early against
        '<a href="oops" lots of stuff <a href="ok">'
        '<a href="oops" > d'oh! whitespace! <a href="ok">'
        '<a href="oops, break a browser? <a href="ok">'
        (:
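
        A rough sketch that runs both patterns over those three strings. The q{} quoting and the array names are my own additions; q{} is there only so the apostrophe in "d'oh" doesn't end the string.

        ```perl
        use strict;
        use warnings;

        # The three inputs from above, quoted with q{} because one of
        # them contains an apostrophe.
        my @pages = (
            q{<a href="oops" lots of stuff <a href="ok">},
            q{<a href="oops" > d'oh! whitespace! <a href="ok">},
            q{<a href="oops, break a browser? <a href="ok">},
        );

        my (@lazy, @strict);
        for my $page (@pages) {
            # .*? starts at the first <a href=" and stretches until it
            # finally sees the two characters "> somewhere later.
            my ($l) = $page =~ /<a href="(.*?)">/i;

            # [^"]* cannot step over a quote, so the first tag fails and
            # the engine retries at the second one.
            my ($s) = $page =~ /<a href="([^"]*)">/i;

            push @lazy,   $l;
            push @strict, $s;
            print "lazy:   $l\n";
            print "strict: $s\n";
        }
        ```

        Neither pattern is a real HTML parser, so for heavy use the advice at the top of the thread still applies.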

                        - tye