http://qs1969.pair.com?node_id=507076

jithoosin has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,
my $line= "onmouseout=\"cs()\">Otani <b>Restaurant</b> &amp; <b>Sushi</b> Bar: <b>Columbus</b> <b>Ohio</b> DiningGuide \ <b>Restaurant</b>...</a>csdkjfhasdkjghlkjhdfgkj";
and I want to extract everything from Otani up to </a>, excluding </a> itself, without using escape characters. So I used \Q and \E like this: $line =~ m@\Qonmouseout="cs()">(.*)</a>\E@; Then I printed $1, but it did not work. Why doesn't it work? Is there any other way to do this without escape characters?

Replies are listed 'Best First'.
Re: simple regEx
by marto (Cardinal) on Nov 09, 2005 at 14:27 UTC
    Hi jithoosin,

    Rather than extracting this data from HTML via a regex you may want to use the HTML::TokeParser module to return the data you wish from the HTML. If you do a Super Search on this topic there are a lot of posts to back the use of this module rather than using regex.

    Hope this helps.

    Martin
Re: simple regEx
by Corion (Patriarch) on Nov 09, 2005 at 14:22 UTC

    Your use of \Q...\E is too general - it also escapes the parentheses you wanted to use for capturing and your wildcards. Try it with:

    $line =~ m@\Qonmouseout="cs()">\E(.*)\Q</a>\E@;

    Update: After looking through ww's analysis and seeing that my code doesn't work - you also need the /s switch to make . match newlines, as your string contains newlines:

    $line =~ m@\Qonmouseout="cs()">\E(.*)\Q</a>\E@s;
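
A minimal, self-contained sketch of the fix above, for anyone who wants to run it. The sample string is modeled on the one in the question; the embedded newline and the "trailing text" suffix are assumptions added for illustration, since the exact line breaks of the original string are not shown.

```perl
use strict;
use warnings;

# Sample input modeled on the question. The "\n" is an assumption, based on
# the remark that the original string contained newlines.
my $line = qq{onmouseout="cs()">Otani <b>Restaurant</b> &amp; <b>Sushi</b> Bar:\n}
         . qq{<b>Columbus</b> <b>Ohio</b> DiningGuide <b>Restaurant</b>...</a>trailing text};

# \Q...\E wraps only the two literal delimiters. The (.*) sits outside the
# quoted regions, so it is still a capture group, and /s lets . match "\n".
my ($captured) = $line =~ m@\Qonmouseout="cs()">\E(.*)\Q</a>\E@s;

print defined $captured ? "$captured\n" : "no match\n";
```

Note that (.*) is greedy, so with several </a> tags in the input it would run to the last one; (.*?) would stop at the first.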
Re: simple regEx
by tphyahoo (Vicar) on Nov 09, 2005 at 14:50 UTC
    I agree with marto that when parsing HTML it's better to use one of the HTML::* modules than a regex. However, I usually use HTML::TreeBuilder rather than TokeParser. We all have our favorite ways.

    A good discussion of various ways to parse html, from someone who actually prefers doing it with a regex and disses all us HTML::*ers is at Being a heretic and going against the party line.. Good luck!

Re: simple regEx
by ww (Archbishop) on Nov 09, 2005 at 15:01 UTC
    As Corion pointed out, (.*) does not function as a capture inside \Q...\E.... but note, also, that if it did, so too would the empty pair of parentheses immediately after "cs". AFAIK (and as far as I can tell from some limited experimentation after reading your post), the regex engine does not squawk (OUT LOUD!) about an empty capture spec.
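
A rough sketch of both points, using the core quotemeta function, which applies the same escaping as \Q...\E. The variable names and the shortened target string are made up for illustration only.

```perl
use strict;
use warnings;

# quotemeta() backslashes every ASCII non-word character, just as \Q...\E
# does inside a pattern, so "(.*)" turns into literal text.
my $quoted = quotemeta('onmouseout="cs()">(.*)</a>');
print "$quoted\n";

# Hypothetical shortened target, standing in for the string in the question.
my $target = 'onmouseout="cs()">Otani</a>';

# Fully quoted pattern: looks for a literal "(.*)" in the target and fails.
my $quoted_matches = $target =~ /$quoted/ ? 1 : 0;

# Unquoted pattern: "cs()" now means "cs" plus an empty capture group, so the
# literal parentheses in the target are never matched and this fails as well.
my $unquoted_matches = $target =~ m@onmouseout="cs()">(.*)</a>@ ? 1 : 0;

# An empty group is accepted silently and captures the empty string.
my ($empty) = 'cs' =~ /cs()/;

print "quoted: $quoted_matches, unquoted: $unquoted_matches, ",
      "empty capture length: ", length($empty), "\n";
```

Neither extreme works here: quoting everything loses the capture, and quoting nothing turns the literal () into a group.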

    What's more, you are already using escape characters -- in the string assigned to $line. Have you carefully considered your requirement that the regex not use the same tool?

    Updated with code and output after some testing while *STILL* making myself crazed: Anyone see a problem here?
    (and, please, find no fault with those who cast 3 upvotes, before I committed this [questionable] update)