silent11 has asked for the wisdom of the Perl Monks concerning the following question:

I am using LWP::Simple to grab a webpage's HTML into a variable, and I am running a regex against it to match what is between <!-- resultItemStart --> and <!-- resultItemEnd -->. I am a super beginner when it comes to Perl and regexes. Here is my code:
use CGI qw/:standard/;
use LWP::Simple;
print header,start_html('My Dictionary'),start_form,"Enter a word:",textfield('word'),p,submit,end_form,hr;
$engine = "http://www.dictionary.com/cgi-bin/dict.pl?term=";
$word = param('word');
$url = "$engine$word";
$string = get($url);
$start = '<!-- resultItemStart -->';
$end = '<!-- resultItemEnd -->';
$string = m/($start)(*)($end)/;
print "$2";
I know I should be using a module for this, but I wanna do it on my own.

Re: pattern matching
by demerphq (Chancellor) on Jan 16, 2002 at 22:31 UTC
    I know I should be using a module for this, but I wanna do it on my own.

    While I entirely sympathise with the sentiment (ie you wanna use it as a vehicle to learn) I (and no doubt the vast bulk of the monastery) would strongly advise you not to.

    The reason is quite simply that parsing HTML is a non-trivial act. Further, there are a variety of powerful modules available to you to do this, all of which will require you to learn stuff. Stuff which will in the long term be of far more use than learning how to parse HTML.

    But in the interest of fair play I will say that

    my $start = quotemeta '<!-- resultItemStart -->';
    my $end = quotemeta '<!-- resultItemEnd -->';
    $string =~ m/$start(.*?)$end/s;
    Might do what you want if the HTML is very simple.
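A minimal sketch of the same idea under strict and warnings, for illustration only. The helper name `between` and the sample HTML are made up here, not taken from the original post; the delimiters are the two comments from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between two literal delimiters,
# or undef when they are not both found.
sub between {
    my ($text, $start, $end) = @_;
    # quotemeta escapes any regex metacharacters in the delimiters,
    # so they are matched as plain text.
    my ($qs, $qe) = (quotemeta($start), quotemeta($end));
    return $text =~ m/$qs(.*?)$qe/s ? $1 : undef;
}

# Made-up sample page; a real one would come from LWP::Simple's get().
my $html = "<p>intro</p><!-- resultItemStart -->\n<b>word</b>: a definition\n"
         . "<!-- resultItemEnd --><p>footer</p>";

my $item = between($html, '<!-- resultItemStart -->', '<!-- resultItemEnd -->');
print defined $item ? "found: [$item]\n" : "no match\n";
```

Returning undef on failure keeps "no item on the page" distinct from "an item that happens to be empty".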

    Oh yes. For all intents and purposes you should assume that writing code not under strict and warnings is a crime and should be avoided at all costs...

    HTH

    UPDATE: Just realized I originally posted this with a dreaded .* which is a bad move. Take a look at Death to Dot Star! for why. However dereks's use of .+? actually isn't good either, as it will skip an empty "record" instead of reporting it. I believe that .*? is the correct choice here. Oh and I did not capture the $start and $end because you already know what they are, don't you? :-)

    Yves / DeMerphq
    --
    When to use Prototypes?

Re: pattern matching
by ViceRaid (Chaplain) on Jan 16, 2002 at 22:36 UTC
    Your LWP stuff looks fine, but you need to make a few changes to your regular expression syntax to make this work. This should get you started, but there's plenty of good documentation about using regular expressions about.
    1. you need to use the pattern-match operator =~, not just a plain = to match a $string; ie:
      $string =~ m/
    2. The * needs something in front of it to act on. A * is a quantifier: it tells Perl to match as much 'something' (where something is the thing preceding the *) as possible. To match as much of anything as possible, use .* ie:
      $string =~ m/($start)(.*)($end)/
    3. The page that's coming back runs over lots of lines. By default, the dot does not match a newline, so .* stops at the end of a line. Use the /s modifier at the end of your regular expression to make the dot match newlines too, so the match can run across line boundaries. ie:
      $string =~ m/($start)(.*)($end)/s
    4. Since the target webpage has got lots of pairs of matched comments, and because .* is 'greedy' (it takes as much stuff as possible while still matching a pattern), you'll get everything between the *first* $start and the *last* $end. That's loads of stuff. Use the ? (non-greedy) modifier on .* to get just the first result item. ie:
      $string =~ m/($start)(.*?)($end)/s
    5. You don't need to put brackets around $start and $end, because in your programme, you already know what's in there. Brackets capture the stuff that's matched between them and save them, which you don't want to do. ie:
      $string =~ m/$start(.*?)$end/s;
      $result = $1;
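The difference between steps 3 and 4 can be seen in a small sketch. The helper names and the two-item page are invented for illustration, and quotemeta is applied to the delimiters as a precaution.

```perl
use strict;
use warnings;

my $start = quotemeta '<!-- resultItemStart -->';
my $end   = quotemeta '<!-- resultItemEnd -->';

# Hypothetical helpers: the same pattern with a greedy and a non-greedy quantifier.
sub capture_greedy { my ($page) = @_; my ($got) = $page =~ m/$start(.*)$end/s;  return $got; }
sub capture_lazy   { my ($page) = @_; my ($got) = $page =~ m/$start(.*?)$end/s; return $got; }

# Made-up page holding two result items.
my $page = '<!-- resultItemStart -->first<!-- resultItemEnd --><hr>'
         . '<!-- resultItemStart -->second<!-- resultItemEnd -->';

# Greedy runs from the first start marker to the last end marker;
# non-greedy stops at the first end marker it can reach.
print "greedy:     ", capture_greedy($page), "\n";
print "non-greedy: ", capture_lazy($page), "\n";
```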
    A
Re: pattern matching
by dereks (Scribe) on Jan 16, 2002 at 22:18 UTC
    $string = m/($start)(*)($end)/;
    This won't do anything at all. For starters, you need to use =~ for matching and substitutions. The star (*) is a quantifier, and won't do anything unless you put it after something. In fact, it will give you an error the way you have it now, since it doesn't follow anything. Maybe something like this would be an improvement:
    $string =~ /$start(.*?)$end/s;
    As an aside, you probably want to start getting used to use strict (although it's not absolutely necessary in a script as small as the one you have here). Also, use CGI::Carp qw/fatalsToBrowser/ will send fatal messages to the browser, making it easier to debug.

    Update: match newlines and no case sensitivity. Thanks Juerd and cLive ;) I figured I would miss something! :)

    Update: as Yves pointed out, (.*?) would be the best way to go! Thanks, Yves.

    - Derek Soviak

      Nearly there:
      $string =~ /$start(.+?)$end/s;

      The case insensitivity isn't needed in this case, but slurping everything up as one line is.

      Then the match is stored in $1.

      You don't need the quotemeta in this case (as Yves points out below), but I guess it's good practice to get into until you know *all* regex meta characters.

      cLive ;-)

      The dot (.) matches anything under the sun; it's a wildcard.
      Your sun has no newlines :)

      To put newlines under the sun, use the /s modifier.
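A quick way to see this for yourself; the helper name and the sample text are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: does "start", then anything, then "end" match,
# with or without the /s modifier? Returns 1 or 0.
sub spans_lines {
    my ($text, $use_s) = @_;
    return $use_s ? ($text =~ /start.*end/s ? 1 : 0)
                  : ($text =~ /start.*end/  ? 1 : 0);
}

my $text = "start\nmiddle\nend";
print "without /s: ", spans_lines($text, 0), "\n";   # dot stops at the newline
print "with /s:    ", spans_lines($text, 1), "\n";   # dot crosses the newline
```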

      2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

Re: pattern matching
by andye (Curate) on Jan 16, 2002 at 22:39 UTC
    I know I should be using a module for this, but I wanna do it on my own.

    Good for you. I'd rewrite:

    $string = m/($start)(*)($end)/; print "$2";
    as
    $string =~ m/$start(.+?)$end/s; print $1;
    because
    • You use =~ for matching, not =
    • Anything in brackets gets 'captured' to $1, $2, etc. You don't need to capture $start and $end, so there's no need for brackets round them
    • To match a chunk of text that could be anything, I'd use a dot ('any character' when used with /s, otherwise 'any character except a newline') followed by a plus (match that character 1 or more times) followed by a question mark (match as few characters as possible: for more info see Death to Dot Star!)
    • /s at the end, to make the dot match the newline character
    • To print a variable, just do print $var - it doesn't need quotes round it.

    Hope that helps,
    andy.

      As I mention above in my update the .+? is wrong. Consider
      <!-- record --> Blah 1 <!-- eorecord -->
      <h1>Html crap</h1>
      <!-- record --><!-- eorecord -->
      <!-- record --> Blah 2 <!-- eorecord -->
      Using the .+? we will get only two records out of the file. Using .*? we would get the correct number, three.
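A sketch that counts the records both ways, using the sample data above joined onto one line. The helper names are invented here, and the /g modifier is added so that every record is collected rather than just the first.

```perl
use strict;
use warnings;

# Hypothetical helpers: collect every record body with .*? and with .+?
sub records_star { my ($text) = @_; my @r = $text =~ m/<!-- record -->(.*?)<!-- eorecord -->/gs; return @r; }
sub records_plus { my ($text) = @_; my @r = $text =~ m/<!-- record -->(.+?)<!-- eorecord -->/gs; return @r; }

my $data = '<!-- record --> Blah 1 <!-- eorecord --> <h1>Html crap</h1> '
         . '<!-- record --><!-- eorecord --> <!-- record --> Blah 2 <!-- eorecord -->';

my @star = records_star($data);
my @plus = records_plus($data);

print scalar(@star), " records with .*?\n";   # 3, the middle one empty
print scalar(@plus), " records with .+?\n";   # 2
```

With .+? the empty record is not merely skipped: because the quantifier must consume at least one character, it runs past the empty record's end marker, so the second capture also swallows the markers that follow it.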

      Still I think it's a bit funny (but not surprising) how similar our posts are... :-)

      Yves / DeMerphq
      --
      When to use Prototypes?

        Well, I guess you're right, if you'd rather have a successful (blank) pattern match than a failed match. And I guess the web page could conceivably contain a blank entry, so that makes sense.

        But in the original context, we're only trying to match one record anyway, so if the match fails then $1 will contain '', and if you use .*? and the match succeeds then $1 will contain '', so I'm not sure it makes any difference. ;)
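One way to tell the two cases apart, sketched under the assumption that the delimiters are the two comments from the question: test the result of the match itself rather than looking at $1. A failed match does not set $1; it stays undefined or keeps whatever an earlier successful match left there. The helper name is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first item, or undef if the page has none.
sub first_item {
    my ($text) = @_;
    if ($text =~ m/<!-- resultItemStart -->(.*?)<!-- resultItemEnd -->/s) {
        return $1;    # may be '' when the item is empty
    }
    return;           # no match at all
}

for my $page ('<!-- resultItemStart --><!-- resultItemEnd -->', 'no markers here') {
    my $item = first_item($page);
    print defined $item ? "matched: [$item]\n" : "no match\n";
}
```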

        andy.