eweaverp has asked for the wisdom of the Perl Monks concerning the following question:

Hola Monks...

I am very new to Perl... I need to search an HTML page returned as a string by the LWP module (are there other ways to have it returned, by the way?) for the following data:
<!--QBlastInfoBegin RID = 1055976860-01972-17207 RTOE = 7 QBlastInfoEnd -->
and extract the two numbers into two scalars. What's the obvious way to do this?

Thanks...
~evan

Re: html page search/parse
by BrowserUk (Patriarch) on Jun 18, 2003 at 23:22 UTC

    If you need to extract anything more than this simple snippet, you might want to look into the various HTML::* modules, but for just this, something like

    my $re = qr[<!--QBlastInfoBegin \s+ RID = ([\d-]+) \s+ RTOE = (\d+) \s+ QBlastInfoEnd \s+ -->]x;
    my( $RID, $RTOE ) = $html =~ $re;

    might come close. (Note: Untested).


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      Sorry to sound dumb, but...

      ...it doesn't work. The two values are empty. What exactly is going on in this regex? If somebody could explain it bit by bit I could probably figure it out.

      Thanks!
      ~evan

      PS. What are the +'s about? Are they supposed to be there?

        Sorry, my bad. Try this.

        #! perl -slw
        use strict;

        my $re = qr[
            <!--QBlastInfoBegin   # Match the start of comment
            \s+                   # 1 or more whitespace including newlines
            RID                   # 'RID' literal
            \s+                   # One or more whitespace
            =                     # '='
            \s+                   # more whitespace
            (                     # start capturing to $1
                [\d-]+            # 1 or more '0-9' or '-'
            )                     # end capture
            \s+                   # yet more whitespace
            RTOE                  # 'RTOE' literal
            \s+                   # And more whitespace
            =                     # '=' literal
            \s+                   # more
            (                     # start capture to $2
                \d+               # 1 or more digits
            )                     # end capture
            \s+                   # more whitespace
            QBlastInfoEnd         # the end token
            \s+                   # final whitespace (including newlines)
            -->                   # The end comment card
        ]x;                       # Ignore incidental spacing and comments in regex.

        my $html = do{ local $/; <DATA> };   # Grab the data from <DATA> into a string

        my( $RID, $RTOE ) = $html =~ $re;    # Execute the regex and assign the captures to variables.

        print "RID:$RID RTOE:$RTOE";         # Print the results.

        __DATA__
        <!--QBlastInfoBegin RID = 1055976860-01972-17207 RTOE = 7 QBlastInfoEnd -->

        Without the verbose commenting, the (now tested and working) regex looks like this

        my $re = qr[ <!--QBlastInfoBegin \s+ RID \s+ = \s+ ( [\d-]+ ) \s+ RTOE \s+ = \s+ ( \d+ ) \s+ QBlastInfoEnd \s+ --> ]x;

        The +'s mean match 1 or more of the preceding element. See perlre and perlretut for more.
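To make the role of the quantifiers concrete, here is a minimal sketch that is not part of the original replies. The variable names and the test string 'RID=123-45' are made up for illustration, and only core Perl is assumed. It shows two things: under the x flag a literal space in the pattern is ignored, and \s+ requires at least one whitespace character while \s* also accepts none.

```perl
use strict;
use warnings;

# Sample comment copied from the question.
my $html = '<!--QBlastInfoBegin RID = 1055976860-01972-17207 RTOE = 7 QBlastInfoEnd -->';

# Under /x the spaces around '=' in the pattern are ignored, so this
# effectively asks for "RID=" and does not match "RID = ...".
my $literal_space = $html =~ /RID = ([\d-]+)/x ? 1 : 0;

# \s+ demands at least one whitespace character on each side of '='.
my ($rid_plus) = $html =~ /RID \s+ = \s+ ([\d-]+)/x;

# \s* also accepts zero whitespace, so a tight "RID=..." matches too,
# whereas \s+ rejects it.
my $tight = 'RID=123-45';    # hypothetical input, for illustration only
my ($rid_star) = $tight =~ /RID \s* = \s* ([\d-]+)/x;
my $plus_on_tight = $tight =~ /RID \s+ = \s+ ([\d-]+)/x ? 1 : 0;

print "literal space under /x matched: $literal_space\n";
print "captured with \\s+: $rid_plus\n";
print "captured with \\s* on '$tight': $rid_star\n";
print "\\s+ matched '$tight': $plus_on_tight\n";
```

Whether \s+ or \s* is the better choice depends on how strictly the page formats the comment, which the thread does not say.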




        He didn't allow for spaces around the two equals signs. Try this version:
        my $re = qr[<!--QBlastInfoBegin \s+ RID \s* = \s* ([\d-]+) \s+ RTOE \s* = \s* (\d+) \s+ QBlastInfoEnd \s+ -->]x;
        As for the pluses, they are quantifiers: each one makes the preceding element match one or more times (one or more whitespace characters, in this case). See perldoc perlre for more info.
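Putting the pieces of the thread together, the extraction could be wrapped up roughly as follows. This is only a sketch: the helper name extract_qblast_info, the named captures (which need Perl 5.10 or later) and the multi-line sample input are assumptions for illustration, not something the posters wrote. It returns an empty list when the comment is missing, so the caller can tell a failed match apart from real values.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns (RID, RTOE) on success
# and an empty list when the QBlastInfo comment is not found.
sub extract_qblast_info {
    my ($html) = @_;
    my $re = qr{
        <!--QBlastInfoBegin \s+
        RID  \s* = \s* (?<rid>  [\d-]+ ) \s+
        RTOE \s* = \s* (?<rtoe> \d+    ) \s+
        QBlastInfoEnd \s+
        -->
    }x;
    return $html =~ $re ? ( $+{rid}, $+{rtoe} ) : ();
}

# Made-up page text; \s also matches the newlines inside the comment.
my $html = "<html>\n<!--QBlastInfoBegin\n  RID = 1055976860-01972-17207\n"
         . "  RTOE = 7\nQBlastInfoEnd\n-->\n</html>";

if ( my ( $rid, $rtoe ) = extract_qblast_info($html) ) {
    print "RID:$rid RTOE:$rtoe\n";
}
else {
    warn "QBlastInfo comment not found\n";
}
```

Checking the result of the match before using the captures avoids the silent "two values are empty" situation described earlier in the thread.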