html page search/parse

eweaverp has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: html page search/parse by BrowserUk (Patriarch) on Jun 18, 2003 at 23:22 UTC
If you need to extract anything more than this simple snippet, you might want to look into the various HTML::* modules, but for just this, something like `my $re = qr[<!--QBlastInfoBegin \s+ RID = ([\d-]+) \s+ RTOE = (\d+) \s+ QBlastInfoEnd \s+ -->]x; my( $RID, $RTOE ) = $html =~ $re;` might come close. (Note: Untested.)

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re: Re: html page search/parse by eweaverp (Scribe) on Jun 18, 2003 at 23:30 UTC
Sorry to sound dumb, but... ...it doesn't work. The two values are empty. What exactly is going on in this regex expression? If somebody could explain it bit by bit I could probably figure it out. Thanks! ~evan PS. What are the +'s about? Are they supposed to be there?
Re: Re: Re: html page search/parse by BrowserUk (Patriarch) on Jun 19, 2003 at 00:09 UTC
Sorry, my bad. Try this.

    #! perl -slw
    use strict;

    my $re = qr[
        <!--QBlastInfoBegin   # Match the start of comment
        \s+                   # 1 or more whitespace including newlines
        RID                   # 'RID' literal
        \s+                   # One or more whitespace
        =                     # '='
        \s+                   # more whitespace
        (                     # start capturing to $1
            [\d-]+            # 1 or more '0-9' or '-'
        )                     # end capture
        \s+                   # yet more whitespace
        RTOE                  # 'RTOE' literal
        \s+                   # And more whitespace
        =                     # '=' literal
        \s+                   # more
        (                     # start capture to $2
            \d+               # 1 or more digits
        )                     # end capture
        \s+                   # more whitespace
        QBlastInfoEnd         # the end token
        \s+                   # final whitespace (including newlines)
        -->                   # The end comment card
    ]x;                       # Ignore incidental spacing and comments in regex.

    my $html = do{ local $/; <DATA> };  # Grab the data from <DATA> into a string
    my( $RID, $RTOE ) = $html =~ $re;   # Execute the regex and assign the captures to variables.
    print "RID:$RID RTOE:$RTOE";        # Print the results.

    __DATA__
    <!--QBlastInfoBegin
        RID = 1055976860-01972-17207
        RTOE = 7
    QBlastInfoEnd
    -->

Without the verbose commenting, the (now tested and working) regex looks like this: `my $re = qr[ <!--QBlastInfoBegin \s+ RID \s+ = \s+ ( [\d-]+ ) \s+ RTOE \s+ = \s+ ( \d+ ) \s+ QBlastInfoEnd \s+ --> ]x;`

The +'s mean match 1 or more of the preceding element. See perlre and perlretut for more.
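For anyone wondering why the first attempt came back empty, here is a minimal sketch comparing the two patterns on the same input. Under the /x modifier, literal spaces in the pattern are ignored, so `RID = (...)` only matches `RID=` with nothing around the equals sign. The `extract` helper and the line layout of the sample are assumptions made for illustration, not part of the original posts.

```perl
use strict;
use warnings;

# Sample comment block using the values from the thread; the exact
# line layout is an assumption.
my $html = <<'HTML';
<!--QBlastInfoBegin
    RID = 1055976860-01972-17207
    RTOE = 7
QBlastInfoEnd
-->
HTML

# First attempt: /x discards the spaces around '=', so this
# effectively requires 'RID=' and 'RTOE=' with no whitespace.
my $first = qr[<!--QBlastInfoBegin \s+ RID = ([\d-]+) \s+ RTOE = (\d+) \s+ QBlastInfoEnd \s+ -->]x;

# Corrected: whitespace in the input is matched explicitly with \s+.
my $fixed = qr[<!--QBlastInfoBegin \s+ RID \s+ = \s+ ([\d-]+) \s+ RTOE \s+ = \s+ (\d+) \s+ QBlastInfoEnd \s+ -->]x;

# Hypothetical helper: call it in list context to get the captures,
# or an empty list when the pattern does not match.
sub extract {
    my ( $re, $text ) = @_;
    return $text =~ $re;
}

my @miss = extract( $first, $html );
my @hit  = extract( $fixed, $html );

print 'first attempt captured ', scalar(@miss), " values\n";
print "RID:$hit[0] RTOE:$hit[1]\n";
```

The first pattern captures nothing, which is the "two values are empty" symptom; the second captures both fields.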
Re: Re: Re: html page search/parse by The Mad Hatter (Priest) on Jun 18, 2003 at 23:45 UTC
He didn't allow for whitespace around the two equals signs. Try this version: `my $re = qr[<!--QBlastInfoBegin \s+ RID \s* = \s* ([\d-]+) \s+ RTOE \s* = \s* (\d+) \s+ QBlastInfoEnd \s+ -->]x;` As for the pluses, they are quantifiers and make the expression match one or more of whatever precedes them (whitespace, in this case). See `perldoc perlre` for more info.
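As a rough illustration of what `\s*` buys over `\s+` here, the sketch below runs the pattern above against two inputs: one with spaces around the equals signs and one without. The compact sample is a made-up variant for illustration; only the spaced form appears in the thread.

```perl
use strict;
use warnings;

# \s* means "zero or more whitespace", so '=' may or may not be padded.
my $re = qr[<!--QBlastInfoBegin \s+ RID \s* = \s* ([\d-]+) \s+ RTOE \s* = \s* (\d+) \s+ QBlastInfoEnd \s+ -->]x;

# Hypothetical sample inputs: same values, different spacing around '='.
my %samples = (
    spaced  => "<!--QBlastInfoBegin\n RID = 1055976860-01972-17207\n RTOE = 7\nQBlastInfoEnd\n-->",
    compact => "<!--QBlastInfoBegin\n RID=1055976860-01972-17207\n RTOE=7\nQBlastInfoEnd\n-->",
);

my %result;
for my $name ( sort keys %samples ) {
    my ( $rid, $rtoe ) = $samples{$name} =~ $re;

    # A failed match leaves both variables undefined, so check before use.
    $result{$name} = defined $rid ? "RID:$rid RTOE:$rtoe" : 'no match';
    print "$name: $result{$name}\n";
}
```

Checking `defined $rid` before printing makes a failed match visible instead of silently producing empty values.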