Working with source of returned web page

clone4 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,
I've got this code, which returns the source of any web page:

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;

my $url = "http://www.google.com/";
print "$url\n";

# Identify as a desktop browser.
$ua->agent("Mozilla/5.0 (Windows; U; Windows NT 5.1; cs; rv:1.8.1.12) Gecko/2008020121 Firefox/2.0.0.12");

my $request = HTTP::Request->new('GET', $url);

# simple_request() sends the request once and does not follow redirects.
my $response = $ua->simple_request($request);

if ($response->is_success) {
    print $request->as_string();
    print $response->content;
}

But now I need to target a specific string (e.g. <hr> random number<hr>) and assign it to a variable.
Any ideas how to do that?

Thanks for any help at all


Replies are listed 'Best First'.
Re: Working with source of returned web page by pc88mxer (Vicar) on Jun 09, 2008 at 19:49 UTC
Is it that your web page has something like `<hr>34567<hr>` and you want to extract the string `34567`? If so, this is just a simple regex match:

if ($response->content =~ m/<hr>(.*?)<hr>/) {
    $matched_number = $1;
} else {
    # didn't find a match
}

If this isn't what you need, a concrete example of what you are looking for would help.
Re^2: Working with source of returned web page by tachyon-II (Chaplain) on Jun 09, 2008 at 20:59 UTC
When using regexes to "parse" HTML it is often useful to use a negated character class instead of `.*?`, as all the RE engine has to do is grab stuff up to the next tag. This will of course choke if closing tags are muddled. `m{<tag>([^<]*)</tag>}`
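To make the difference between the two patterns concrete, here is a small sketch. The sample HTML string is made up for illustration (it is not from the original page), and it assumes the number sits on its own line between the two `<hr>` tags. In that case the plain `.*?` version finds nothing unless the `/s` modifier is added, while the negated character class matches across the newlines on its own.

```perl
use strict;
use warnings;

# Hypothetical sample page: the number sits on its own line between the tags.
my $html = "<p>header</p><hr>\n34567\n<hr><p>footer</p>";

# Non-greedy dot: by default "." does not match a newline, so this fails here.
my ($dot) = $html =~ m{<hr>(.*?)<hr>};

# Same pattern with /s, which lets "." match newlines too.
my ($dot_s) = $html =~ m{<hr>(.*?)<hr>}s;

# Negated character class: [^<] matches anything except "<", newlines included.
my ($negated) = $html =~ m{<hr>([^<]*)<hr>};

# Trim surrounding whitespace from the captured text.
for ($dot_s, $negated) {
    s/^\s+|\s+$//g if defined $_;
}

print defined $dot ? "dot: $dot\n" : "dot: no match\n";
print "dot with /s: $dot_s\n";
print "negated class: $negated\n";
```

Both working variants still depend on the markup being exactly `<hr>...<hr>`; attributes or nested tags between the two would need a different pattern or a real parser.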
Re^2: Working with source of returned web page by clone4 (Sexton) on Jun 09, 2008 at 20:07 UTC
Stupid me, just couldn't get it, it's exactly what I needed. Well big thanks and I'm gonna revise regex a lot more...
Re: Working with source of returned web page by Popcorn Dave (Abbot) on Jun 10, 2008 at 03:30 UTC
If you've got the source of the page, you should feed it through something like HTML::TokeParser to find the specific bit you're looking for.

Revolution. Today, 3 O'Clock. Meet behind the monkey bars.
I would love to change the world, but they won't give me the source code
Re^2: Working with source of returned web page by clone4 (Sexton) on Jun 10, 2008 at 20:39 UTC
Wow, that's really handy, but when it comes to more specific strings, which aren't defined by any HTML tags, it's necessary to use a regex anyway... But still, as I was saying, very useful! Thanks again
Re^3: Working with source of returned web page by Popcorn Dave (Abbot) on Jun 10, 2008 at 22:48 UTC
It's been a while since I used that module, but if I recall correctly, it parses everything into tokens, and anything not defined as an HTML tag should come back as a text token.

Take a look at HTML::TokeParser help - parsing headlines and you'll see a quick program I wrote to dump an HTML page to tokenized output. Run that on your page and I think you'll see you don't need to do the regex per se, but rather need to check text tokens to find what you're after.

Good luck!

Update: Changed link from scratchpad to node as per suggestion by ww
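As a rough illustration of checking text tokens (this is not the program from the linked node, just a minimal sketch), the loop below walks the tokens with HTML::TokeParser and keeps the text tokens that consist only of digits. The sample HTML and the digits-only test are assumptions made for the example; in practice `$html` would be `$response->content`.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page standing in for $response->content.
my $html = '<html><body><p>Intro</p><hr> 34567 <hr><p>Outro</p></body></html>';

# Passing a scalar reference makes the parser read from the string.
my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @found;
while (my $token = $parser->get_token) {
    # Text tokens have the form ["T", $text, $is_data].
    next unless $token->[0] eq 'T';
    my $text = $token->[1];

    # Assumed criterion: the text is a number, possibly padded with whitespace.
    push @found, $1 if $text =~ /^\s*(\d+)\s*$/;
}

print "Found: @found\n";
```

A regex is still used here, but only on a single text token at a time, so it never has to cope with the surrounding markup.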