HTML source grab

venimfrogtongue has asked for the wisdom of the Perl Monks concerning the following question:

Re: HTML source grab by steves (Curate) on Mar 06, 2002 at 05:07 UTC
I don't see the source ... To parse specific tags out of the returned HTML content, I'd recommend using something like HTML::Parser. There are plenty of examples in the docs on how to pull content from specific tags. I can't comment on an easier way without seeing the code, but I don't get the "don't use LWP" comment. LWP::UserAgent is about the simplest yet most complete package for automating web access I've ever used. What are they suggesting you use instead?
Re: HTML source grab by silent11 (Vicar) on Mar 06, 2002 at 05:47 UTC
steves is right. LWP::Simple and HTML::Parser are the way to go. Take a look at some code I posted here earlier today; hopefully it helps. You may also want to look at this tutorial in the tutorial section. -Silent11
Re: HTML source grab by cjf (Parson) on Mar 06, 2002 at 05:52 UTC
Use LWP::UserAgent to grab the HTML source and then use HTML::Parser to find the meta tags. You can then print the meta tags to the browser or do whatever with them. Also make sure to pay attention to security, especially if you're accepting input from the browser. I'm sure you don't want people using your script to feed nasty stuff to other people's scripts.
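The security point above can be sketched as a small check on the submitted URL before it is handed to LWP. This is only an illustration using core Perl: `is_safe_url` is a hypothetical helper, and the rules it applies (http/https only, a plain host name, a length cap) are assumptions rather than anything the thread specifies. It does not stop requests to internal hosts, so treat it as a starting point.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: accept only absolute http/https URLs with a plain
# host name, and reject anything with whitespace or control characters.
sub is_safe_url {
    my ($url) = @_;
    return 0 unless defined $url && length $url && length $url <= 2048;
    return 0 if $url =~ /[\s\x00-\x1f]/;
    return $url =~ m{\A https?:// [A-Za-z0-9.-]+ (?: :\d+ )? (?: [/?\#] \S* )? \z}xi
        ? 1
        : 0;
}

# Try a few candidate inputs, as a CGI script might receive them.
for my $candidate ('http://www.perlmonks.org/', 'file:///etc/passwd', 'javascript:alert(1)') {
    printf "%-28s %s\n", $candidate, is_safe_url($candidate) ? 'accepted' : 'rejected';
}
```

In a CGI script, the check would run on `param('urlinfo')` before the request is built, with an error page shown when it fails.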
Re: HTML source grab by gellyfish (Monsignor) on Mar 06, 2002 at 09:54 UTC
As well as the excellent advice given by others here, you might also want to take a quick look at these examples to see if there is anything there that might help. /J\
Re: HTML source grab by venimfrogtongue (Novice) on Mar 06, 2002 at 21:07 UTC
Sorry guys, I was stupid and forgot to post the script. Basically, I want to filter out and print only the meta tags. Note: there are many kinds of meta tags, such as meta name=copyright and meta name=keywords, if that means anything. I use LWP::UserAgent; a perlmonk told me yesterday that I should have used LWP::Simple instead. Thanks!!

Script:

    #!/usr/bin/perl -w
    use CGI qw(:all);
    use Fcntl qw(:flock);
    use HTTP::Request;
    print header;
    open(F, "pageinfo.htm");
    while(<F>) { print "$_"; }
    if(!param) { exit; }
    $fristring = param('urlinfo');
    use LWP::UserAgent;
    $ua = new LWP::UserAgent;
    $ua->agent("Mozilla/5.0");
    $ua->timeout('30');
    $req = new HTTP::Request GET => $fristring;
    $req->header('Accept' => 'text/html');
    $result = $ua->request($req);
    $_ = $result->content;
    $looper = 0;
    while($looper < 5000) { s/</&lt;/; $looper++; }
    $looper = 0;
    while($looper < 500) { s/\n/<BR>/; $looper++; }
    print $_;
Re: Re: HTML source grab by patgas (Friar) on Mar 12, 2002 at 21:45 UTC
I'll try to point out a few things that immediately leap out at me...

You've got the right idea by using `-w`. I recommend using `strict` as well to catch typos, etc. There's plenty of stuff around PerlMonks extolling its virtues, so I won't go over it here.

You're not checking the status of opening your file for reading. Try something like this instead: `open( F, '<', 'pageinfo.htm' ) or die "Can't open pageinfo.htm: $!";`

Your first while loop to print the contents of the file can be shortened to `print while <F>;`

You're not closing the file after you're done with it.

You should check the result of your LWP request using `$result->is_success` and show appropriate error messages, etc.

I'm not even sure what you're trying to do with those last two while loops. The `$result->content` is one big scalar, so one global substitution (the `/g` flag) for each pattern will take care of it, i.e.: `s/</&lt;/g; s/\n/<BR>/g;`

And finally, here's a quick example of using HTML::TokeParser to get the meta tags out of an HTML document:

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;
    use LWP::Simple;

    my $source = get( shift || 'http://www.perlmonks.org' );
    my $parser = HTML::TokeParser->new( \$source );
    while ( my $tag = $parser->get_tag( 'meta' ) ) {
        print $tag->[3], "\n";
    }
    print "Done.\n";

I'll leave it as an exercise to use this in a CGI script and format your output accordingly. I hope all this helps... good luck!

"As information travels faster in the modern age, as our days are crawling by so slowly." -- DCFC
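The point about using one global substitution per pattern, rather than a counted loop, can be sketched as follows. This is a minimal illustration in core Perl. `html_for_display` is a hypothetical helper name, and escaping `&` and `>` as well goes slightly beyond what the original script attempted.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw HTML source into text that a browser
# will show literally instead of rendering.
sub html_for_display {
    my ($html) = @_;
    for ($html) {            # alias $_ to our private copy
        s/&/&amp;/g;         # first, so the entities added below are not re-escaped
        s/</&lt;/g;
        s/>/&gt;/g;
        s/\r?\n/<BR>\n/g;    # one <BR> per line ending
    }
    return $html;
}

print html_for_display(qq{<meta name="keywords" content="perl & cgi">\n});
```

Each substitution walks the whole string once, so no loop counter is needed and nothing is missed on pages with more than 5000 tags or 500 lines.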
Re: Re: HTML source grab by jryan (Vicar) on Mar 13, 2002 at 16:05 UTC
In addition to the excellent comments by patgas, I'd like to add a few things.

Near the start of your script, you `use CGI qw(:all);` While it is very good that you are using the CGI module, and I definitely don't want to discourage that, I don't see why you want to use the `:all` tag. The `:all` tag will export all bajillion functions from the CGI module into your namespace, which will definitely slow your script down. For most general uses, the `qw(:standard)` tag will be enough for all of your needs, and in this specific case, `qw(header param);` will be sufficient, since you are only using those 2 functions from the CGI module.

You `use Fcntl qw(:flock);` but never use the flock() function. Do you realize that flock needs to be called manually in the code, and that file locking isn't automatic? See the flock documentation for more details and examples. I'm not sure where flock would do you any good in this case, as files opened only for reading don't need to be flocked. Flock is there so that multiple scripts that have the same file open for writing do not overwrite each other's output.

Variables are allowed in Perl (and have even been known to increase readability!); you don't have to do EVERYTHING with $_ :)

Using LWP::UserAgent in this instance is fine; it's just that the monk probably thought that LWP::Simple might be easier for you to use.

Additionally, to parse out meta tags, you might want to take a look at HTML::TokeParser, as it was designed to parse out specific tags. Luckily for you, HTML::TokeParser ships with the HTML-Parser distribution, which LWP itself depends on, so you already have it. At any rate, if the examples in the docs aren't enough, try using the Super Search to dig up more examples.
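For reference, the point that flock must be called explicitly can be sketched with a writer that appends to a shared file. This is a minimal illustration using only core modules. The temporary log file and the `append_line` helper are hypothetical, and the locking is advisory, so it only helps if every writer cooperates.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Temp qw(tempfile);

# Hypothetical log file; a temp file keeps the sketch self-contained.
my ($tmp_fh, $log_path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Append one line while holding an exclusive lock on the file.
sub append_line {
    my ($path, $line) = @_;
    open(my $fh, '>>', $path) or die "Can't open $path: $!";
    flock($fh, LOCK_EX)       or die "Can't lock $path: $!";
    print {$fh} "$line\n";
    close($fh)                or die "Can't close $path: $!";   # closing releases the lock
    return 1;
}

append_line($log_path, 'first entry');
append_line($log_path, 'second entry');
```

Importing `:flock` only provides the `LOCK_*` constants; nothing is locked until `flock` itself runs on an open filehandle.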