Re: HTML source grab

Sorry guys, I was stupid and forgot to post the script. Basically, I want to filter out and print only the meta tags. Note: there are many meta tags, such as meta name=copyright and meta name=keywords, if that matters. I use LWP::UserAgent; a perlmonk told me yesterday that I should have used LWP::Simple instead. Thanks!! Script:

#!/usr/bin/perl -w

use CGI qw(:all);
use Fcntl qw(:flock);
use HTTP::Request;

print header;        

open(F, "pageinfo.htm");

while(<F>)
    {
    print "$_";
    }

if(!param)
    {
 exit;    
}



$fristring = param('urlinfo');

  use LWP::UserAgent;
  $ua = new LWP::UserAgent;
  $ua->agent("Mozilla/5.0");  
  $ua->timeout('30');
 
  $req = new HTTP::Request GET => $fristring;
  $req->header('Accept' => 'text/html');

  $result = $ua->request($req);
   
$_ = $result->content;

$looper = 0;

while($looper < 5000)
    {
     s/</&lt;/;
        $looper++;
    }
$looper = 0;

while($looper < 500)
    {
     s/\n/<BR>/;
    $looper++;
    }

print $_;

Re: Re: HTML source grab by patgas (Friar) on Mar 12, 2002 at 21:45 UTC
I'll try to point out a few things that immediately leap out at me...

You've got the right idea by using `-w`. I recommend using `strict` as well to catch typos and similar mistakes. There's plenty of material around PerlMonks extolling its virtues, so I won't go over it here.

You're not checking whether opening your file for reading succeeded. Try something like this instead: `open( F, '<', 'pageinfo.htm' ) or die "Can't open pageinfo.htm: $!";`

Your first while loop, which prints the contents of the file, can be shortened to `print while <F>;`

You're not closing the file after you're done with it.

You should check the result of your LWP request using `$result->is_success` and show an appropriate error message if it failed.

I'm not even sure what you're trying to do with those last two while loops. `$result->content` is one big scalar, so a single global substitution (the `/g` modifier) on it takes care of each of them, i.e.: `s/</&lt;/g; s/\n/<BR>/g;`

And finally, here's a quick example of using HTML::TokeParser to get the meta tags out of an HTML document:

#!/usr/bin/perl -w

use strict;
use HTML::TokeParser;
use LWP::Simple;

my $source = get( shift || 'http://www.perlmonks.org' );
my $parser = HTML::TokeParser->new( \$source );

while ( my $tag = $parser->get_tag( 'meta' ) ) {
    print $tag->[3], "\n";
}

print "Done.\n";

I'll leave it as an exercise to use this in a CGI script and format your output accordingly. I hope all this helps... good luck!

"As information travels faster in the modern age, as our days are crawling by so slowly." -- DCFC
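
For what it's worth, here is a minimal sketch of the substitution advice above, assuming the goal of those two loops was to show the fetched source as text in the browser. The `escape_for_display` name and the extra `&` escape are my own additions and do not appear in the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape markup so a browser shows it as text instead of rendering it.
# Each substitution uses /g, so every occurrence is replaced in one pass
# and no counting loop is needed.
sub escape_for_display {
    my ($html) = @_;
    $html =~ s/&/&amp;/g;    # assumption: not in the original script
    $html =~ s/</&lt;/g;
    $html =~ s/\n/<BR>/g;
    return $html;
}

my $sample = qq{<meta name="keywords">\n<p>Hi</p>\n};
print escape_for_display($sample), "\n";
# prints: &lt;meta name="keywords"><BR>&lt;p>Hi&lt;/p><BR>
```

The `&` is escaped first so that the `&lt;` produced by the second substitution is not escaped again, and newlines are converted last so the inserted `<BR>` tags survive.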
Re: Re: HTML source grab by jryan (Vicar) on Mar 13, 2002 at 16:05 UTC
In addition to the excellent comments by patgas, I'd like to add a few things.

Near the start of your script, you `use CGI qw(:all);` While it is very good that you are using the CGI module, and I definitely don't want to discourage that, I don't see why you want to use the `:all` tag. The `:all` tag will export all bajillion functions from the CGI module into your namespace, which will definitely slow your script down. For most general uses, the `qw(:standard)` tag will be enough for all of your needs, and in this specific case, `qw(header param);` will be sufficient, since you are only using those 2 functions from the CGI module.

You `use Fcntl qw(:flock);` but never use the flock() function. Do you realize that flock needs to be called manually in the code, and that file locking isn't automatic? See the flock documentation for more details and examples. I'm not sure where flock would do you any good in this case, as files that are only open for reading don't need to be flocked. Flock is there so that multiple scripts that have the same file open for writing do not overwrite each other's output.

Variables are allowed in Perl (and even have been known to increase readability!); you don't have to do EVERYTHING with $_ :)

Using LWP::UserAgent in this instance is fine; it's just that the monk probably thought that LWP::Simple might be easier for you to use.

Additionally, to parse out meta tags, you might want to take a look at HTML::TokeParser, as it was designed to parse out specific tags. Luckily for you, HTML::TokeParser ships with the HTML-Parser distribution, which LWP depends on, so you should already have it. At any rate, if the examples in the docs aren't enough, try using the Super Search to dig up more examples.
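
To illustrate the point that locking is an explicit call, here is a small sketch, assuming a script that appends to a file other scripts may also write to. The `append_line` helper and the temporary file standing in for a shared log are hypothetical; nothing like them appears in the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);          # imports LOCK_EX and friends; locks nothing by itself
use File::Temp qw(tempfile);

# Append one line to a file while holding an exclusive lock on it.
sub append_line {
    my ( $path, $line ) = @_;
    open( my $fh, '>>', $path ) or die "Can't open $path: $!";
    flock( $fh, LOCK_EX )       or die "Can't lock $path: $!";
    print {$fh} "$line\n";
    close($fh) or die "Can't close $path: $!";    # closing releases the lock
}

# A temporary file stands in for a real shared log (assumption).
my ( undef, $log ) = tempfile( UNLINK => 1 );
append_line( $log, 'http://www.perlmonks.org' );
```

Note that flock is advisory: it only protects against other writers that also call flock on the same file.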