sulfericacid has asked for the wisdom of the Perl Monks concerning the following question:
Synopsis:
    use MetaParser;

    my $parse   = new MetaParser;
    my $content = $parse->getc('http://www.spydersubmission.com');
    my %meta    = $parse->meta('http://www.spydersubmission.com');

    print $content;
    foreach (keys %meta) {
        print "$_ => $meta{$_}\n";
    }
Description: Provides a very simple way to extract meta content from a web page.
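Object methods:
    getc($url)  returns the raw content of the page at $url
    meta($url)  returns a hash of meta names (or http-equiv values) mapped to their content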
Example:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use MetaParser;

    my $parse = new MetaParser;
    my %meta  = $parse->meta('http://www.spydersubmission.com');

    foreach (keys %meta) {
        print "$_ => $meta{$_}\n";
    }

Yields:
    language => EN-US
    copyright => 2004 SpyderSubmission.com
    author => SpyderSubmission
    description => Certified marketing consultants who will bring your site to the top of engines for less than your morning coffee.
    distribution => global
    rating => general
    keywords => search engine optimization, search engine optimization serices, search engine optimization training, search engine positioning, SEO, website submissions, web site submissions, web site promotion, website promotion, web site marketing, engine ranking, google page rank
    distributor => SpyderSubmission
    robots => index, follow
    abstract => Leaders in online marketing services
Source code:
    package MetaParser;

    use strict;
    use LWP::Simple;

    sub new {
        my $pkg = shift;
        my $obj = {@_};
        $obj = bless {%$obj},$pkg || die 'unable to bless object!';
        return $obj;
    }

    sub getc {
        my $obj = shift;
        my $url = shift;
        my $content = get($url);
        return $content;
    }

    sub meta {
        my $obj = shift;
        my $url = shift;
        my $content = get($url);
        die "Error retrieving $url" unless defined $content;

        my @content_lines = split(/\n/, $content);   # let's make a gigantic string with all the
        my $single_line = join("", @content_lines);  # lines of HTML on one line. Come on, it'll be fun

        my %meta;

        # <meta name = "name" content = "content" \>
        $meta{$1} = $2 while $single_line =~ m/<meta\s+name\s*=\s*"([^"]+)"\s*content\s*=\s*"([^"]+)"\s*\/>/gi;
        $meta{$1} = $2 while $single_line =~ m/<meta\s+name\s*=\s*"([^"]+)"\s*content\s*=\s*"([^"]+)"\s*>/gi;

        # <meta name = 'name' content = 'content' \>
        $meta{$1} = $2 while $single_line =~ m/<meta\s+name\s*=\s*'([^']+)'\s*content\s*=\s*'([^']+)'\s*\/>/gi;
        $meta{$1} = $2 while $single_line =~ m/<meta\s+name\s*=\s*'([^']+)'\s*content\s*=\s*'([^']+)'\s*>/gi;

        # <meta http-equiv = "name" content = "content" \>
        $meta{$1} = $2 while $single_line =~ m/<meta\s+http-equiv\s*=\s*"([^"]+)"\s*content=\s*"([^"]+)"\s*\/>/gi;
        $meta{$1} = $2 while $single_line =~ m/<meta\s+http-equiv\s*=\s*"([^"]+)"\s*content=\s*"([^"]+)"\s*>/gi;

        # <meta content = "content" name = "name" \>
        $meta{$2} = $1 while $single_line =~ m/<meta\s+content\s*=\s*"([^"]+)"\s*name\s*=\s*"([^"]+)"\s*\/>/gi;
        $meta{$2} = $1 while $single_line =~ m/<meta\s+content\s*=\s*"([^"]+)"\s*name\s*=\s*"([^"]+)"\s*>/gi;

        # <meta content = 'content' name = 'name' \>
        $meta{$2} = $1 while $single_line =~ m/<meta\s+content\s*=\s*'([^']+)'\s*name\s*=\s*'([^']+)'\s*\/>/gi;
        $meta{$2} = $1 while $single_line =~ m/<meta\s+content\s*=\s*'([^']+)'\s*name\s*=\s*'([^']+)'\s*>/gi;

        return %meta;
    }

    1;
Yes, I used regexes to parse the HTML instead of letting other modules do it for me, and because of that I know this isn't 100% perfect, but neither are the other scripts out there that parse HTML.
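To make "isn't 100% perfect" concrete, here is a small sketch that runs one of the module's patterns (double-quoted name followed by content, closed with ">") against a few tags. The sample tags and the labels in %samples are made up for illustration and do not come from any real page.

```perl
use strict;
use warnings;

# One pattern from meta(): name="..." followed directly by content="...".
my $pattern = qr/<meta\s+name\s*=\s*"([^"]+)"\s*content\s*=\s*"([^"]+)"\s*>/i;

# Hypothetical sample tags, invented for this sketch.
my %samples = (
    plain       => '<meta name="author" content="Someone">',
    extra_attr  => '<meta name="author" lang="en" content="Someone">',
    unquoted    => '<meta name=author content=Someone>',
    empty_value => '<meta name="keywords" content="">',
);

# Record which samples the pattern matches and report each one.
my %matched;
for my $label (sort keys %samples) {
    $matched{$label} = $samples{$label} =~ $pattern ? 1 : 0;
    printf "%-12s %s\n", $label, $matched{$label} ? 'matched' : 'missed';
}
```

Only the plain tag matches. An attribute between name and content, unquoted values, and an empty content value all slip past this pattern, which is the kind of gap a dedicated HTML parsing module is meant to close.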
I know this isn't CPAN-worthy, but since I deal with meta tags a lot in my scripts, it will be very useful for my projects.
Please let me know what you think, ways to improve this, things I've missed, etc.
UPDATE: added more regexes to pick up more tags
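One possible alternative to adding a regex per tag layout is to work in two steps: find each meta tag first, then pull its attributes out in whatever order and quote style they appear. The sketch below is only an illustration of that idea, not the module's code; extract_meta and the sample HTML are hypothetical, and it still assumes quoted values with no ">" inside them.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of name (or http-equiv) => content.
sub extract_meta {
    my ($html) = @_;
    my %meta;

    # Step 1: find each <meta ...> tag, with or without a trailing slash.
    while ($html =~ /<meta\b([^>]*)>/gi) {
        my $attr_text = $1;
        my %attr;

        # Step 2: collect key="value" or key='value' pairs in any order.
        while ($attr_text =~ /([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g) {
            $attr{lc $1} = defined $2 ? $2 : $3;
        }

        # Prefer name, fall back to http-equiv; skip tags without content.
        my $key = defined $attr{name} ? $attr{name} : $attr{'http-equiv'};
        next unless defined $key && defined $attr{content};
        $meta{$key} = $attr{content};
    }
    return %meta;
}

# Made-up input covering three of the layouts the module targets.
my $html = <<'HTML';
<meta name="author" content="Someone">
<meta content='general' name='rating' />
<meta http-equiv="Content-Language" content="en">
HTML

my %meta = extract_meta($html);
print "$_ => $meta{$_}\n" for sort keys %meta;
```

With this shape, a new attribute order or an extra attribute needs no new pattern, though it is still regex parsing and shares the usual limits of that approach.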
Special thanks to Enlil for assisting with non-greedy regexes and Castaway for finding a real sweet solution of putting the entire source code in a single line.
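A rough sketch of why the single-line trick helps: a tag that is split across two lines cannot match when each line is tested on its own, but does match once the lines are joined. The page fragment here is invented for the example, and the pattern is one of those used in meta().

```perl
use strict;
use warnings;

# Hypothetical page fragment with one meta tag broken across two lines.
my $content = qq{<meta name="author"\n      content="Someone">\n};

my $pattern = qr/<meta\s+name\s*=\s*"([^"]+)"\s*content\s*=\s*"([^"]+)"\s*>/i;

# Line by line: neither line holds the whole tag, so nothing matches.
my $per_line_hits = grep { $_ =~ $pattern } split /\n/, $content;

# Joined into one string, as meta() does. The indentation on the second
# line keeps the two attributes apart once the newline is gone.
my $single_line = join '', split /\n/, $content;
my ($name, $value) = $single_line =~ $pattern;

print "per-line matches: $per_line_hits\n";
print "joined: $name => $value\n";
```

Since \s also matches a newline, the same pattern should match the unjoined $content as a whole string too; the join matters mostly when the alternative is processing the page one line at a time.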
Replies are listed 'Best First'.

Re: Critique/Test my first module MetaParser
by tachyon (Chancellor) on Nov 15, 2004 at 01:51 UTC

Re: Critique/Test my first module MetaParser
by chromatic (Archbishop) on Nov 15, 2004 at 01:40 UTC

Re: Critique/Test my first module MetaParser
by tachyon (Chancellor) on Nov 15, 2004 at 02:27 UTC

Re: Critique/Test my first module MetaParser
by brian_d_foy (Abbot) on Nov 15, 2004 at 05:52 UTC

Re: Critique/Test my first module MetaParser
by ysth (Canon) on Nov 15, 2004 at 04:09 UTC

Re: Critique/Test my first module MetaParser
by FoxtrotUniform (Prior) on Nov 15, 2004 at 01:31 UTC