I am looking for people to test or critique this module, but keep in mind this is my first attempt at one and I might not know everything (s/might/sure as heck don't/) about them yet.

Synopsis:

    use MetaParser;

    my $parse   = new MetaParser;
    my $content = $parse->getc('http://www.spydersubmission.com');
    my %meta    = $parse->meta('http://www.spydersubmission.com');

    print $content;

    foreach (keys %meta) {
        print "$_ => $meta{$_}\n";
    }

Description: Provides a very simple way to extract meta content from a web page.

Object methods:

  • $parse->getc("$url");
        Retrieves the entire page source code. Identical to LWP::Simple's get()

  • $parse->meta("$url");
        Retrieves the meta content from the header of the given URL. Returned as a hash.

    Example:

        #!/usr/bin/perl
        use warnings;
        use strict;
        use MetaParser;

        my $parse = new MetaParser;
        my %meta  = $parse->meta('http://www.spydersubmission.com');

        foreach (keys %meta) {
            print "$_ => $meta{$_}\n";
        }

    Yields:
        language => EN-US
        copyright => 2004 SpyderSubmission.com
        author => SpyderSubmission
        description => Certified marketing consultants who will bring your site to the top of engines for less than your morning coffee.
        distribution => global
        rating => general
        keywords => search engine optimization, search engine optimization serices, search engine optimization training, search engine positioning, SEO, website submissions, web site submissions, web site promotion, website promotion, web site marketing, engine ranking, google page rank
        distributor => SpyderSubmission
        robots => index, follow
        abstract => Leaders in online marketing services

    Source code:

        package MetaParser;

        use strict;
        use LWP::Simple;

        sub new {
            my $pkg = shift;
            my $obj = {@_};
            $obj = bless {%$obj},$pkg || die 'unable to bless object!';
            return $obj;
        }

        sub getc {
            my $obj = shift;
            my $url = shift;
            my $content = get($url);
            return $content;
        }

        sub meta {
            my $obj = shift;
            my $url = shift;
            my $content = get($url);
            die "Error retrieving $url" unless defined $content;

            my @content_lines = split(/\n/, $content);   # let's make a gigantic string with all the
            my $single_line = join("", @content_lines);  # lines of HTML on one line. Come on, it'll be fun
            my %meta;

            # <meta name = "name" content = "content" />
            $meta{$1} = $2 while $single_line =~
                m/<meta\s+name\s*=\s*"([^"]+)"\s*content\s*=\s*"([^"]+)"\s*\/>/gi;
            $meta{$1} = $2 while $single_line =~
                m/<meta\s+name\s*=\s*"([^"]+)"\s*content\s*=\s*"([^"]+)"\s*>/gi;

            # <meta name = 'name' content = 'content' />
            $meta{$1} = $2 while $single_line =~
                m/<meta\s+name\s*=\s*'([^']+)'\s*content\s*=\s*'([^']+)'\s*\/>/gi;
            $meta{$1} = $2 while $single_line =~
                m/<meta\s+name\s*=\s*'([^']+)'\s*content\s*=\s*'([^']+)'\s*>/gi;

            # <meta http-equiv = "name" content = "content" />
            $meta{$1} = $2 while $single_line =~
                m/<meta\s+http-equiv\s*=\s*"([^"]+)"\s*content=\s*"([^"]+)"\s*\/>/gi;
            $meta{$1} = $2 while $single_line =~
                m/<meta\s+http-equiv\s*=\s*"([^"]+)"\s*content=\s*"([^"]+)"\s*>/gi;

            # <meta content = "content" name = "name" />
            $meta{$2} = $1 while $single_line =~
                m/<meta\s+content\s*=\s*"([^"]+)"\s*name\s*=\s*"([^"]+)"\s*\/>/gi;
            $meta{$2} = $1 while $single_line =~
                m/<meta\s+content\s*=\s*"([^"]+)"\s*name\s*=\s*"([^"]+)"\s*>/gi;

            # <meta content = 'content' name = 'name' />
            $meta{$2} = $1 while $single_line =~
                m/<meta\s+content\s*=\s*'([^']+)'\s*name\s*=\s*'([^']+)'\s*\/>/gi;
            $meta{$2} = $1 while $single_line =~
                m/<meta\s+content\s*=\s*'([^']+)'\s*name\s*=\s*'([^']+)'\s*>/gi;

            return %meta;
        }

        1;
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Yes, I used regexes to parse the HTML instead of using other modules to do it for me, and because of that I know this isn't 100% perfect, but neither are the other scripts that parse HTML.
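
For comparison, here is a rough sketch of a different way to stay with regexes: match each meta tag once, then pick its attributes apart, so attribute order and quote style stop mattering. This is not what MetaParser does. The name extract_meta is made up for illustration, the sample HTML is invented, and it works on a string rather than fetching a URL. It is still a regex approach, so it will trip over a ">" inside an attribute value or over unquoted values.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MetaParser): collect name/http-equiv => content
# pairs from an HTML string, regardless of attribute order or quote style.
sub extract_meta {
    my ($html) = @_;
    my %meta;

    # Step 1: find each <meta ...> tag. [^>]* also spans newlines,
    # so the page does not need to be joined into one line first.
    while ($html =~ /<meta\b([^>]*)>/gi) {
        my $attr_text = $1;
        my %attr;

        # Step 2: collect attr="value" or attr='value' pairs from that tag.
        while ($attr_text =~ /([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g) {
            $attr{ lc $1 } = defined $2 ? $2 : $3;
        }

        # Step 3: key on name, falling back to http-equiv.
        my $key = defined $attr{name} ? $attr{name} : $attr{'http-equiv'};
        $meta{$key} = $attr{content}
            if defined $key && defined $attr{content};
    }
    return %meta;
}

# Invented sample input covering the tag shapes handled in the module above.
my $html = <<'HTML';
<head>
<meta name="author" content="SpyderSubmission">
<meta content='index, follow' name='robots' />
<meta http-equiv="Content-Type"
      content="text/html; charset=iso-8859-1">
</head>
HTML

my %meta = extract_meta($html);
print "$_ => $meta{$_}\n" for sort keys %meta;
```

The trade-off is two small patterns instead of one pattern per tag layout. Whether that is worth it depends on how varied the pages being parsed are.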

    I know this isn't CPAN-worthy, but since I deal with meta tags a lot in my scripts, this will be very useful for my projects.

    Please let me know what you think, ways to improve this, things I've missed, etc.

    UPDATE: added more regexes to pick up more tags

    Special thanks to Enlil for assisting with non-greedy regexes and Castaway for finding a real sweet solution of putting the entire source code in a single line.



    "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

    sulfericacid

    In reply to Critique/Test my first module MetaParser by sulfericacid
