sulfericacid has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for people to test or critique this module but keep in mind this is my first attempt at one and I might now know everything (s/might/sure as heck don't/) about them yet.

Synopsis:

use MetaParser; my $parse = new MetaParser; my $content = $parse->getc('http://www.spydersubmission.com'); my %meta = $parse->meta('http://www.spydersubmission.com'); print $content; foreach (keys %meta) { print "$_ => $meta{$_}\n"; }

Description: Provides a very simple way to extract meta content from a web page.

Object methods:

  • $parse->getc("$url");
        Retrieves the entire page source code. Identical to LWP::Simple's get()

  • $parse->meta("$url");
        Retrieves the meta content from the header of the given URL. Returned as a hash.

    Example:

    #!/usr/bin/perl use warnings; use strict; use MetaParser; my $parse = new MetaParser; my %meta = $parse->meta('http://www.spydersubmission.com'); foreach (keys %meta) { print "$_ => $meta{$_}\n"; }
    Yields:
    language => EN-US copyright => 2004 SpyderSubmission.com author => SpyderSubmission description => Certified marketing consultants who will bring your + site to the top of engines for less than your morning coffee. distribution => global rating => general keywords => search engine optimization, search engine optimization + serices, search engine optimization training, search engine position +ing, SEO, websit e submissions, web site submissions, web site promotion, website promo +tion, web site marketing, engine ranking, google page rank distributor => SpyderSubmission robots => index, follow abstract => Leaders in online marketing services

    Source code:

    package MetaParser; use strict; use LWP::Simple; sub new { my $pkg = shift; my $obj = {@_}; $obj = bless {%$obj},$pkg || die 'unable to bless object!'; return $obj; } sub getc { my $obj = shift; my $url = shift; my $content = get($url); return $content; } sub meta { my $obj = shift; my $url = shift; my $content = get($url); die "Error retriving $url" unless defined $content; my @content_lines = split(/\n/, $content); # let's make a gigan +tic string with all the my $single_line = join("", @content_lines); # lines of HTML on on +e line. Come on, it'll be fun my %meta; # <meta name = "name" content = "content" \> $meta{$1} = $2 while $single_line =~ m/<meta\s+name\s*=\s*"([^"]+) +"\s*content\s*=\s*"([^"]+)"\s*\/>/gi; $meta{$1} = $2 while $single_line =~ m/<meta\s+name\s*=\s*"([^"]+) +"\s*content\s*=\s*"([^"]+)"\s*>/gi; # <meta name = 'name' content = 'content' \> $meta{$1} = $2 while $single_line =~ m/<meta\s+name\s*=\s*'([^']+) +'\s*content\s*=\s*'([^']+)'\s*\/>/gi; $meta{$1} = $2 while $single_line =~ m/<meta\s+name\s*=\s*'([^']+) +'\s*content\s*=\s*'([^']+)'\s*>/gi; # <meta http-equiv = "name" content = "content" \> $meta{$1} = $2 while $single_line =~ m/<meta\s+http-equiv\s*=\s*"( +[^"]+)"\s*content=\s*"([^"]+)"\s*\/>/gi; $meta{$1} = $2 while $single_line =~ m/<meta\s+http-equiv\s*=\s*"( +[^"]+)"\s*content=\s*"([^"]+)"\s*>/gi; # <meta content = "content" name = "name" \> $meta{$2} = $1 while $single_line =~ m/<meta\s+content\s*=\s*"([^" +]+)"\s*name\s*=\s*"([^"]+)"\s*\/>/gi; $meta{$2} = $1 while $single_line =~ m/<meta\s+content\s*=\s*"([^" +]+)"\s*name\s*=\s*"([^"]+)"\s*>/gi; # <meta content = 'content' name = 'name' \> $meta{$2} = $1 while $single_line =~ m/<meta\s+content\s*=\s*'([^' +]+)'\s*name\s*=\s*'([^']+)'\s*\/>/gi; $meta{$2} = $1 while $single_line =~ m/<meta\s+content\s*=\s*'([^' +]+)'\s*name\s*=\s*'([^']+)'\s*>/gi; return %meta; } 1;
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Yes, I used regexes to parse the HTML instead of using other modules to do it for me and because of that, I know this isn't 100% perfect but either are the other scripts made that parse HTML.

    I know this isn't CPAN worthy but since I deal with meta tags a lot with my scripts, this will be very useful for my projects.

    Please let me know what you think, ways to improve this, things I've missed, etc.

    UPDATE: added more regexes to pick up more tags

    Special thanks to Enlil for assisting with non-greedy regexes and Castaway for finding a real sweet solution of putting the entire source code in a single line.



    "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

    sulfericacid
  • Replies are listed 'Best First'.
    Re: Critique/Test my first module MetaParser
    by tachyon (Chancellor) on Nov 15, 2004 at 01:51 UTC

      Your parsing is very broken. You are making a whole load of invalid assumptions about the structure of a meta tag. You don't have to look very far for common examples of HTML syntax that breaks your parser. Perlmonks for example has meta tags. Your code does not extract them...... Please learn to use HTML::Parser or learn a lot more about what is and is not valid HTML.

      use LWP::Simple; use HTML::Parser; use Data::Dumper; my $data = get('http://www.spydersubmission.com'); my $p = HTML::Parser->new( api_version => 3, start_h => [ \&start, "self,tagname,attr" ], ); sub start { my ( $self, $tagname, $attr ) = @_; return unless $tagname eq 'meta'; my $name = $attr->{name} || $attr->{'http-equiv'} || undef; return unless defined $name; $self->{meta}->{$name} = $attr->{content} || 'NULL'; } $p->parse($data); $p->eof; print Dumper $p->{meta};

      cheers

      tachyon

    Re: Critique/Test my first module MetaParser
    by chromatic (Archbishop) on Nov 15, 2004 at 01:40 UTC

      Good work; it's nice to see people contributing new modules (and I could have used a nice prewritten module for this purpose a time or two in the past.) I have a few questions worth considering though.

      Why the indirect object notation?

      my $parse   = new MetaParser;

      Why create and dereference and rereference a hash, especially when your constructor takes no arguments?

      my $obj = {@_}; $obj = bless {%$obj},$pkg || die 'unable to bless object!';

      Why would bless fail?

      $obj = bless {%$obj},$pkg || die 'unable to bless object!';

      What benefit is there to making this an object?

    Re: Critique/Test my first module MetaParser
    by tachyon (Chancellor) on Nov 15, 2004 at 02:27 UTC

      As previously noted your parsing is broken. If you really want to use REs something like this is *less broken*

      use LWP::Simple; use Data::Dumper; my $content = get('http://www.perlmonks.com') or die $!; $content =~ s/\s+/ /g; # condense whitespace (optional) my @meta = $content =~ m/<\s*meta([^>]+)>/gi; for ( @meta ) { m/name\s*=\s*"([^"]+)/i # double quotes || m/name\s*=\s*'([^']+)/i # single quotes || m/name\s*=\s*(\S+)/i # no quotes || m/http-equiv\s*=\s*"([^"]+)/i || m/http-equiv\s*=\s*'([^']+)/i || m/http-equiv\s*=\s*(\S+)/i || next; my $name = $1; m/content\s*=\s*"([^"]+)/i || m/content\s*=\s*'([^']+)/i || m/content\s*=\s*(\S+)/i || next; $meta{$name} = $1; } print Dumper \%meta;

      cheers

      tachyon

    Re: Critique/Test my first module MetaParser
    by brian_d_foy (Abbot) on Nov 15, 2004 at 05:52 UTC

      Although you might be coding this for fun and edification, you might want to use HTML::Parser which comes with HTML::HeadParser.

      You'll drive yourself nuts trying to use regexen for this. They are the wrong tool for this because there are too many patterns to consider.

      --
      brian d foy <bdfoy@cpan.org>
      A reply falls below the community's threshold of quality. You may see it by logging in.
    Re: Critique/Test my first module MetaParser
    by ysth (Canon) on Nov 15, 2004 at 04:09 UTC
      I don't see the distinction between "useful" and "CPAN worthy". If it meets your needs, perhaps it will meet others.

      Disclaimer: this comment is not intended to promote the use of regular expressions to parse HTML, solve chess problems, or clean your bathroom sink.

    Re: Critique/Test my first module MetaParser
    by FoxtrotUniform (Prior) on Nov 15, 2004 at 01:31 UTC

      If your module deals with meta information only, why include getc?

      --
      Yours in pedantry,
      F o x t r o t U n i f o r m

      "Anything you put in comments is not tested and easily goes out of date." -- tye