No-Lifer has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, As a (reluctant) Perl scribe (a whole module on the scribing of Perl at University! Can you believe it!?) I'm having to write some code. I'm not very good at Perl (or programming in general, really), have trawled Google, and have now come on bended knee to seek wisdom from yourselves.

I am writing a simple search engine for a few pages in Perl/CGI. Can get most of it to work, however - how, oh how, can I extract the information from Meta tags - i.e.

<META NAME="description" CONTENT="An introduction to the basics of Perl scripting.">

I would like to pull the CONTENT value from the above and output it to my page, in the simplest way possible. I'm a layman, remember.

I can successfully take what's between the page title tags, chop off the start/ends and print that out, but the Meta tags are proving a little trickier.

I have many problems with my search engine, but if I could get this working, I would be well on the way forward.

If you'd like a look at it in action, please see www.ally.nu (search for "week", without the quotes, to return all) (mail: me@ally.nu). Many thanks for any help you can offer!

Al.

Replies are listed 'Best First'.
Re: Meta tag
by Cody Pendant (Prior) on Oct 20, 2005 at 01:52 UTC
    This should get you started:
    use HTML::TokeParser::Simple;

    my $parser = HTML::TokeParser::Simple->new( 'foo.html' ) || die "$!";

    while ( my $token = $parser->get_token() ) {
        if ( $token->is_tag( 'meta' ) && $token->get_attr( 'name' ) =~ /description/i ) {
            print $token->get_attr( 'content' );
        }
    }
    But first you'll probably have to install the module of course. The example opens a local file and you'll have to adapt it to open a URL or a file handle, etc.

    The example above will keep on processing the whole file, which is inefficient. On the other hand, what if you've got two META tags in the document? Do you need to know? Just add a "last" after the print if you don't care.
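The trade-off described above (stop at the first description, or keep going and collect them all) might be sketched roughly as follows. This is only an illustration, not the poster's code: the helper name `meta_descriptions`, its `$first_only` flag, and the sample HTML are invented here, and it assumes HTML::TokeParser::Simple is installed and that parsing from a string via `new( string => ... )` is acceptable for your case.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not from the thread): returns the CONTENT of every
# description META tag in $html, or only the first when $first_only is true.
sub meta_descriptions {
    my ( $html, $first_only ) = @_;
    my $parser = HTML::TokeParser::Simple->new( string => $html )
        or die "Can't create parser";
    my @found;
    while ( my $token = $parser->get_token ) {
        next unless $token->is_start_tag('meta');
        my $name = $token->get_attr('name');
        next unless defined $name && lc $name eq 'description';
        my $content = $token->get_attr('content');
        next unless defined $content;
        push @found, $content;
        last if $first_only;    # stop reading once we have what we need
    }
    return @found;
}

# Made-up sample document with two description tags.
my $html = <<'HTML';
<html><head>
<META NAME="description" CONTENT="First description.">
<meta name="description" content="Second description.">
</head><body></body></html>
HTML

print "$_\n" for meta_descriptions( $html, 1 );    # first match only
print "$_\n" for meta_descriptions($html);         # every match
```

Returning a list means the caller can decide what two description tags should mean, instead of the parsing loop deciding silently.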



    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print
Re: Meta tag
by sawtooth (Initiate) on Oct 19, 2005 at 23:40 UTC
    I'm not sure exactly how you want to implement your search, but perhaps this will help provide a hint as to at least one possibility. I'm sure someone will point out a better way to do it, but I found that this works:
    use strict;
    my ($htm_file);

    opendir(CUR_DIR, ".") || die "Can't open current directory: $!\n";
    foreach $htm_file (grep -f && -e && /\.htm$/i, readdir(CUR_DIR)) {
        open(HTM_FILE, "$htm_file") or die "Can't open $htm_file: $!\n";
        while (<HTM_FILE>){                     # while there are lines in the file
            if(/<meta/i){                       # if the line is a meta tag, have a look at it
                if(/content=\"(.+?)\">/i){      # if it contains 'content' grab it
                    print "$1\n";               # here's where you do something with it
                }
            }
        }
        close HTM_FILE or die "Can't close $htm_file: $!\n";
    }
    closedir CUR_DIR;
    Hope this helps. Liam
Re: Meta tag
by nedals (Deacon) on Oct 20, 2005 at 01:06 UTC
    I can successfully take what's between the page title tags, chop off the start/ends and print that out..

    How do you do that currently? Are you using a RegEx?

    A simple explanation of how you do that could lead to a method for extracting the META data that can be incorporated into your existing script.

      Currently, I use -
      if(/<TITLE>/) {
          chop;
          $title = $_;
          $title =~ s/<TITLE>//g;
          $title =~ s/<\/TITLE>//g;
      }
      which works well for me, and isn't too complex. If I could get that working for meta tags I'd be happy! :) Cheers for help guys

        Based on your response I conclude that this snippet is inside a while loop. I also conclude that the <title> and <meta> tags are on separate lines.

        use strict;
        my ($title,$meta) = ('','');
        while (<DATA>) {   ## <DATA> for test
            chomp;
            # ..
            if (/<title>/i) {
                $title = $_;
                $title =~ s/<title>(.+?)<\/title>/$1/i;
            }
            if (/<meta/i) {
                $meta = $_;
                $meta =~ s/<meta .+? CONTENT="(.+?)">/$1/i;
            }
            #..
        }
        print "RESULT:\n$title\n$meta\n";
        __DATA__
        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
        <html><head>
        <TITLE>untitled</TITLE>
        <META NAME="description" CONTENT="An introduction to the basics of Perl scripting.">
        </head>
        <body>
        </body>
        </html>
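The substitution above expects NAME to come before CONTENT and picks up any META line, not just the description. A possible variation, using only core Perl, is sketched below. It keeps the same assumptions as the reply (each tag sits on one line) and adds two more: attribute values are double-quoted and contain no ">" character. The helper name `description_from_line` and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): given one line of HTML, return
# the CONTENT value of a description META tag on that line, or undef.
sub description_from_line {
    my ($line) = @_;
    # Look at each <meta ...> tag on the line in turn.
    while ( $line =~ /<meta\b([^>]*)>/gi ) {
        my $attrs = $1;
        # Skip META tags that are not the description (keywords etc.).
        next unless $attrs =~ /\bname\s*=\s*"description"/i;
        # Attribute order no longer matters: CONTENT is matched on its own.
        return $1 if $attrs =~ /\bcontent\s*=\s*"([^"]*)"/i;
    }
    return;
}

# Made-up sample lines; only the first comes from the original question.
my @lines = (
    '<META NAME="description" CONTENT="An introduction to the basics of Perl scripting.">',
    '<meta content="Attribute order reversed." name="description">',
    '<meta name="keywords" content="perl, cgi">',
);

for my $line (@lines) {
    my $desc = description_from_line($line);
    print "$desc\n" if defined $desc;
}
```

Inside an existing `while (<FILE>)` loop this could be called once per line, next to the title handling. For HTML that does not keep each tag on one line, a real parser is the safer route.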