Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Please advise how I could find specific info in my HTML files (.htm and .html extensions). I need the text located between the title tags:
<TITLE>TITLE OF MY HTML PAGE</TITLE>
The output should print the name of the html file and the Title info like this:
mainpage.html TITLE OF MY HTML PAGE
secondpage.html Another Title in an HTML Page
Here is my attempt at it; I was hoping someone could help get it working correctly.
use File::Find;
sub wanted {
if( -f $_ = '*.htm* ) {
open ( F, $_ ) or die;
while( defined( $line = <F> ) ) {
if( $line =~ /<TITLE>(\w+)</TITLE>/i) {
print "FILE = $_ and TITLE = $1\n";
} elsif( $line =~ /<title>(.*$)<title>/i) {
print "FILE = $_ and TITLE = $1\n";
}
}
close F;
}
}
find( \&wanted, "." );

Re: Finding data
by Chady (Priest) on Feb 28, 2002 at 17:46 UTC

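    Try HTML::TokeParser.

    Here's an example:

    require HTML::TokeParser;
    my $obj = HTML::TokeParser->new($File) or die "Can't open $File: $!";  # $File: file name or filehandle
    $obj->get_tag("title");                     # advance to the <title> start tag
    my $title_text = $obj->get_text("/title");  # text up to the closing </title>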

    He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

    Chady | http://chady.net/
Re: Finding data
by dragonchild (Archbishop) on Feb 28, 2002 at 17:46 UTC
    Heh. Your regexes are the problem.

    First off, you should be using some sort of HTML parser. What if my HTML file is of the form:

    <TITLE> Blahblah </title>
    You'd never find "Blahblah".

    Secondly, the fix you're looking for is

    if ($line =~ /<TITLE>(\w+)<\/TITLE>/i)
    ----
    if ($line =~ m#<TITLE>(\w+)</TITLE>#i)
    Note the backslash in front of the slash for the first one and the different regex delimiters for the second.
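
To make the delimiter point concrete, here is a minimal sketch showing that the escaped-slash form and the alternate-delimiter form are the same pattern. The sample line and the variable names are made up for illustration and are not from the original poster's files.

```perl
use strict;
use warnings;

# Hypothetical sample line, chosen so that \w+ can match the whole title.
my $line = '<TITLE>Welcome</TITLE>';

# Form 1: keep / as the delimiter and escape the slash in the closing tag.
my ($escaped) = $line =~ /<TITLE>(\w+)<\/TITLE>/i;

# Form 2: switch the delimiter to # so the slash needs no escaping.
my ($alt_delim) = $line =~ m#<TITLE>(\w+)</TITLE>#i;

print "escaped: $escaped, alt: $alt_delim\n";
```

Both captures hold the same text; which form to use is purely a readability choice.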

    I have no idea what you're trying to do in your second regex. Why are you looking for the end of line with '$' before your line is done?

    Also, use indentation. Your code should look something like:

    use File::Find;

    sub wanted {
        if( -f $_ = '*.htm* ) {
            open ( F, $_ ) or die $/, $/;
            while( defined( $line = <F> ) ) {
                if($line =~ /<TITLE>(\w+)</TITLE>/i) {
                    print "FILE = $_ and TITLE = $1\n";
                }
                elsif( $line =~ /<title>(.*$)<title>/i) {
                    print "FILE = $_ and TITLE = $1\n";
                }
            }
            close F;
        }
    }

    find( \&wanted, "." );
    See how much easier that is to read?

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      I am still lost on why this isn't working. Please advise more.
Re: Finding data
by strat (Canon) on Feb 28, 2002 at 17:49 UTC
    sub wanted {
        if ( /\.html?$/ ) {    # if .htm or .html
            local $/;          # undefine the input record separator (slurp mode)
            # File::Find has already chdir'ed into the directory, so open $_
            unless ( open (HTML, $_) ) {
                warn "Error couldn't read $File::Find::name: $!\n";
            } # unless
            else {
                my $line = <HTML>;    # slurp whole file
                close (HTML);
                my ($title) = $line =~ m|<TITLE>(.*?)</TITLE>|is;
                # do with $title whatever you want...
            } # else
        } # if .html?
    } # wanted
    ...
    I haven't tested this code, but hope that it will work.
    /<TITLE>\w+</TITLE>/ will give a syntax error, because / is the separator and the / in </TITLE> is not escaped. Also, \w+ will never match a space...
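
As a rough, regex-only sketch of the slurp approach using core modules: the helper names `extract_title`, `slurp`, and `titles_under` are hypothetical, and a real HTML parser is still the more robust choice. The pattern uses `.*?` with `/s` so that titles containing spaces or line breaks are captured.

```perl
use strict;
use warnings;
use File::Find;

# Return the text between the first <title>...</title> pair, or undef.
# /i ignores tag case, /s lets . match newlines, .*? stops at the first closing tag.
sub extract_title {
    my ($html) = @_;
    return unless defined $html;
    my ($title) = $html =~ m{<title\b[^>]*>\s*(.*?)\s*</title>}is;
    return $title;
}

# Read a whole file into one string by undefining the input record separator.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    local $/;
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Build a hash of "path => title" for every .htm/.html file under $dir.
sub titles_under {
    my ($dir) = @_;
    my %titles;
    find(sub {
        return unless -f $_ && /\.html?$/i;
        # File::Find has chdir'ed into the file's directory, so open $_ here.
        my $title = extract_title(slurp($_));
        $titles{$File::Find::name} = $title if defined $title;
    }, $dir);
    return \%titles;
}

# Possible usage (left commented out so the helpers can be loaded on their own):
# my $titles = titles_under('.');
# print "$_ $titles->{$_}\n" for sort keys %$titles;
```

Files without a title are skipped silently here; whether to warn about them instead is a design choice left open.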

    Best regards,
    perl -le "s==*F=e=>y~\*martinF~stronat~=>s~[^\w]~~g=>chop,print"