Finding data

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Please advise how I could find specific info in my HTML files (.htm and .html extensions). I need the info that is located between the Title tags:

                         <TITLE>TITLE OF MY HTML PAGE</TITLE>

The output should print the name of the html file and the Title info like this:

                         mainpage.html      TITLE OF MY HTML PAGE
                         secondpage.html   Another Title in an HTML Page

Here is my attempt at it; I was hoping someone could help get it working correctly.

                         use File::Find;

                         sub wanted 
                         {
                         if( -f $_ = '*.htm* ) 
                         {
                         open ( F, $_ ) or die;
                         while( defined( $line = <F> ) ) 
                         {
                         if( $line =~ /<TITLE>(\w+)</TITLE>/i) 
                         {
                         print "FILE = $_  and TITLE = $1\n";
                         }
                         elsif( $line =~ /<title>(.*$)<title>/i)
                         {
                         print "FILE = $_  and TITLE = $1\n";
                         }

                         }
                         close F;
                         }
                         }
                         find( \&wanted, "." );


Replies are listed 'Best First'.
Re: Finding data by Chady (Priest) on Feb 28, 2002 at 17:46 UTC
Try HTML::TokeParser. Here's an example:

    require HTML::TokeParser;
    my $obj        = HTML::TokeParser->new($File);
    my $token      = $obj->get_tag("title");      # advance to the <title> start tag
    my $title_text = $obj->get_text("/title");    # get_text is a parser method; reads up to </title>

He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.
Chady | http://chady.net/
Re: Finding data by dragonchild (Archbishop) on Feb 28, 2002 at 17:46 UTC
Heh. Your regexes are the problem. First off, you should be using some sort of HTML parser. What if my HTML file is of the form:

    <TITLE>
    Blahblah
    </title>

You'd never find "Blahblah". Secondly, the fix you're looking for is either of:

    if ($line =~ /<TITLE>(\w+)<\/TITLE>/i)

    if ($line =~ m#<TITLE>(\w+)</TITLE>#i)

Note the backslash in front of the slash for the first one and the different regex delimiters for the second. I have no idea what you're trying to do in your second regex. Why are you looking for the end of line with '$' before your line is done? Also, use indentation. Your code should look something like:

    use File::Find;

    sub wanted {
        if( -f $_ = '*.htm* ) {
            open ( F, $_ ) or die $/, $/;
            while( defined( $line = <F> ) ) {
                if($line =~ /<TITLE>(\w+)</TITLE>/i) {
                    print "FILE = $_ and TITLE = $1\n";
                }
                elsif( $line =~ /<title>(.*$)<title>/i) {
                    print "FILE = $_ and TITLE = $1\n";
                }
            }
            close F;
        }
    }

    find( \&wanted, "." );

See how much easier that is to read?

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.
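As a minimal sketch of the two points in this reply (alternate delimiters, and a title that is split across lines), the following matches against a whole string rather than one line at a time. The helper name `extract_title` and the sample strings are hypothetical, not from the thread, and a regex is still only a rough stand-in for a real HTML parser.

```perl
use strict;
use warnings;

# Return the text between the first <title>...</title> pair, or undef.
# m#...# avoids having to escape the slash in the closing tag;
# /i ignores case, /s lets "." match newlines so a title may span lines.
sub extract_title {
    my ($html) = @_;
    my ($title) = $html =~ m#<title[^>]*>\s*(.*?)\s*</title>#is;
    return $title;
}

# Hypothetical sample inputs, not taken from the original post.
my @samples = (
    "<TITLE>Main</TITLE>",
    "<title>Title of my HTML page</title>",
    "<TITLE>\nBlahblah\n</title>",
);

for my $html (@samples) {
    my $title = extract_title($html);
    print defined $title ? "TITLE = $title\n" : "no title found\n";
}
```

Using `.*?` instead of `\w+` is what allows spaces inside the title, and the surrounding `\s*` trims the newlines in the split-across-lines case.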
Re: Re: Finding data by Anonymous Monk on Feb 28, 2002 at 18:52 UTC
I am still lost on why this isn't working. Please advise more.
Re: Finding data by strat (Canon) on Feb 28, 2002 at 17:49 UTC
    sub wanted {
        if ( /\.html?$/ ) {    # if .htm or .html
            local $/;          # undefine the input record separator
            # File::Find has chdir'ed into the file's directory, so open $_
            unless ( open (HTML, '<', $_) ) {
                warn "Error couldn't read $File::Find::name: $!\n";
            } # unless
            else {
                my $line = <HTML>;    # slurp whole file
                close (HTML);
                my ($title) = $line =~ m|<TITLE>(.*?)</TITLE>|is;
                # do with $title whatever you want...
            } # else
        } # if .html?
    } # wanted
    ...

I haven't tested this code, but hope that it will work.

`/<TITLE>\w+</TITLE>/` will give a syntax error, because `/` is the separator and the `/` in `</T` is not escaped; and `\w+` will never match a space...

Best regards,
perl -le "s==F=e=>y~\*martinF~stronat~=>s~[^\w]~~g=>chop,print"
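For completeness, here is one way the slurp-and-match approach could be wired up to print the file name and title as the original question asked. It is a sketch only: the helper names `title_of` and `find_titles` and the column width are assumptions, it uses only core modules, and it relies on File::Find's default behaviour of changing into each directory so that `$_` can be opened directly.

```perl
use strict;
use warnings;
use File::Find;

# Slurp one file and return its title, or undef if none is found.
sub title_of {
    my ($path) = @_;
    open my $fh, '<', $path
        or do { warn "Couldn't read $path: $!\n"; return };
    local $/;                 # undef the input record separator: read whole file
    my $html = <$fh>;
    close $fh;
    return unless defined $html;
    my ($title) = $html =~ m|<title[^>]*>\s*(.*?)\s*</title>|is;
    return $title;
}

# Collect [file name, title] pairs for every .htm/.html file under $dir.
sub find_titles {
    my ($dir) = @_;
    my @found;
    find(sub {
        return unless -f $_ && /\.html?$/i;
        my $title = title_of($_);
        push @found, [ $_, $title ] if defined $title;
    }, $dir);
    return sort { $a->[0] cmp $b->[0] } @found;
}

# Print one "file name   title" row per page found under the current directory.
printf "%-20s %s\n", @$_ for find_titles('.');
```

Files without a title are skipped silently here; whether to report them instead is a design choice the thread does not settle.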