Re: Getting Words out of HTML :)

You can tackle this two ways if all you want/need is quick and dirty:

Suck the entire file into a string using an undefined $/ and then try to rip out the keywords, description, and content tags using reg exps before stripping whatever tags are left with s/<[^>]*>//g;. This way tends to be a little brutal and unrefined (and you'd certainly run into trouble with poorly coded HTML) but it does have its advantages if you can reliably predict where the meta tags will appear and how they are formatted.
Chew through the file line by line and spit out the meta tags when you find them. This has a couple of issues:
1. If the meta tags aren't at the top of the file, you can't put them ahead of text you've already printed; they can only go in at EOF (unless you're collecting everything else in a string and will print it out later).
2. You need to watch out for HTML tags that don't terminate on the same line as they were started (there are surprisingly many sites that break tags across lines). To do this you would need to keep track of whether a tag is 'open' or 'closed' when you reach the end of a line. At the beginning of the next line you check your 'open' variable and, if open, s/^[^>]*>//; to throw away the rest of the tag.
3. It does, however, have the advantage that the reg exps become substantially easier.
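
For what it's worth, here is a rough sketch of that second, line-by-line approach. The names strip_lines, $in_tag and $pending are made up for the example, and it assumes double-quoted attributes and no stray '>' inside attribute values or comments, so treat it as a starting point rather than anything robust.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: strips tags from a list of lines, remembering
# whether a tag was still open when the previous line ended.
# Returns (\@text_lines, \%meta).
sub strip_lines {
    my @lines   = @_;
    my $in_tag  = 0;     # true while we are inside an unterminated tag
    my $pending = '';    # the text of the tag collected so far
    my (@out, %meta);

    for my $line (@lines) {
        my $text = '';
        while (length $line) {
            if ($in_tag) {
                if ($line =~ s/^([^>]*)>//) {
                    # The tag closes on this line: inspect it, then drop it.
                    $pending .= $1;
                    $in_tag = 0;
                    if ($pending =~ /\bname\s*=\s*"(keywords|description)"/i) {
                        my $name = lc $1;
                        if ($pending =~ /\bcontent\s*=\s*"([^"]*)"/i) {
                            $meta{$name} = $1;
                        }
                    }
                    $pending = '';
                }
                else {
                    # Still open at end of line: carry on with the next line.
                    $pending .= $line;
                    $line = '';
                }
            }
            elsif ($line =~ s/^([^<]*)<//) {
                $text .= $1;    # keep the text in front of the tag
                $in_tag = 1;
            }
            else {
                $text .= $line; # no more tags on this line
                $line = '';
            }
        }
        push @out, $text;
    }
    return (\@out, \%meta);
}

# Small in-memory demo with tags broken across lines.
my @demo = (
    qq{<html><meta name="keywords"\n},
    qq{ content="perl, html"><b>Hi</b> there\n},
);
my ($text, $meta) = strip_lines(@demo);
print "$_: $meta->{$_}\n" for sort keys %$meta;
print join('', @$text);
```

Collecting the meta values in a hash and printing them before the text gets around issue 1, at the cost of holding the stripped text in memory until the end.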

I don't have any experience with HTML::Parse, so YMMV -- for more complicated HTML I'd go with HTML::Parse as they've probably thought through the issues much more clearly than I have. But for simple HTML and a limited number of files you might be able to just script it in less than 20 lines.

Hope this helps.

Below is a version that works -- it's not terribly sophisticated, but it will catch *most* tags and not mangle your page too badly.

#!/usr/local/perl5

use strict;

my $OUT = "/u/jreades/";

undef $/;    # slurp mode: <IN> returns the whole file as one string

while (my $file = shift) {
    open(IN, "<" . $file)
        or die("Couldn't open file (" . $file . ") to read: " . $!);
    # Note: every input file is written to the same test.html.
    open(OUT, ">" . $OUT . "test.html")
        or die("Couldn't open file (" . $OUT . "test.html) to write: " . $!);
    print STDOUT "Reading: " . $file . "\n";
    print STDOUT "Writing: " . $OUT . "test.html\n";

    my %meta;    # keywords and description, keyed by lowercased name

    my $text = <IN>;

    # Remove one tag at a time, inspecting each before it is thrown away.
    while ($text =~ s/<([^>]+)>//) {

        my $tag = $1;

        if ($tag =~ /NAME="(keywords|description)/i) {
            my $name = lc $1;

            my ($content) = $tag =~ /CONTENT="([^"]*)"/i;
            $meta{$name} = defined $content ? $content : "";
            print OUT $name . ": " . $meta{$name} . "\n\n";
        }
    }
    $text =~ s/\n{3,}/\n/g;    # collapse the runs of blank lines left behind
    print OUT $text;
    close IN;
    close OUT;
}

exit 0;

RE: Re: Getting Words out of HTML :) by jreades (Friar) on Aug 30, 2000 at 03:05 UTC
Of course, now I see you're doing something with dbs too, and my script definitely isn't powerful enough to handle those requirements. Oh well, maybe someone else will find it useful... :^P
RE: RE: Re: Getting Words out of HTML :) by jreades (Friar) on Aug 30, 2000 at 03:12 UTC
Uhhhhh, without, of course, the extra $tag declaration...