You can tackle this two ways if all you want/need is quick and dirty:

Suck the entire file into a string by undefining $/, then rip out the keywords and description meta tags (and their content attributes) using regexps before stripping everything else with s/<[^>]*>//g; (a greedy s/<.*>//g; would eat everything between the first < and the last > on a line). This way tends to be a little brutal and unrefined (and you'd certainly run into trouble with poorly coded HTML), but it does have its advantages if you can reliably predict where the meta tags will appear and how they are formatted.
Chew through the file line by line and spit out the meta tags when you find them. This has a couple of issues and one advantage:
1. If the meta tags aren't at the top of the file, you can't really insert them again until EOF (unless you're collecting everything else in a string to print out later).
2. You need to watch out for HTML tags that don't terminate on the line they started on (a surprising number of sites break tags across lines). To handle this, keep track of whether a tag is 'open' or 'closed' when you reach the end of a line. At the beginning of the next line, check your 'open' variable and, if it's set, s/^[^>]*>//;
3. It does, however, have the advantage that the regexps become substantially easier.
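
For what it's worth, here is a minimal sketch of the line-by-line approach, including the open/closed bookkeeping from point 2. Treat it as an illustration, not a finished tool: the names strip_tags_by_line and note_meta are made up for this example, and it assumes double-quoted attribute values and no stray < or > characters in the page text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: record keywords/description from one tag's text.
sub note_meta {
    my ($tag, $meta) = @_;
    return unless $tag =~ /\bname\s*=\s*"(keywords|description)"/i;
    my $name = lc $1;
    ($meta->{$name}) = $tag =~ /\bcontent\s*=\s*"([^"]*)"/i;
    return;
}

# Hypothetical helper: takes lines of HTML, returns (\%meta, $text).
sub strip_tags_by_line {
    my @lines = @_;
    my %meta;
    my $text   = '';
    my $in_tag = 0;     # true if a tag was still open at the end of the last line
    my $open   = '';    # what we have seen of that open tag so far

    for my $line (@lines) {
        chomp(my $rest = $line);

        if ($in_tag) {
            if ($rest =~ s/^([^>]*)>//) {
                # The tag left open on an earlier line closes here.
                note_meta("$open $1", \%meta);
                ($in_tag, $open) = (0, '');
            }
            else {
                # No '>' yet: the whole line is still inside the tag.
                $open .= " $rest";
                next;
            }
        }

        # Strip every tag that starts and ends on this line.
        while ($rest =~ s/<([^>]+)>//) {
            note_meta($1, \%meta);
        }

        # A '<' with no matching '>' means the tag continues on the next line.
        if ($rest =~ s/<([^>]*)$//) {
            ($in_tag, $open) = (1, $1);
        }

        $text .= "$rest\n";
    }
    return (\%meta, $text);
}

# Small demo with a meta tag broken across two lines.
my ($meta, $text) = strip_tags_by_line(
    qq{<html><head><meta name="keywords"\n},
    qq{   content="perl, html">\n},
    qq{<body><p>Hello <b>world</b></p></body></html>\n},
);
print "keywords: $meta->{keywords}\n";
print $text;
```

Because the partial tag is buffered in $open instead of being thrown away, a meta tag that is split across lines still gets picked up once its closing > turns up.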

I don't have any experience with HTML::Parse, so YMMV -- for more complicated HTML I'd go with HTML::Parse as they've probably thought through the issues much more clearly than I have. But for simple HTML and a limited number of files you might be able to just script it in less than 20 lines.

Hope this helps.

Below is a version that works -- it's not terribly sophisticated, but it will catch *most* tags and not mangle your page too badly.

#!/usr/local/perl5
use strict;
use warnings;

my $OUT = "/u/jreades/";

undef $/;    # slurp mode: <IN> now returns the whole file in one read

while (my $file = shift) {
    # Note: every input file is written to the same test.html
    open(IN, "<", $file)
        or die "Couldn't open file ($file) to read: $!";
    open(OUT, ">", $OUT . "test.html")
        or die "Couldn't open file (${OUT}test.html) to write: $!";
    print STDOUT "Reading: $file\n";
    print STDOUT "Writing: ${OUT}test.html\n";

    my %meta;    # keywords/description, keyed by lowercased name

    my $text = <IN>;

    # Remove tags one at a time, looking at each before it is discarded
    while ($text =~ s/<([^>]+)>//) {

        my $tag = $1;

        if ($tag =~ /NAME="(keywords|description)/i) {
            my $name = $1;

            my ($content) = $tag =~ /CONTENT="([^"]*)"/i;
            $content = '' unless defined $content;
            $meta{lc $name} = $content;
            print OUT "$name: $content\n\n";
        }
    }
    # Collapse the runs of blank lines left behind by the removed tags
    $text =~ s/\n{3,}/\n/g;
    print OUT $text;
    close IN;
    close OUT;
}

exit 0;

In reply to Re: Getting Words out of HTML :) by jreades
in thread Getting Words out of HTML :) by reyjrar
