Getting Words out of HTML :)

reyjrar has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Getting Words out of HTML :) by jreades (Friar) on Aug 30, 2000 at 03:02 UTC
You can tackle this two ways if all you want/need is quick and dirty:

1. Suck the entire file into a string using an undefined $/ and then pull out the keywords and description meta tags (and their content attributes) with regexps before stripping the remaining tags with s/<[^>]*>//g;. This way tends to be a little brutal and unrefined (and you'd certainly run into trouble with poorly coded HTML), but it does have its advantages if you can reliably predict where the meta tags will appear and how they are formatted.

2. Chew through the file line by line and spit out the meta tags when you find them. This has a couple of issues:

   - If the meta tags aren't at the top of the file, you can't really insert them again until EOF (unless you're assigning everything else to a string and will print it out later).
   - You need to watch out for HTML tags that don't terminate on the same line as they were started (there are surprisingly many sites that break tags across lines). To handle this you would need to keep track of whether a tag is 'open' or 'closed' when you reach the end of a line. At the beginning of the next line you check your 'open' variable and, if a tag is open, s/^[^>]*>//;

   It does, however, have the advantage that the regexps become substantially easier.

I don't have any experience with HTML::Parse, so YMMV. For more complicated HTML I'd go with HTML::Parse, as they've probably thought through the issues much more clearly than I have. But for simple HTML and a limited number of files you might be able to just script it in less than 20 lines.

Hope this helps.

Below is a version that works. It's not terribly sophisticated, but it will catch most tags and not mangle your page too badly.

    #!/usr/local/perl5

    $OUT = "/u/jreades/";
    undef $/;    # slurp mode: <IN> returns the whole file at once

    while (my $file = shift) {
        open(IN, "<" . $file)
            or die("Couldn't open file (" . $file . ") to read: " . $!);
        open(OUT, ">" . $OUT . "test.html")
            or die("Couldn't open file (" . $OUT . "test.html) to write: " . $!);

        print STDOUT "Reading: " . $file . "\n";
        print STDOUT "Writing: " . $OUT . "test.html\n";

        my (%meta, $tag);
        my $text = <IN>;

        # Remove tags one at a time, inspecting each one as it goes.
        while ($text =~ s/<([^>]+)>//) {
            my $tag = $1;
            if ($tag =~ /NAME="(keywords|description)/i) {
                my $name = lc $1;
                ($meta{$name}) = $tag =~ /CONTENT="([^"]*)"/i;
                print OUT $name . ": " . $meta{$name} . "\n\n";
            }
        }

        $text =~ s/\n{3,}/\n/g;    # collapse runs of blank lines
        print OUT $text;

        close IN;
        close OUT;
    }

    exit 0;
RE: Re: Getting Words out of HTML :) by jreades (Friar) on Aug 30, 2000 at 03:05 UTC
Of course, now I see you're doing something with dbs too, and my script definitely isn't powerful enough to handle those requirements. Oh well, maybe someone else will find it useful... :^P
RE: RE: Re: Getting Words out of HTML :) by jreades (Friar) on Aug 30, 2000 at 03:12 UTC
Uhhhhh, without, of course, the extra $tag declaration...
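
The line-by-line approach from jreades' first reply, with its 'open'/'closed' bookkeeping for tags that span lines, could be sketched roughly as follows. This is only an illustration under simple assumptions: the sub name strip_tags_by_line and the sample input are made up for the example, and it ignores comments, scripts, and any stray '>' in ordinary text.

```perl
use strict;
use warnings;

# Strip tags from a list of lines, remembering across lines whether a
# tag is still open. Returns the cleaned lines.
sub strip_tags_by_line {
    my @lines = @_;
    my $open  = 0;    # true while we are inside a tag that has not closed yet
    my @clean;

    for my $line (@lines) {
        if ($open) {
            # Drop everything up to and including the closing '>'.
            if ($line =~ s/^[^>]*>//) {
                $open = 0;
            }
            else {
                next;    # the whole line belongs to the open tag
            }
        }
        $line =~ s/<[^>]*>//g;         # tags that open and close on this line
        if ($line =~ s/<[^>]*$//) {    # tag opened here, closed on a later line
            $open = 1;
        }
        push @clean, $line;
    }
    return @clean;
}

# Hypothetical input: the <a> tag is broken across two lines.
my @html = (
    "<p>Hello <a\n",
    "   href=\"x.html\">world</a></p>\n",
);
print strip_tags_by_line(@html);
```

Because the unfinished tag is cut off at the end of the first line (newline included), the two input lines come out joined as a single line of text.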
Re: Getting Words out of HTML :) by merlyn (Sage) on Aug 30, 2000 at 00:59 UTC
Perhaps you could post what you have, and we could help by critiquing. As for existing solutions, I imagine you could punch Google up for some `HTML::Parse` examples. -- Randal L. Schwartz, Perl hacker
RE: Re: Getting Words out of HTML :) by reyjrar (Hermit) on Aug 30, 2000 at 02:25 UTC
So far I have this:

    my $parser = HTML::Parser->new(
        api_version => 3,
        handlers    => [
            start => [\&tag,     "self,tagname,attr"],
            end   => [\&tag_end, "self,tagname,attr"],
            # Read $WEIGHT when the text event fires. A literal '$WEIGHT' in
            # the argspec is interpolated once, when the parser is built.
            text  => [sub { text($WEIGHT, shift) }, "dtext"],
        ],
        marked_sections => 1,
    );
    $parser->parse($DATA);
    $parser->eof;    # flush any buffered text

Then my three subs:

    sub tag {
        my $self    = shift;
        my $tagname = shift;
        my $attr    = shift;    # hash reference of the tag's attributes
        my $stuff;

        if ($tagname eq "meta") {
            # Test each name explicitly: ("keywords" || "description")
            # evaluates to just "keywords".
            my $name = lc($attr->{'name'} || '');
            if ($name eq "keywords" || $name eq "description") {
                $stuff = $attr->{'content'};
                &text($WEIGHT, $stuff);
            }
        }
        elsif ($tagname eq "title") {
            $WEIGHT = "2";
        }
    }

    sub tag_end {
        my $self    = shift;
        my $tagname = shift;
        my $attr    = shift;

        if ($tagname eq "title") {
            $WEIGHT = "1";
        }
    }

&text($weight, $text) just breaks the text passed to it into words and throws each one into the database, incrementing its count by $weight. That function is working (for the most part). Any ideas?
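
The meta-tag check in a start handler like this is easy to get subtly wrong in Perl, so here is a small standalone sketch of the pitfall. The sub names naive_check and is_wanted_meta are invented for the illustration; the only assumption about HTML::Parser is that the "attr" argspec hands the handler a hash reference, which is why the lookups use $attr->{name}.

```perl
use strict;
use warnings;

my %wanted = map { $_ => 1 } qw(keywords description);

# ("keywords" || "description") evaluates to "keywords", because ||
# returns its first true operand. This only ever tests for "keywords".
sub naive_check {
    my ($attr) = @_;
    my $name = defined $attr->{name} ? $attr->{name} : '';
    return $name eq ("keywords" || "description") ? 1 : 0;
}

# Hash lookup against the set of wanted names, case-insensitively.
sub is_wanted_meta {
    my ($attr) = @_;
    my $name = defined $attr->{name} ? lc $attr->{name} : '';
    return $wanted{$name} ? 1 : 0;
}

for my $name (qw(keywords description author)) {
    my $attr = { name => $name, content => "..." };
    printf "%-12s naive=%d lookup=%d\n",
        $name, naive_check($attr), is_wanted_meta($attr);
}
```

The naive form accepts "keywords" but silently rejects "description", while the lookup accepts both and rejects "author".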
RE: RE: Re: Getting Words out of HTML :) by reyjrar (Hermit) on Aug 30, 2000 at 19:16 UTC
Now I think I might have a deeper understanding of the HTML::Parser module. I just needed a night to sleep and not stupefy myself with the company's "Perl For Programmer's" classes that I've been in for the past few days. (Generic equivalent of "Multiplication for PhDs of Mathematics".) So here's what I hacked out. I'm testing it now, but since I couldn't find too many good examples on Google, I thought I'd throw this up here and hope someone understands it better.

    use LWP::UserAgent;
    use HTTP::Request;
    use HTML::Parser;

    ###
    # Parse the URLs
    my $ua = LWP::UserAgent->new;
    $ua->agent("AgentName/0.1 " . $ua->agent);

    my $req = HTTP::Request->new(GET => $URL);
    $req->content_type('application/x-www-form-urlencoded');
    my $res = $ua->request($req);
    if ($res->is_success) {
        $DATA = $res->content;
    }

    $WEIGHT = 1;
    my $parser = HTML::Parser->new(
        api_version => 3,
        handlers    => [
            start => [\&tag,     "self,tagname,attr"],
            end   => [\&tag_end, "self,tagname,attr"],
            # Read $WEIGHT when the text event fires. A literal '$WEIGHT' in
            # the argspec is interpolated once, when the parser is built.
            text  => [sub { text($WEIGHT, shift) }, "dtext"],
        ],
        marked_sections => 1,
    );
    $parser->parse($DATA);
    $parser->eof;    # flush any buffered text

    ....

    # Parsing subroutines..
    sub tag {
        my $self    = shift;
        my $tagname = shift;
        my $attr    = shift;    # hash reference of the tag's attributes
        my $stuff;

        $inside{$tagname} += 1;
        if ($tagname eq "meta") {
            # Test each name explicitly: ("keywords" || "description")
            # evaluates to just "keywords".
            my $name = lc($attr->{'name'} || '');
            if ($name eq "keywords" || $name eq "description") {
                $stuff = $attr->{'content'};
                &text($WEIGHT, $stuff);
            }
        }
        elsif ($tagname eq "title") {
            $WEIGHT = "2";
        }
    }

    sub tag_end {
        my $self    = shift;
        my $tagname = shift;
        my $attr    = shift;

        $inside{$tagname} -= 1;
        if ($tagname eq "title") {
            $WEIGHT = "1";
        }
    }

    sub text {
        my $weight        = shift;
        my $text_to_parse = shift;
        # do whatever we want to do..
    }

This isn't done, but maybe it'll help someone who is looking around for this too :) Thanks for all your help and suggestions.

-brad..
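
For anyone wondering what the body of text() might look like, here is a minimal sketch of the weighted word count described earlier in the thread. It is not the poster's actual routine: an in-memory hash (%count, a name made up for this example) stands in for the database, and words are simply runs of \w characters folded to lower case.

```perl
use strict;
use warnings;

my %count;    # stand-in for the database table: word => weighted count

# Break the text into words and add $weight to each word's running count.
sub text {
    my ($weight, $text) = @_;
    return unless defined $text;
    for my $word ($text =~ /(\w+)/g) {
        $count{ lc $word } += $weight;
    }
    return;
}

# Hypothetical input: a title (weight 2) followed by body text (weight 1).
text(2, "Perl Monks");
text(1, "Getting words out of HTML with Perl");

print "$_: $count{$_}\n" for sort keys %count;
```

With this input "perl" ends up at 3 (2 from the title plus 1 from the body) and "monks" at 2. A real version would replace the hash update with an insert-or-increment against whatever database is in use.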