Parse... then what? (HTML Parsing problems)

Chady has asked for the wisdom of the Perl Monks concerning the following question:

OK, so English not being my main language has a lot to do with the fact that I first mistook "parse" for "paste", but that's another story...

I totally understand now what "parse" means, but when I tried to put it to work, it fell apart.

I was trying to match some tags from inside an HTML document, so I thought, Ok, what's better than HTML::...s, looked at HTML::Parser, and here is something from the perldoc:

The `HTML::Parser' will tokenize an HTML document when the parse() method is called by invoking various callback methods. The document to be parsed can be supplied in arbitrary chunks.

So? What do I do next? I know that it has now parsed the document and understood it... but what do I get?

Let's say I want to fetch a remote HTML file using LWP::Simple, search the file for the occurrence of a certain criterion, <h3>foo</h3>, and then get everything after it that sits between <EM> tags until I reach an <HR>. I don't think it's hard to do, but I'm not familiar with the way to go about it, because I don't understand what to do with the parser.

Anyone ready to give me a bit of explanation on that?

--Chady

He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

Chady | http://chady.net/


(ichimunki) Re: Parse... then what? (HTML Parsing problems) by ichimunki (Priest) on Aug 18, 2001 at 15:51 UTC
I think that for many HTML tasks HTML::TokeParser is a good module; it is a slightly simpler interface to HTML::Parser. When you parse with HTML::Parser you have to have all of your handlers and logic set up before you can call parse(). HTML::TokeParser instead turns the HTML document into a queue of tokens that you can pull off with get_token() and push back with unget_token() as needed. The tokens correspond to the elements of your HTML page: start tags, end tags, text, and so on. Because your program's logic is based on just a few criteria happening in a fixed order, this is a perfect fit. Here is some pseudocode that would use HTML::TokeParser:

    # Get HTML document
    # HTML::TokeParser->new( HTML )
    # get first Token
    # until ( TokenTag eq 'h3' and TokenText eq 'foo' )
    #     get next Token
    #     if EndOfHTML then exit
    # until ( TokenTag eq 'em' )
    #     get next Token
    # declare EM container
    # until ( TokenTag eq 'hr' )
    #     add TokenText to EM container
    #     get next Token
    # use EM container

As you can see, this logic is a lot simpler than trying to set up handlers for each of the tags you care about and then managing state between them, which is about what you have to do with HTML::Parser.
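The pseudocode above can be turned into something runnable roughly as follows. This is only a minimal sketch of one reading of it: it uses HTML::TokeParser's get_tag() and get_trimmed_text() helpers instead of walking raw tokens, and it collects the text of each <EM> separately. The sub name em_after_heading and the sample document are made up for illustration; in the real task the string would come from LWP::Simple's get($url).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page standing in for the fetched document.
# With LWP::Simple this would be: my $html = get($url);
my $html = <<'HTML';
<html><body>
<h3>bar</h3><p><em>skipped</em></p>
<h3>foo</h3>
<p><em>first</em> plain text <EM>second</EM></p>
<hr>
<p><em>after the rule</em></p>
</body></html>
HTML

# Return the text of every <em> that follows <h3>$heading</h3>,
# stopping at the next <hr> (or at the end of the document).
sub em_after_heading {
    my ($doc, $heading) = @_;

    # A reference to a scalar is parsed as the literal document.
    my $p = HTML::TokeParser->new(\$doc) or die "Can't parse document";

    # Step 1: advance to the <h3> whose text matches.
    my $found = 0;
    while ($p->get_tag('h3')) {
        if ($p->get_trimmed_text('/h3') eq $heading) {
            $found = 1;
            last;
        }
    }
    return () unless $found;

    # Step 2: collect <em> contents until an <hr> start tag shows up.
    # Tag names come back lowercased, so <EM> and <em> both match 'em'.
    my @em;
    while (my $tag = $p->get_tag('em', 'hr')) {
        last if $tag->[0] eq 'hr';
        push @em, $p->get_trimmed_text('/em');
    }
    return @em;
}

my @em = em_after_heading($html, 'foo');
print "$_\n" for @em;
```

Because the parser keeps its position between calls, each step simply continues from where the previous one stopped, which is why no explicit state flags are needed here.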
Re: Parse... then what? (HTML Parsing problems) by THRAK (Monk) on Aug 20, 2001 at 16:27 UTC
Chady, I'm with ichimunki on this one: use HTML::TokeParser. Here's a basic working snippet of code based on a parser I'm working on. This may be of help to you:

    #!/usr/local/bin/perl -w
    use strict;
    use HTML::TokeParser;

    ### Variables ###
    my $file_in = 'test.html';

    ### Parse HTML ###
    my $p = HTML::TokeParser->new($file_in) || die "Can't open: $!";

    # Each token is an array ref whose first element is the token type.
    while (my $token = $p->get_token) {
        my $token_type = $token->[0];
        start($token->[1], $token->[4]) if $token_type eq 'S';  # Start Tag
        end($token->[1], $token->[2])   if $token_type eq 'E';  # End Tag
        text($token->[1])               if $token_type eq 'T';  # Text
        comment($token->[1])            if $token_type eq 'C';  # Comment
        declaration($token->[1])        if $token_type eq 'D';  # Declaration
    }

    ### DTD's ###
    sub declaration {
        my ($declaration) = @_;
        print "DEC: $declaration\n";
    }

    ### Comments ###
    sub comment {
        my ($comment) = @_;
        print "CMT: $comment\n";
    }

    ### Text Entities ###
    sub text {
        my ($text) = @_;
        return if $text =~ /^\s+$/;   # skip blank lines
        $text =~ s/\s+/ /g;           # collapse whitespace (newlines too), so
                                      # text split across lines stays together
        print "TEXT: $text\n";
    }

    ### Start Tags ###
    sub start {
        my ($tag, $origtext) = @_;
        chomp $origtext;
        print "ST: $tag = $origtext\n";
    }

    ### End Tags ###
    sub end {
        my ($tag, $origtext) = @_;
        chomp $origtext;
        print "ET: $tag = $origtext\n";
    }

You'll need to add whatever logic it takes to grab the tags you need, either in the parsing `while` loop or in one of the subroutines.

-THRAK
www.polarlava.com
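For contrast with the token loop above, here is roughly what the handler-and-state approach looks like with HTML::Parser itself, which is what the perldoc excerpt in the question describes. This is a hedged sketch, not a recommended implementation: it assumes the version 3 handler API (api_version => 3 with start_h, end_h and text_h), and the handler names, the %state keys and the sample chunks are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Each callback sees only one event, so anything that spans several
# events has to be remembered here between calls.
my %state = (in_h3 => 0, h3_text => '', active => 0, in_em => 0, done => 0);
my @em;

sub on_start {
    my ($tag) = @_;
    return if $state{done};
    if    ($tag eq 'h3') { $state{in_h3} = 1; $state{h3_text} = ''; }
    elsif ($tag eq 'hr') { $state{done} = 1 if $state{active}; }
    elsif ($tag eq 'em' && $state{active}) { $state{in_em} = 1; push @em, ''; }
}

sub on_end {
    my ($tag) = @_;
    if ($tag eq 'h3' && $state{in_h3}) {
        $state{in_h3}  = 0;
        $state{active} = 1 if $state{h3_text} eq 'foo';
    }
    elsif ($tag eq 'em') {
        $state{in_em} = 0;
    }
}

sub on_text {
    my ($text) = @_;
    return if $state{done};
    # Text can arrive in several pieces, so always append.
    $state{h3_text} .= $text if $state{in_h3};
    $em[-1]         .= $text if $state{in_em};
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&on_start, 'tagname'],  # lowercased element name
    end_h       => [\&on_end,   'tagname'],
    text_h      => [\&on_text,  'dtext'],    # text with entities decoded
);

# The document can be fed in arbitrary chunks, even ones that split a tag.
$p->parse($_) for '<h3>foo</h3><p><e', 'm>one</em> x <EM>two</E',
                  'M></p><hr><em>late</em>';
$p->eof;    # signal end of document so buffered input is flushed

print "$_\n" for @em;
```

So the answer to "what do I get?" with HTML::Parser is: nothing is returned as a parsed result; your handlers are called as the parser walks the document, and whatever you want to keep has to be stored by them.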