html parsing

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to parse an HTML file for input and textarea tags and retrieve the name of each field. The problem is that sometimes the tag and the name are on different lines, so I need to scan more than one line at a time. Reading two lines at a time worked until the pairs fell out of step and a match got skipped: it reads 1 2, then 3 4, when I want it to backtrack and read 1 2, 2 3, 3 4. How can I read one line and get the next line, then on the next pass read the second line and the third, so that there is an overlap?

Replies are listed 'Best First'.
Re: html parsing by thelenm (Vicar) on Aug 28, 2002 at 18:58 UTC
If you'd really like to do this robustly, you should look into one of the HTML modules, such as HTML::Parser or HTML::TokeParser. There is also an HTML::TokeParser Tutorial available on PerlMonks.
-- Mike `-- just,my${.02}`
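A minimal sketch of the HTML::TokeParser route, for illustration only. The sample form, the variable `$html`, and the helper `field_names` are made up for this example; HTML::TokeParser itself comes from the HTML-Parser distribution on CPAN. Because the parser works on tokens rather than lines, a tag whose `name` attribute sits on a later line is handled like any other.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample document; note the tags split across lines.
my $html = <<'HTML';
<form action="/submit">
  <input type="text"
         name="username">
  <textarea
     rows="4"
     name="comments"></textarea>
  <input type="submit" value="Go">
</form>
HTML

# Return the name attribute of every <input> and <textarea> start tag.
# (field_names is an illustrative helper, not part of any module.)
sub field_names {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new(\$doc)
        or die "Cannot create parser";
    my @names;
    # get_tag skips ahead to the next start tag with one of these names
    # and returns [ $tag, \%attr, \@attrseq, $text ].
    while (my $tag = $p->get_tag('input', 'textarea')) {
        my $attr = $tag->[1];
        push @names, $attr->{name} if defined $attr->{name};
    }
    return @names;
}

print "$_\n" for field_names($html);
```

The submit button in the sample has no `name` attribute, so it is skipped rather than reported as an empty field.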
Re: html parsing by Aristotle (Chancellor) on Aug 28, 2002 at 19:19 UTC
To chime in with the others: you shouldn't parse HTML by hand. There's so much possible variation that you can't get it right in any reasonable timeframe with simple methods (like a couple of regexen). A tag spanning multiple lines is just one of a myriad of cases. You may soon run across a case where the name you're looking for is two lines down from the start of the tag; then you'll have to readjust everything all over again. Save yourself the headache of writing a parser, and your users the headache of working with a simpleminded one, and just use the appropriate modules.

To answer your question, though: it's simply a matter of keeping a copy of the current line around for the next pass while going through the file line by line.

    my $buffer = <FH>;          # prime the buffer with the first line
    my $line;
    while (<FH>) {
        $buffer .= $line = $_;  # buffer now holds the previous and current lines
        # ..
        # .. (match against $buffer here)
        $buffer = $line;        # carry the current line over to the next pass
    }

Makeshifts last the longest.
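For anyone who wants to try the overlap on its own, here is a self-contained sketch of the same idea. It assumes a lexical filehandle instead of the bareword `FH`, and the helper `overlapping_pairs` and the in-memory sample text are invented for the demonstration.

```perl
use strict;
use warnings;

# Collect every two-line window from a filehandle: lines 1+2, 2+3, 3+4, ...
# (overlapping_pairs is an illustrative helper name.)
sub overlapping_pairs {
    my ($fh) = @_;
    my @pairs;
    my $prev = <$fh>;                 # prime with the first line
    return @pairs unless defined $prev;
    while (my $line = <$fh>) {
        push @pairs, $prev . $line;   # previous line plus current line
        $prev = $line;                # current line becomes the previous one
    }
    return @pairs;
}

# Hypothetical four-line input, read through an in-memory filehandle.
my $text = "1\n2\n3\n4\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @pairs = overlapping_pairs($fh);
close $fh;

for my $pair (@pairs) {
    (my $shown = $pair) =~ tr/\n/ /;  # flatten newlines for display
    print "window: $shown\n";
}
```

With four input lines this yields three windows, covering lines 1 and 2, then 2 and 3, then 3 and 4. A one-line file produces no windows at all, so that case would need separate handling.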
Re: html parsing by Zaxo (Archbishop) on Aug 28, 2002 at 19:00 UTC
HTML::Parser is your friend. After Compline, Zaxo
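As a rough illustration of what that might look like, here is a sketch using HTML::Parser's version 3 event API. The sample document and the helper `field_names` are assumptions made for this example, not anything from the original question.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample document with attributes on separate lines.
my $html = <<'HTML';
<form>
  <input type="text"
         name="email">
  <textarea
     name="message"></textarea>
</form>
HTML

# Collect name attributes from <input> and <textarea> start tags.
# (field_names is an illustrative helper, not part of HTML::Parser.)
sub field_names {
    my ($doc) = @_;
    my @names;
    my $p = HTML::Parser->new(
        api_version => 3,
        # start_h fires for each start tag; the argspec string selects
        # which values are passed to the handler.
        start_h => [
            sub {
                my ($tagname, $attr) = @_;
                return unless $tagname eq 'input' || $tagname eq 'textarea';
                push @names, $attr->{name} if defined $attr->{name};
            },
            'tagname, attr',
        ],
    );
    $p->parse($doc);
    $p->eof;    # signal end of document so pending text is flushed
    return @names;
}

print "$_\n" for field_names($html);
```

The event-driven style suits cases where several kinds of tags need handling in one pass; for simply pulling out a few tags, the HTML::TokeParser interface is usually shorter.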