theguvnor has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to use HTML::TokeParser to parse the input elements in an HTML form. Eventually I want to be able to add my own non-standard attributes to control some other features but that's not important at this time.

The following code looks for input, textarea, and select tags. What I'm having trouble with is the loop that is executed if the tag found is a select, to get the option nodes immediately following. It seems to exit the main while() loop as soon as it finishes the inner loop. It doesn't seem to get the final input form element (named text2)!!

Anyone who can tell me why it silently exits the main loop would be appreciated.

I hope this is enough to illustrate:

Code:

    use strict;
    use warnings;
    use diagnostics;
    use HTML::TokeParser;
    use Data::Dumper;

    # variables
    my %rules;
    my $count;
    my $DEBUG = 1;

    # get filename of HTML form from commandline and open filehandle to it
    my $formfile = shift or die "Usage: perl $0 filename\n";
    open (my $fh, '<', $formfile) or die "Trouble opening your file!\n$!";

    my $parser = HTML::TokeParser->new($fh);

    while (my $token = $parser->get_tag(qw(input textarea select))) {
        $count++;
        my $tag = $token->[0];
        my $type = $token->[1]{'type'};# or warn "$count-th Token ($tag tag) has no type!";
        my $name = $token->[1]{'name'} or warn "$count-th Token ($tag tag) has no name!";
        my $value = $token->[1]{'value'};
        my $maxlength = $token->[1]{'maxlength'};
        my $required = $token->[1]{'required'}; # non-w3c attribute
        my $allowed;
        if ($tag =~ m/select/i) {
            while (my $option = $parser->get_tag('option')) {
                push @{$allowed}, $option->[1]{'value'};
            }
        } else {
            $allowed = [ $token->[1]{'allowed'} ]; # non-w3c attribute
        }
        $DEBUG && print "$count\t$tag\t$name\t$type\n";
        # $rules{ $name } = sub { print $name, $/; return; };
        $rules{$name} = {
            'name'      => $name,
            'type'      => $type,
            'value'     => $value,
            'allowed'   => $allowed,
            'required'  => $required,
            'maxlength' => $maxlength
        };
    }
    close $fh;

    if ($DEBUG and %rules) {
        open OUT, '>', "$formfile.rules.txt" or die $!;
        print OUT Dumper(\%rules);
        close OUT;
    }
    exit;

And here is the small sample html form file I created to illustrate the problem:

    <html><head></head><body><form action="/">
    <P>text 1: <INPUT name=text1></P>
    <P>textarea: <TEXTAREA name=textarea1 cols=30></TEXTAREA></P>
    <P><INPUT type=radio value=radio1option1 name=radio1>&nbsp;radio1 option 1
    <INPUT type=radio value=radio1option2 name=radio1>&nbsp;radio1 option 2</P>
    <P><INPUT type=checkbox value=check1option1 name=check1>check 1 option 1
    <INPUT type=checkbox value=check1option2 name=check1>check 1 option 2</P>
    <P>list box:</P>
    <P><SELECT size=3 name=list1>
    <OPTION value=1>list1 option1</OPTION>
    <OPTION value=2>list1 option2</OPTION>
    <OPTION value=3>list1 option3</OPTION></SELECT> </P>
    <P>text 2: <INPUT maxLength=30 size=30 name=text2></P></FORM></body></html>

Thanks!

[Jon]

Replies are listed 'Best First'.
Re: A question about HTML::TokeParser
by Jenda (Abbot) on Oct 12, 2003 at 18:58 UTC

    I've never used HTML::TokeParser myself, but I think I know what the problem is. The $parser->get_tag('option') in the inner loop doesn't care about the </select>; it tries to give you every <option> it can find in the rest of the file. When it at last returns undef, HTML::TokeParser's "cursor" is at the end of the HTML, so there are no more tags for the outer loop to find.

    I believe you'll have to do it differently. I think you'll have to have just one loop looking for any <input>, <textarea>, <select> or <option>, remember the name of the last seen <select>, and append any <option> you find to that <select>.
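A minimal sketch of that single-loop idea, not tested against the original form file. The sample markup is cut down from the form in the question, and collect_rules() and the shape of the returned hash are made-up names for illustration; only get_tag() comes from HTML::TokeParser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup, cut down from the form in the question.
my $html = <<'HTML';
<form action="/">
<INPUT name=text1>
<SELECT size=3 name=list1>
<OPTION value=1>one</OPTION>
<OPTION value=2>two</OPTION>
</SELECT>
<INPUT maxLength=30 name=text2>
</form>
HTML

# collect_rules() is an illustrative helper, not part of HTML::TokeParser.
sub collect_rules {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new(\$doc);
    my (%rules, $current_select);

    # One loop only, so no inner loop can run the parser to end of file.
    while (my $token = $parser->get_tag(qw(input textarea select option))) {
        my ($tag, $attr) = @$token;

        if ($tag eq 'option') {
            # Attach the option to the most recently seen <select>, if any.
            next unless defined $current_select;
            push @{ $rules{$current_select}{allowed} }, $attr->{value};
            next;
        }

        # Any other form element ends the "current select".
        undef $current_select;
        my $name = $attr->{name};
        next unless defined $name;
        $rules{$name} = { name => $name, tag => $tag, allowed => [] };
        $current_select = $name if $tag eq 'select';
    }
    return \%rules;
}

my $rules = collect_rules($html);
for my $name (sort keys %$rules) {
    print "$name\t$rules->{$name}{tag}\t@{ $rules->{$name}{allowed} }\n";
}
```

Because nothing ever scans ahead for options on its own, text2 is still reached after the select.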

    HTH, Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

Re: A question about HTML::TokeParser
by PodMaster (Abbot) on Oct 13, 2003 at 01:49 UTC
    I suggest you look into a wheel called HTML::Form.
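For illustration, a rough sketch of what the HTML::Form route might look like. The markup is a cut-down, hypothetical version of the form in the question, and the base URI is a placeholder that parse() needs in order to resolve the form's action. It assumes HTML::Form is installed; it ships separately from HTML::Parser.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical sample markup, cut down from the form in the question.
my $html = <<'HTML';
<form action="/">
<INPUT name=text1>
<SELECT size=3 name=list1>
<OPTION value=1>one</OPTION>
<OPTION value=2>two</OPTION>
<OPTION value=3>three</OPTION>
</SELECT>
<INPUT maxLength=30 name=text2>
</form>
HTML

# Placeholder base URI; parse() returns one object per <form> found.
my ($form) = HTML::Form->parse($html, 'http://localhost/');

my %allowed;
for my $input ($form->inputs) {
    my $name = $input->name;
    next unless defined $name;

    # possible_values() can include undef (the "off" state of some
    # inputs), so keep only the defined values.
    $allowed{$name} = [ grep { defined } $input->possible_values ];
    printf "%s\t%s\t%s\n", $name, $input->type, join(',', @{ $allowed{$name} });
}
```

The module groups the options of a select under a single input object, so there is no inner loop to get wrong. Custom non-standard attributes such as required or allowed are a separate matter and are not covered by this sketch.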

    The reason text2 is not being displayed is because you read until the end of file looking for option tags in the inner loop (logic flaw), example:

        use HTML::TokeParser;
        my $p = HTML::TokeParser->new(\q[ <bold> <body> ]);
        use Data::Dumper;
        while(defined(my $t = $p->get_tag('bold'))){
            print Dumper($t);
        }
        my $t = $p->get_token() ;
        print "no more tokens, see " . ( defined $t ? Dumper($t) : "undef" );
        __END__
        $VAR1 = [
                  'bold',
                  {},
                  [],
                  '<bold>'
                ];
        no more tokens, see undef

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: A question about HTML::TokeParser
by Roger (Parson) on Oct 13, 2003 at 02:29 UTC
    The source of your error has been identified by others, i.e., the token parser is not scoped/recursive. Your inner loop looking for options consumed the rest of the document, which had the side effect of making the outer loop exit prematurely.
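One way to give the inner loop somewhere to stop, as a minimal sketch on hypothetical cut-down markup: get_tag() accepts several tag names, and an end tag can be requested as '/select'. This keeps the original get_tag structure and only bounds the inner loop.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup: a select followed by one more input.
my $html = <<'HTML';
<SELECT name=list1>
<OPTION value=1>one</OPTION>
<OPTION value=2>two</OPTION>
</SELECT>
<INPUT name=text2>
HTML

my $parser = HTML::TokeParser->new(\$html);
my (@seen, %allowed);

while (my $token = $parser->get_tag(qw(input textarea select))) {
    my ($tag, $attr) = @$token;
    push @seen, $attr->{name};

    next unless $tag eq 'select';

    # Ask for the end tag as well, so the scan stops at </select>
    # instead of running on to the end of the document.
    while (my $option = $parser->get_tag('option', '/select')) {
        last if $option->[0] eq '/select';
        push @{ $allowed{ $attr->{name} } }, $option->[1]{value};
    }
}

print "seen: @seen\n";
print "list1 allows: @{ $allowed{list1} }\n";
```

With this markup the outer loop goes on to report text2 after the select, which is the element the original code never reached.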

    Being the HTML Token Parser, it's good at parsing tokens. :-) You could rewrite your loop using get_token instead of get_tag.

        ...
        while (my $token = $parser->get_token)
        {
            next unless $token->[1] =~ /(?:select|input|textarea)/;
            if ($token->[0] eq 'S') # start tag
            {
                $count++;
                my $tag = $token->[1];
                my $name = $token->[2]{name}; # fetch name of input
                my $value = $token->[2]{value};
                my $maxlength = $token->[2]{maxlength};
                my $required = $token->[2]{required};
                my $allowed;
                if ($tag eq 'select') {
                    while (my $option = $parser->get_token) {
                        last if $option->[0] eq 'E' && $option->[1] eq 'select';
                        next unless $option->[0] eq 'S' && $option->[1] eq 'option';
                        push @{$allowed}, $option->[2]{value};
                    }
                } else {
                    $allowed = [ $token->[2]{allowed} ];
                }
                $DEBUG && print "$count\t$tag\t$name\t\n";
                if ($tag eq 'select') { print Dumper($allowed); }
            }
            ...
        }
    And the debug output shows:
        1       input   text1
        2       textarea        textarea1
        3       input   radio1
        4       input   radio1
        5       input   check1
        6       input   check1
        7       select  list1
        $VAR1 = [
                  '1',
                  '2',
                  '3'
                ];
        8       input   text2

      First, thanks to everyone who responded. Secondly, apologies for popping up to ask a question and then not responding sooner; I had only intermittent access over the Canadian Thanksgiving weekend. Thirdly, an extra ++ to Roger for providing a working rewrite. After seeing the first couple of responses, I had begun to think (again, I had only limited access to actually play on the weekend) about how I could maybe use the get_token method, but wasn't sure. You provided some proof.

      Thanks again to everyone!

      [Jon]

Re: A question about HTML::TokeParser
by pg (Canon) on Oct 12, 2003 at 18:02 UTC
    For what you are doing, a better way might be to use HTTP::Request, and to borrow some code from HTTP::Daemon to see how it creates an HTTP::Request object based on what is received.