jqcoffey has asked for the wisdom of the Perl Monks concerning the following question:

Bonjour Monks,

At the risk of being flamed, I am posting a follow-up question to an earlier thread, found here.

After many iterations, and taking the advice I was offered, I ended up with what I think is a pretty solid piece of code.

The intent of this test script is to take a fully (mal)formed HTML document and tag each word (that is, non-tag text) with its starting byte position. Each word and its position are pushed onto an array of hash refs for later use in a spell checker with a JavaScript UI. For full details on the final goal of the project you can read this node.
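To make that data structure concrete, here is a minimal sketch (separate from the test script) of how an array of word/offset hash refs could drive replacements. The sample HTML, the `replacement` key, and the `apply_fixes` helper are all hypothetical, and the offsets were counted by hand for this one string. It works from the highest offset down, on the assumption that several words may be replaced in one pass and earlier offsets must stay valid.

```perl
use strict;
use warnings;

# Hypothetical sample document; offsets below were counted by hand.
my $html = '<p>Bad speling is embarasing.</p>';

# Same shape as the word list described above, plus an assumed
# 'replacement' key that a spell checker UI might send back.
my @fixes = (
    { word => 'speling',    start => 7,  replacement => 'spelling' },
    { word => 'embarasing', start => 18, replacement => 'embarrassing' },
);

sub apply_fixes {
    my ($text, @fixes) = @_;

    # Replace from the end of the string backwards, so that a length
    # change never shifts the offset of a word still to be replaced.
    for my $fix (sort { $b->{start} <=> $a->{start} } @fixes) {
        my $len   = length $fix->{word};
        my $found = substr($text, $fix->{start}, $len);

        # Refuse to edit if the recorded offset no longer points at the word.
        die "Offset mismatch: expected '$fix->{word}', found '$found'\n"
            unless defined $found && $found eq $fix->{word};

        # Four-argument substr replaces the span in place.
        substr($text, $fix->{start}, $len, $fix->{replacement});
    }
    return $text;
}

print apply_fixes($html, @fixes), "\n";
```

Note that `length` and `substr` count characters, which equal bytes only while the HTML is held as an undecoded byte string, as it is when read from a file without an encoding layer.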

At this point I think the code is working fairly well, but I would appreciate a bit of peer review.

Further, I'd also like to know if this is something worthwhile for the rest of the Perl development community, and if I should work on actually subclassing HTML::TokeParser and offering up my first CPAN module. I'll be modularizing this code for our own purposes anyway, and I wouldn't mind giving something back to the Perl community.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser;
    use Data::Dumper;

    my $html_file = './test.html';
    my $html = '';
    open(F,"<$html_file");
    while (<F>) {
        $html .= $_;
    }
    close(F);

    my $word_to_repl = $ARGV[0] || 0;
    chomp $word_to_repl;

    my $p = HTML::TokeParser->new( \$html );

    # setup text position info for TokeParser. The char is
    # the token type and the int is the position in the resulting
    # array of the unmanipulated text--which is what we want to
    # inspect.
    my $text_pos = {'S' => 4, 'E' => 2, 'T' => 1, 'C' => 1, 'D' => 1, 'PI' => 2 };
    my $base_count = 0;
    my @word_list = ();

    while (my $token = $p->get_token) {
        my $token_type = $token->[0] || '';
        my $token_pos = $text_pos->{$token_type} || '';

        # die hard if we have any sort of parsing error, as everything
        # is likely screwed as a result, anyway.
        if (!$token_type || !$token_pos) {
            print "Ouch.. parsing error!\n";
            exit 0;
        }

        if ($token_type eq 'T') {
            # got text, run a regex with positional counts
            my $text = $token->[$token_pos];

            # regex grabs all words out of $text. It *also* grabs HTML &nnnn; type
            # special chars complete with the & and ; so I can skip them. The
            # "\w+\'?\w+" bit allows me to grab contracted words (eg don't), but causes
            # a failure in finding single letter words ("I" and "a").
            while ($text =~ m/(\&?\b\w+\'?\w+?\b\;?)/g) {
                # skip if this is a &nnnn; style HTML char
                if ($1 !~ /^\&/) {
                    # start byte is the summation of base_count and where
                    # this regex started off.
                    my $start = $base_count + $-[0];
                    push @word_list, { word => $1, start => $start };
                }
            }
        }

        # increment base_count with the length of this segment
        $base_count += length($token->[$token_pos]);
    }

    print "Original HTML:\n";
    print "----------------------------------\n";
    print "$html\n\n";

    my $word_href = $word_list[$word_to_repl];
    my $start = $word_href->{start};
    my $word = $word_href->{word};
    my $offset = length($word);

    print "Replacing [$word] at ($start,$offset)\n\n";
    substr($html,$start,$offset,'POOP');

    print "New HTML:\n";
    print "----------------------------------\n";
    print "$html\n\n";

This test script, as written, expects an HTML file named test.html in the current working directory. It also accepts an integer argument: the zero-based index into the word list of the word to replace (default 0).
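On the single-letter-word limitation mentioned in the code comments, one possible alternative is sketched below. It is not the regex the script uses: the `find_words` helper and its `$base` parameter are hypothetical names, and the pattern assumes that entities look like `&name;` or `&#nnn;` and that a word is a run of `\w` characters with optional internal apostrophes. Matching entities in their own alternation branch means they can be skipped without capturing a trailing semicolon on ordinary words.

```perl
use strict;
use warnings;

# Return a list of { word, start } hash refs for one text segment.
# $base is the offset of the segment within the whole document
# (the role played by $base_count in the script above).
sub find_words {
    my ($text, $base) = @_;
    my @words;

    # Branch 1 ($1): an HTML entity such as &amp; or &#233;
    # Branch 2 ($2): a word, optionally with internal apostrophes (don't)
    while ($text =~ /(&#?\w+;)|(\w+(?:'\w+)*)/g) {
        next if defined $1;    # skip entities entirely

        # $-[2] is where capture group 2 started within $text
        push @words, { word => $2, start => $base + $-[2] };
    }
    return @words;
}

# Usage: words from a segment that starts 10 characters into the document
for my $w (find_words("I don't like &amp; a cat;", 10)) {
    printf "%-6s %d\n", $w->{word}, $w->{start};
}
```

Because `\w` also matches digits and underscores, bare numbers would still be reported as words; whether that matters depends on what the spell checker should see.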

Thanks,
Justin