Bonjour Monks,

At the risk of being flamed, I am posting a follow-up question to an earlier thread, found here.

After many iterations, and taking the advice I was offered, I ended up with what I think is a pretty solid piece of code.

The intent of this test script is to take a fully (mal)formed HTML document and tag each word (anything that is not part of a tag) with its starting byte position. The results get shoved into an array of hash refs for later use in a spell checker with a JavaScript UI. For full details on the final goal of the project you can read this node.
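To make the target data structure concrete, here is a minimal sketch on a plain string with no markup. The sample text and the simplified `\w+` pattern are assumptions for illustration only; the full script handles real HTML tokens and a running base offset.

```perl
use strict;
use warnings;

# Hypothetical sample text; real input would be an HTML document.
my $text = "The quick brown fox";

my @word_list;
while ($text =~ /(\w+)/g) {
    # $-[0] holds the offset at which the last successful match began.
    push @word_list, { word => $1, start => $-[0] };
}

# Each entry pairs a word with its starting offset,
# for example { word => 'quick', start => 4 }.
printf "%-6s %d\n", $_->{word}, $_->{start} for @word_list;
```

Because each entry carries its own offset, a four-argument `substr` can later replace any one word in place without searching the text again.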

At this point, I think, the code is working fairly well, but I would appreciate a bit of peer review.

Further, I'd also like to know if this is something worthwhile for the rest of the Perl development community, and if I should work on actually subclassing HTML::TokeParser and offering up my first CPAN module. I'll be modularizing this code for our own purposes anyway, and I wouldn't mind giving something back to the Perl community.

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TokeParser;
use Data::Dumper;


my $html_file = './test.html';

my $html = '';
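# three-arg open with a lexical handle; bail out if the file is unreadable
open(my $fh, '<', $html_file) or die "Cannot open $html_file: $!\n";
while (<$fh>) {
        $html .= $_;
}
close($fh);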

my $word_to_repl = $ARGV[0] || 0;
chomp $word_to_repl;

my $p = HTML::TokeParser->new( \$html );

# setup text position info for TokeParser.  The char is
# the token type and the int is the position in the resulting
# array of the unmanipulated text--which is what we want to
# inspect.
my $text_pos = {'S'     => 4,
                'E'     => 2,
                'T'     => 1,
                'C'     => 1,
                'D'     => 1,
                'PI'    => 2 };

my $base_count = 0;
my @word_list = ();
while (my $token = $p->get_token) {
        my $token_type = $token->[0] || '';
        my $token_pos  = $text_pos->{$token_type} || '';

        # die hard if we have any sort of parsing error, as everything
        # is likely screwed as a result, anyway.
        if (!$token_type || !$token_pos) {
                die "Ouch.. parsing error!\n";
        }

        if ($token_type eq 'T') {
                # got text, run a regex with positional counts
                my $text = $token->[$token_pos];

                # regex grabs all words out of $text.  It *also* grabs HTML &nnnn; type
                # special chars complete with the & and ; so I can skip them.  The
                # "\w+\'?\w+" bit allows me to grab contracted words (eg don't), but causes
                # a failure in finding single letter words ("I" and "a").
                while ($text =~ m/(\&?\b\w+\'?\w+?\b\;?)/g) {
                        # skip if this is a &nnnn; style HTML char
                        if ($1 !~ /^\&/) {
                                # start byte is the summation of base_count and where
                                # this regex started off.
                                my $start = $base_count + $-[0];
                                push @word_list, { word => $1, start => $start };
                        }
                }
        }

        # increment base_count with the length of this segment
        $base_count += length($token->[$token_pos]);
}


print "Original HTML:\n";
print "----------------------------------\n";
print "$html\n\n";

my $word_href   = $word_list[$word_to_repl];
my $start       = $word_href->{start};
my $word        = $word_href->{word};
my $offset      = length($word);

print "Replacing [$word] at ($start,$offset)\n\n";
substr($html,$start,$offset,'POOP');

print "New HTML:\n";
print "----------------------------------\n";
print "$html\n\n";

As written, this test script expects an HTML file called test.html in the current working directory. It also accepts an integer argument: the index (starting at 0) of the word to replace.
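The single-letter limitation noted in the code comments might be avoided by making the apostrophe-plus-suffix part an optional group rather than requiring a second run of word characters. The following is only a sketch, not a drop-in replacement: the sample text and the `$word_re` name are assumptions, and it runs on a plain string rather than on TokeParser tokens. Note that `\w` also matches digits and underscores, so numbers would still be reported as words.

```perl
use strict;
use warnings;

# Hypothetical sample of raw (undecoded) text, as a 'T' token would hold it.
my $text = "I don't have a cat &amp; dog";

# Entities are tried first so they are consumed whole; otherwise a word is
# one or more word characters, optionally followed by 'suffix groups.
my $word_re = qr/(&\#?\w+;)|(\b\w+(?:'\w+)*\b)/;

my @found;
while ($text =~ /$word_re/g) {
    next if defined $1;    # matched an entity such as &amp;, so skip it
    # $-[2] is the offset where the second capture group (the word) began.
    push @found, { word => $2, start => $-[2] };
}

print "$_->{word} at $_->{start}\n" for @found;
```

In the full script the recorded offset would still need `$base_count` added to it, exactly as in the existing loop.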

Thanks,
Justin

In reply to Byte Position of Words in an HTML Document by jqcoffey
