Is this a HTML::TokeParser::Simple bug I should send along to Ovid, or does the problem exist between chair and keyboard?

Summary of problem: Some text tokens are getting split into two tokens. Sample test case below:

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser::Simple;

my $html = q(
<option value="STAFE">STAFE - 900 - BEN PROG - Food Assistance</option>
<option value="STAM7">STAM7 - 900 - BEN PROG - Med Asst - Lynchbrg</option>
<option value="STAM8">STAM8 - 900 - BEN PROG - Med Asst - Marion</option>
<option value="STAM9">STAM9 - 900 - BEN PROG - Med Asst - Petrsbrg</option>
<option value="STAMA">STAMA - 900 - BEN PROG - Medical Assistance</option>
<option value="STATA">STATA - 900 - BEN PROG - Economic Assistance</option>
<option value="STATR">STATR - 900 - BEN PROG - Training Development</option>);

my $p = HTML::TokeParser::Simple->new(\$html);
while(my $token = $p->get_token){
    if($token->is_text){
        my $text = $token->return_text;
        next unless $text =~ /\S/;
        print "[$text]\n";
    }
}

produces as output:

[STAFE - 900 - BEN PROG - Food Assistance]
[STAM7 - 900 - BEN PROG - Med Asst - Lynchbrg]
[STAM8 - 900 - BEN PROG - Med Asst - Marion]
[STAM9 - 900 - BEN PROG - Med Asst - Petrsbrg]
[STAMA - 900 - BEN PROG - Medical Assistance]
[STATA - 900 - BEN PROG - Economic Assistance]
[STATR - 900 - BEN PROG - Training]
[ Development]

I have no idea why the "Development" ends up on its own line. This is the smallest sample from my data that gave these results. Adding more data "moves" the problem, but the problem still exists. Taking out the space between "Training" and "Development" in the data makes the new compound word the one that goes to its own line.

It's acting as if some buffer length is interfering, but it isn't just the token length: making a text token longer changes where the problem shows up, but it doesn't necessarily hit the longest token. In fact, in this sample set, it keeps hitting the last text token somewhere.
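
Whatever the cause turns out to be, one stopgap would be to glue consecutive text tokens back together before printing. This is only a minimal sketch: it assumes the split pieces always arrive as adjacent text tokens with no other token between them, which I haven't confirmed in general, and `merged_text` is a made-up helper name, not part of any module.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser::Simple;

# Hypothetical helper (not part of any module): returns the non-blank text
# runs found in $html, joining text tokens that arrive back to back.
sub merged_text {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new(\$html);
    my @runs;
    my $pending = '';
    while (my $token = $p->get_token) {
        if ($token->is_text) {
            # Accumulate instead of printing right away, so a text run that
            # was delivered in two pieces ends up as one string.
            $pending .= $token->return_text;
            next;
        }
        # Any non-text token ends the current run of text.
        push @runs, $pending if $pending =~ /\S/;
        $pending = '';
    }
    # Flush text that trails the last tag, if any.
    push @runs, $pending if $pending =~ /\S/;
    return @runs;
}

print "[$_]\n" for merged_text(
    q(<option value="STATR">STATR - 900 - BEN PROG - Training Development</option>)
);
```

This uses `return_text` only because that is what the test case above uses. It hides the symptom rather than explaining it, so the question still stands.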

I've skimmed the docs for HTML::Parser (I've used 3.25 and 3.26), HTML::TokeParser (2.24), and HTML::TokeParser::Simple (1.4). I've tried it on different boxes (one aging SuSE box with 5.6.0, one Debian unstable with 5.8.0). Any ideas?


In reply to Possible HTML::TokeParser::Simple Bug by swiftone
