There I was, slurping data from web pages near and far, slinging regexes and treebuilders, elements and parsers, laughing at the remarkable ease with which Perl and my armada of CPAN modules untangled the gnarliest messes of unstructured data. Indeed, all was going quite well, until I spent several hours belaboring what reduced to the following code:
$str = 'a b c o o p s h u h ? '; print $& while $str =~ /.\s/g;
In a sane world, this would have spit out "a b c o o p s h u h ?" and I could have gone along my merry way. Ah, but would fun would that have been? Instead of that alltogether predictable and boring output, what I got was "a b c h u h ?"

To witness this result for yourself, you will probably have to click the link to download the code (though you can copy-paste it under Opera, which makes it seem all the more spooky, you can't under IE, and I'm not sure about Mozilla...) Actually, you may not see a problem with this code at all, depending on your screen font.

If you've been paying attention, you've quite probably figured out what my problem was. Some of the spaces in $str were not actually spaces. They looked like spaces, but they were actually ASCII character 0xA0's.

How did that character get there? In my code, $str came from a parsed web page. "What kind of deranged webmonkey would use high-bit ASCII characters masquerading as spaces in their HTML?" I wondered. I checked the source of the page and it was not possessed with any such evil characters. It did, however, have the seemingly innocuous entities, ' ' where the evil spaces were in my parsed HTML.

"Aha! I've discovered a bug in the HTML parser!", I happily exclaimed. Tracing through the code of this module lead me to HTML::Entities, wherein I saw that ' ' was indeed decoded as character 0xA0.

The following snippet demonstrates this behaviour quite well (copy and paste at your leisure):

use HTML::Entities; my $html = 'whoa dude whats going on with this line'; my $text = HTML::Entities::decode_entities($html); print "$text\n"; ## prints: whoa dude whats going on with this line print $& while $text =~ /\w+\s+/g; ## prints: whoa dude whats with this
So was this a bug? As it turns out, ' ' is decoded exactly as it should be according to HTML specs, into ASCII character 0xA0. This is not the space many of us know, love, and expect. This is a wanton doppelganger space, which looks like a space, copies like a space, pastes like a space, and spaces like a space, but is not, in any true sense, a space.

I don't very much care for this "non-breaking space", as it's called. My meditation, feeble thought it may be, is this: unless you should have some specific want or need of this bastard space, exorcise it early (tr/\240/ /) from all of your entity-decoded inbound HTML.

   MeowChow                                   
               s aamecha.s a..a\u$&owag.print

In reply to Spaced Out by MeowChow

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.