in reply to stripping characters from html
The simplest way to strip everything that isn't ASCII is:

    s/[^\x00-\x7f]+//g;

That is, every byte/character outside the ASCII range will be deleted, regardless of whether your Perl script happens to be handling the data as bytes or as characters.
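To illustrate the "bytes or characters" point, here is a minimal sketch. The sample string is made up for illustration; it holds the same text once as raw UTF-8 bytes and once as a decoded character string, and the substitution leaves the same ASCII remainder in both cases.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "café™ ok" as raw UTF-8 bytes
# (c3 a9 is é, e2 84 a2 is ™).
my $bytes = "caf\xc3\xa9\xe2\x84\xa2 ok";

# The same text as a Perl character string.
my $chars = decode('UTF-8', $bytes);

# Strip everything outside the ASCII range from a copy of each.
(my $stripped_bytes = $bytes) =~ s/[^\x00-\x7f]+//g;
(my $stripped_chars = $chars) =~ s/[^\x00-\x7f]+//g;

print "$stripped_bytes\n";   # caf ok
print "$stripped_chars\n";   # caf ok
```

Note that the non-ASCII text is simply lost either way, which is why the alternatives below are usually preferable.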
A better approach would be to find out what the character encoding of the incoming HTML data really is (and watch out for HTML character entities that turn into non-ASCII characters, like &trade; (™), &eacute; (é) and so on). Do everything necessary to turn the text into "pure" utf8 strings (using HTML::Entities::decode_entities for the entities), and then either output the XML with proper utf8 encoding, or convert all non-ASCII characters to their numeric character entities, as almut suggested above.
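As a rough sketch of the entity-decoding step, assuming HTML::Entities is installed (it comes from CPAN, not the Perl core) and using a made-up input string:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical input: ASCII-only HTML text that hides non-ASCII characters
# behind named entities.
my $html_text = 'Acme&trade; caf&eacute; &amp; bar';

# In non-void context decode_entities returns a decoded copy; the result
# holds U+2122 and U+00E9 as real characters.
my $text = decode_entities($html_text);

# Encode as UTF-8 on the way out.
binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";
```

One thing to keep in mind: decoding also turns &amp; and &lt; back into literal "&" and "<", so those still need to be escaped again before the text goes into the XML.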
There's probably a module for converting characters to numeric entities, but the basic process is:
    s/([^\x00-\x7f])/sprintf("&#%d;",ord($1))/eg;

(update: added a missing "#" in the sprintf format string)
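Wrapped up as a runnable sketch, this could look like the following. The sub name to_numeric_entities and the sample data are hypothetical, and the input is decoded first so that ord() sees code points rather than individual bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper name; it is not taken from any module.
sub to_numeric_entities {
    my ($text) = @_;
    # Replace each non-ASCII character with its decimal numeric entity.
    $text =~ s/([^\x00-\x7f])/sprintf("&#%d;", ord($1))/eg;
    return $text;
}

# Made-up sample: "café ™" arriving as UTF-8 bytes, decoded to characters.
my $chars  = decode('UTF-8', "caf\xc3\xa9 \xe2\x84\xa2");
my $result = to_numeric_entities($chars);

print "$result\n";   # caf&#233; &#8482;
```

The output is plain ASCII, so it survives whatever encoding the XML file ends up being declared with.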
But personally, I prefer having XML files with utf8 text in them.
In either case, perl has to know that the string contains utf8 characters, so it can treat it as (multi-byte) characters rather than as bytes. That means you have either read the data from a file handle using a ":utf8" IO layer, or used Encode::decode to turn the raw bytes into a character string.
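Both routes can be sketched like this. The sample bytes are made up, an in-memory file handle stands in for a real file, and the layer is written as ':encoding(UTF-8)', which is the stricter, validating form of the ":utf8" layer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: raw UTF-8 bytes for "café" (5 bytes, 4 characters).
my $bytes = "caf\xc3\xa9";

# Route 1: decode the bytes explicitly.
my $decoded = decode('UTF-8', $bytes);

# Route 2: read through a decoding IO layer.
open(my $fh, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";
my $from_layer = <$fh>;
close($fh);

printf "bytes: %d, decoded: %d, via layer: %d\n",
    length($bytes), length($decoded), length($from_layer);

# Without decoding, ord() sees a single byte of the é sequence, not U+00E9.
printf "last byte: %d, last character: %d\n",
    ord(substr($bytes, -1)), ord(substr($decoded, -1));
```

The last line shows why the numeric-entity substitution goes wrong on undecoded data: each byte of a multi-byte sequence would get its own (wrong) entity.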