s/[^\x00-\x7f]+//g;

That is, every byte/character outside the ASCII range will be deleted, regardless of whether your perl script happens to be handling the data as bytes or as characters.
A better approach would be to figure out what the character encoding of the incoming HTML data really is (and watch out for those HTML character entities that turn into non-ascii characters, like ™ é and so on). Make sure you do everything necessary to turn the text into "pure" utf8 strings (e.g. using HTML::Entities::decode_entities), and then either output the XML with proper utf8 encoding, or convert all non-ascii characters to their numeric character entities, as almut suggested above.
There's probably a module for converting characters to numeric entities, but the basic process is:
s/([^\x00-\x7f])/sprintf("&#%d;", ord($1))/eg;

(update: added a missing "#" in the sprintf format string)
But personally, I prefer having XML files with utf8 text in them.
In either case, perl has to know that the string contains utf8 characters, so it can treat it as (multi-byte) characters rather than as bytes. That means either reading the data from a file handle with a ":utf8" IO layer, or using Encode::decode to convert the raw bytes into perl's internal character representation.
In reply to Re: stripping characters from html
by graff
in thread stripping characters from html
by jonnyfolk