Re: stripping characters from html

If your goal is XML output whose content is an imperfect and incomplete copy of the original HTML text (i.e. with an indeterminate amount of corruption due to lost content), then a "generic approach" to what almut aptly calls the "last resort" solution is a simple regex applied to the HTML text data:

s/[^\x00-\x7f]+//g;

That is, every byte/character outside the ASCII range will be deleted, regardless of whether your perl script happens to be handling the data as bytes or as characters.
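To make the "bytes or characters" point concrete, here is a minimal sketch. The helper name strip_non_ascii is made up for illustration; the regex is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every byte/character outside the ASCII range.
sub strip_non_ascii {
    my ($text) = @_;
    $text =~ s/[^\x00-\x7f]+//g;
    return $text;
}

# Character string: e-acute and the trademark sign are one character each.
print strip_non_ascii("caf\x{e9} \x{2122}"), "\n";        # prints "caf "

# Byte string: the same text as raw UTF-8 bytes; every high byte is removed.
print strip_non_ascii("caf\xc3\xa9 \xe2\x84\xa2"), "\n";  # prints "caf "
```

Either way the accented letter and the trademark sign are simply gone, which is the data loss described above.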

A better approach would be to find out what the character encoding of the incoming HTML data really is (and watch out for those HTML character entities that turn into non-ASCII characters, like &trade; &eacute; &nbsp; and so on). Make sure you do everything necessary to turn the text into "pure" utf8 strings (decoding the entities with HTML::Entities::decode_entities), and then output the XML with proper utf8 encoding, or else convert all non-ASCII characters to their numeric character entities, as almut suggested above.
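A rough sketch of that decoding step, assuming you have already worked out the input encoding. The helper name html_text_to_chars and the ISO-8859-1 sample are only illustrations; Encode ships with perl, while HTML::Entities comes from the HTML-Parser distribution on CPAN.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Hypothetical helper: $bytes is raw HTML text data, $charset is the
# encoding you have determined it to be in (an assumption you must verify).
sub html_text_to_chars {
    my ($bytes, $charset) = @_;
    my $chars   = decode($charset, $bytes);   # raw bytes -> perl characters
    my $decoded = decode_entities($chars);    # &trade; etc. -> real characters
    return $decoded;
}

# Sample input: Latin-1 bytes containing one literal e-acute and one entity.
my $chars = html_text_to_chars("caf\xe9 &trade;", "ISO-8859-1");

# Encode explicitly on output so the XML really is utf8.
binmode(STDOUT, ':raw');
print encode('UTF-8', $chars), "\n";
```

Note that this only decodes entities in text you have already pulled out of the markup; it does not parse the HTML itself.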

There's probably a module for converting characters to numeric entities, but the basic process is:

s/([^\x00-\x7f])/sprintf("&#%d;",ord($1))/eg;

(update: added a missing "#" in the sprintf format string)
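Wrapped in a sub, the substitution might look like the sketch below. The name to_numeric_entities is made up, and the input is assumed to be a decoded character string; on raw UTF-8 bytes you would get one bogus entity per byte.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each non-ASCII character with its decimal
# numeric character reference. Expects characters, not raw bytes.
sub to_numeric_entities {
    my ($chars) = @_;
    $chars =~ s/([^\x00-\x7f])/sprintf("&#%d;", ord($1))/eg;
    return $chars;
}

print to_numeric_entities("caf\x{e9} \x{2122}"), "\n";  # prints "caf&#233; &#8482;"
```

The result is pure ASCII, so it survives no matter what encoding the XML declaration claims.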

But personally, I prefer having XML files with utf8 text in them.

In either case, perl has to know that the string contains utf8 characters, so that it treats the data as (possibly multi-byte) characters rather than as bytes. That means you have either read the data from a file handle using a ":utf8" IO layer, or used Encode::decode to convert the raw bytes into perl's internal utf8 character strings.
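Both routes, sketched with made-up helper names. I've written the layer as ":encoding(UTF-8)" here rather than plain ":utf8"; that is my substitution, chosen because it also checks that the input is well-formed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Option 1 (hypothetical helper): let an IO layer decode while reading.
sub slurp_utf8 {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
    local $/;                 # slurp mode: read the whole file at once
    my $chars = <$fh>;
    close $fh;
    return $chars;
}

# Option 2 (hypothetical helper): read raw bytes, then decode explicitly.
sub slurp_and_decode {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;
    my $bytes = <$fh>;
    close $fh;
    return decode('UTF-8', $bytes);
}
```

With either one, length() counts characters and the non-ASCII regexes above see one character per accented letter rather than two or three bytes.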

Re^2: stripping characters from html by Your Mother (Archbishop) on Aug 03, 2010 at 21:07 UTC
I agree with keeping the stuff utf8, etc. `s/[^\x00-\x7f]+//g;` may be more readable as (what I believe is the equivalent POSIX class): `s/[^[:ascii:]]+//g;`
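For anyone who wants to check the "equivalent" part rather than take it on faith, a quick spot check could look like this. The sampled range 0 .. 0x2FF is arbitrary, so it is a sanity test, not a proof.

```perl
use strict;
use warnings;

# Collect every sampled code point on which the two negated classes disagree.
my @mismatches = grep {
    my $c = chr($_);
    my $by_range = $c =~ /[^\x00-\x7f]/   ? 1 : 0;
    my $by_posix = $c =~ /[^[:ascii:]]/   ? 1 : 0;
    $by_range != $by_posix;
} 0 .. 0x2FF;

print @mismatches ? "classes differ\n" : "classes agree\n";
```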
Re^3: stripping characters from html by graff (Chancellor) on Aug 04, 2010 at 02:55 UTC
"... may be more readable as (what I believe is the equivalent POSIX class) ..."
Right -- and I totally agree (and yes I'm pretty sure the POSIX expression is equivalent). But "more readable" can be different things to different people; e.g. a specific numeric range can lead to less uncertainty or doubt, compared to having to recall the exact syntax and meaning of an expression consisting of extra punctuation around a term that tends to be misused or misunderstood by less experienced programmers...
Re^4: stripping characters from html by GrandFather (Saint) on Aug 04, 2010 at 09:11 UTC
In this case your argument seems to better support a "more writeable" thesis than the "more readable" thesis you seem to be propounding. I find the POSIX version much more readable than the character range alternative, although I'd be very unlikely to write the POSIX version for exactly the "recall the exact syntax" issue you mention (combined of course with innate laziness).
True laziness is hard work
Re^4: stripping characters from html by Anonymous Monk on Aug 05, 2010 at 15:40 UTC
I agree w/ GrandFather on this one. While the 1st version may accurately specify the range, without an additional comment it does not convey the purpose of the range. The 2nd version has the advantage of advertising that purpose.