
I'll do my best to get this right, though I warn you this is not an Intrepid-approved post.

I have 40,000 mostly HTML files that are generally displayed through a Perl script (usually one at a time). Many of these files seem to contain nasty Unicode characters that browsers tend to render as boxes, question marks or whatever their "I can't print this character" glyph of the week is.

I'm trying to scrub out these nasty Unicode characters. I'm using (with success) $input =~ s/[^\x00-\x7F^\xA1-\xFF]/\ /g;

This seems to work fairly well, but it means I lose characters in the \xA0-\xFF range, which is unfortunate because I'd rather convert those to their HTML equivalents. (So it's resumé instead of resum )

I came up with two techniques for this that I *thought* would work but I cannot find an acceptable syntax.

1. Search for the high-range codes where there are HTML equivalents (\xA0-\xFF), convert each to its decimal code and add the appropriate HTML prefix and suffix (e.g. \xE9 becomes &#233;): $input =~ s/([\xA0-\xFF])/&#ord($1);/gie;

That fails because, with the /e flag, Perl tries to evaluate &# as code and can't. That's my problem. Maybe it's a very novice issue, but I don't know how to get the &# prefix and the ; suffix in there. I've tried a dozen methods and all of them are wrong.
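One way technique 1 might be made to work, as a minimal sketch: because /e evaluates the replacement as a Perl expression, the literal prefix and suffix have to be quoted strings concatenated around ord($1). The sub name to_numeric_entities is hypothetical, and the sketch assumes $input holds single-byte Latin-1 characters rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each character in the \xA0-\xFF range
# with its decimal numeric character reference (\xE9 becomes &#233;).
sub to_numeric_entities {
    my ($text) = @_;
    # With /e the replacement is Perl code, so the literal "&#" and ";"
    # are quoted and joined to ord($1) with the "." operator.
    $text =~ s/([\xA0-\xFF])/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $input = "<p>resum\xE9</p>";
print to_numeric_entities($input), "\n";   # prints <p>resum&#233;</p>
```

The /i flag from the original attempt is dropped here since case-insensitivity is not needed for this class. An equivalent replacement would be sprintf('&#%d;', ord $1).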

2. bobf kindly put me onto http://search.cpan.org/~gaas/HTML-Parser-3.60/lib/HTML/Entities.pm. I tried using encode_entities($input, "\xA0-\xFF"); (I tried the decimal equivalent as well) but no love. If I simply use encode_entities with no second argument, it eats the < and > of the tags (obviously), and that's bad for all the HTML.
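For reference, a minimal sketch of how option 2 could look. HTML::Entities documents the second argument to encode_entities as a regular-expression character class body, so single quotes are used here to pass the \x escapes through to the regex engine unchanged. The sub name encode_high_latin1 is hypothetical, and the sketch assumes the text holds single-byte Latin-1 characters; if the files are actually UTF-8 encoded (an assumption, not something confirmed above), the bytes would need decoding first.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);   # CPAN module (HTML-Parser), not core Perl

# Hypothetical helper: encode only characters in \xA0-\xFF and leave
# <, > and & untouched so the existing HTML markup survives.
sub encode_high_latin1 {
    my ($text) = @_;
    # Called in non-void context, encode_entities returns an encoded
    # copy instead of modifying its argument in place.
    return encode_entities($text, '\xA0-\xFF');
}

print encode_high_latin1("<p>resum\xE9</p>"), "\n";   # prints <p>resum&eacute;</p>
```

The same module also exports encode_entities_numeric, which takes the same arguments and emits numeric references instead of named ones.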

Option 2 seems like a more reasonable solution than my hack but I don't know how to modify it for my purposes. Sorry for the long post but in prepping this I didn't want to be guilty of the XY problem. Have a good eve.

Thanks.

update: So a bit more on the architecture at work here as I begin to try out some of the solutions. The system processes web-posted and email-posted messages (it has been doing so in basically the same way since 2001). I didn't write it and only have a cursory understanding of how it works. Messages get posted in a flat-file database system; for each message a perl file is created to hold the text. There is some minimal processing on the characters before being stored.

From there an interface provides access to each file when called. It does some minimal processing. It attempts to keep most of the formatting from the original message as these are collections of stories so the text formatting can be vital to presentation. Sometimes it doesn't work so well because the website's templates are black-backgrounded and the vast majority of those processed emails were on white-backgrounds. Nevertheless, usually just changing black to white is all that is necessary.

The system has been effective for 8 years, but recently more and more garbage characters are ending up in the final product. It's ugly, distracting and detracts from the content. In other words, the bane of my meager no-pay web programming existence.

In reply to Removing Unsafe Characters by Praethen
