(Edited by epoptai on 10/7/03 in reply to 296401)

I've got a problem with the output of Perlmonks' chatterbox XML ticker. When a high-bit (non-ASCII) character like 'á' is entered in the CB, it is not encoded; it's transmitted in the XML stream in a way that causes XML::Simple to die (as expected when receiving bad XML). It would be best if perlmonks generated 'legal' XML, but it doesn't, so the problem has to be dealt with on the receiving end. I don't know much about this subject, and have been using the following code from jcwren to convert the problem characters into underscores:

$xml =~ s/[\r\n\t]//g;
$xml =~ tr/\x80-\xff/_/;
$xml =~ tr/\x00-\x1f/_/;

That's very effective, but leaves something to be desired: the character behind the underscore. Since these characters can be detected and underscored, surely they can be detected and encoded properly? I've made many horribly broken attempts to encode these characters, but my lack of knowledge in this area always gets the last laugh.

Recently mirod posted "Converting character encodings", which includes a regex from XML::TiePYX that gets very close to doing the job, but it only encodes some of the characters, not all. It barfs on ¤ and probably others:

# This is the regex from XML::TiePYX
$xml =~ s{([\xc0-\xc3])(.)}{
    my $hi = ord($1);
    my $lo = ord($2);
    chr((($hi & 0x03) << 6) | ($lo & 0x3F))
}ge;

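As far as I can tell, that substitution does not produce entities at all: it collapses a two-byte UTF-8 sequence (lead byte \xc0-\xc3 plus one following byte) into a single byte. A small sketch of the arithmetic, where collapse_two_byte is a hypothetical wrapper name of my own and not part of XML::TiePYX:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the XML::TiePYX substitution, for illustration only.
sub collapse_two_byte {
    my ($s) = @_;
    $s =~ s{([\xc0-\xc3])(.)}{
        # low 2 bits of the lead byte, then low 6 bits of the following byte
        chr(((ord($1) & 0x03) << 6) | (ord($2) & 0x3F))
    }ge;
    return $s;
}

# UTF-8 bytes C3 A1 collapse to the single byte E1 ('á' in Latin-1).
print unpack("H*", collapse_two_byte("\xc3\xa1")), "\n";   # prints e1

# A lone high-bit byte such as A4 does not match the pattern and passes through unchanged.
print unpack("H*", collapse_two_byte("\xa4")), "\n";       # prints a4
```

If the ticker sends some characters as single high-bit bytes rather than as UTF-8 pairs, that might be why a few of them slip through, but that is a guess on my part.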
I seek an extended version of the XML::TiePYX regex that finds and encodes the full range of high-bit characters covered by the first solution (\x80-\xff). I'd rather not use another module (XML parser or otherwise) for this task.
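To make the goal concrete, here is a minimal sketch of the kind of thing I'm after. It assumes $xml holds raw bytes and that every high-bit byte is a single Latin-1 (ISO-8859-1) character, which I have not verified for the ticker. encode_high_bit is a hypothetical name of mine, not from jcwren's code or XML::TiePYX:

```perl
use strict;
use warnings;

# Hypothetical helper. Assumes high-bit bytes are Latin-1, so each byte value
# is also the character's code point and can be used in a numeric reference.
sub encode_high_bit {
    my ($xml) = @_;
    $xml =~ s/[\r\n\t]//g;       # same whitespace stripping as the original code
    $xml =~ tr/\x00-\x1f/_/;     # remaining control characters still become underscores
    $xml =~ s/([\x80-\xff])/'&#' . ord($1) . ';'/ge;   # byte 0xE1 becomes &#225;
    return $xml;
}

print encode_high_bit("caf\xe9 \xa4\n"), "\n";   # prints caf&#233; &#164;
```

The control characters are still underscored because most of them are not legal in XML 1.0 even as character references. If the stream turns out to contain UTF-8 pairs, this would encode each byte of a pair separately, so it is only right under the Latin-1 assumption.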

thanks for your time - epoptai

--
Check out my Perlmonks Related Scripts like framechat, reputer, and xNN.


In reply to Regex to encode entities in XML by epoptai
