foftoogs has asked for the wisdom of the Perl Monks concerning the following question:

Hey y'all, I am having an issue with some ASCII characters and SOAP (SOAP::Lite), but I think this is a problem that is not really localised to that module. It seems that some chars like ` are encoded differently than others when they are supposed to be the same character. When the character ` for example is typed into a text field from the keyboard, it seems to be translated correctly and things run smoothly. But when it is cut and pasted into the text box from a resource on a webpage, the translation does not occur correctly, even though it looks like the same character. I imagine there is some translation code or module that I can employ to scan for these and translate them to the ASCII char that SOAP can encode correctly, but I am not too familiar with these sorts of things. Has anyone had this sort of stuff go down before and found a solution for it?

Re: character differences and SOAP encoding
by almut (Canon) on Apr 23, 2007 at 12:41 UTC

    You might have become a victim of smart quotes, a feature of some programs (word processors in particular) that automatically replaces regular ASCII single and double quotes with their curly counterparts. Typographers and designers are fond of these because the opening (left) and closing (right) quotes look slightly different, as in professional typesetting. This means that if you type 'word', you get ‘word’ (or some other form, depending on the locale), and "word" becomes “word” (zoom in if you can't see the difference). They are entirely different, non-ASCII characters.

    How to get rid of them depends on how they're encoded. In Unicode, they are the code points U+2018 - U+201B (curly single quotes) and U+201C - U+201F (curly double quotes), while in CP1252, for example, they are 0x91, 0x92, 0x82 (curly single quotes) and 0x93, 0x94, 0x84 (curly double quotes).
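    To find out which of these encodings you are dealing with, it can help to dump the non-ASCII values in the pasted text. The following is only a minimal sketch: the helper name non_ascii_report and the sample input are made up for illustration and are not part of SOAP::Lite.

```perl
use strict;
use warnings;

# Hypothetical helper: report every non-ASCII byte or character
# in a string, with its offset and its value in hex.
sub non_ascii_report {
    my ($text) = @_;
    my @found;
    while ($text =~ /([^\x00-\x7F])/g) {
        # pos() points just past the match, so subtract 1 for the offset
        push @found, sprintf('offset %d: 0x%X', pos($text) - 1, ord $1);
    }
    return @found;
}

# Assumed sample input: the raw UTF-8 bytes of U+2018 followed by "word"
my $pasted = "\xE2\x80\x98word";
print "$_\n" for non_ascii_report($pasted);
```

    A lone value in the 0x82 - 0x94 range suggests CP1252, a run such as 0xE2 0x80 0x98 suggests UTF-8 bytes that have not been decoded, and a value such as 0x2018 means the string already holds decoded Unicode characters.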

    You can replace them using Perl's tr/// or s///, e.g.

    # unicode
    tr/\x{2018}-\x{201B}/'/;
    tr/\x{201C}-\x{201F}/"/;

    # CP1252
    tr/\x91\x92\x82/'/;
    tr/\x93\x94\x84/"/;

    # or, if you have UTF-8 data which isn't properly flagged as such,
    # you can try to directly replace the multi-byte sequence, as
    # they're encoded in UTF-8, e.g.
    s/\xe2\x80\x98/'/g;  # one of the single quotes,
    # ...
    # and similarly for the 7 others...
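    If the data turns out to be undecoded UTF-8, an alternative to spelling out all eight byte sequences is to decode first and then apply the Unicode tr/// ranges. This is a sketch under that assumption, using the core Encode module. The name straighten_quotes is hypothetical, and the input is assumed to be a UTF-8 byte string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 bytes into a character string,
# then map curly quotes to their plain ASCII counterparts.
sub straighten_quotes {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> Perl characters
    $text =~ tr/\x{2018}-\x{201B}/'/;     # curly single quotes
    $text =~ tr/\x{201C}-\x{201F}/"/;     # curly double quotes
    return $text;
}

# Assumed sample input: UTF-8 bytes for U+201C, "word", U+201D
my $pasted = "\xE2\x80\x9Cword\xE2\x80\x9D";
print straighten_quotes($pasted), "\n";   # prints "word" with ASCII quotes
```

    The result contains only ASCII quotes, so it should be safe to hand on to whatever does the SOAP encoding. Any other non-ASCII characters in the text are left untouched.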

    (This is just a guess. I can't tell for sure whether that really is your problem, since you haven't said in detail in what way the quotes appear wrong.)