character differences and SOAP encoding

foftoogs has asked for the wisdom of the Perl Monks concerning the following question:

Hey y'all, I am having an issue with some ASCII characters and SOAP (SOAP::Lite), but I don't think the problem is really specific to that module. It seems that some characters, like `, are encoded differently than others even though they are supposed to be the same character. When the character ` is typed into a text field from the keyboard, for example, it seems to be translated correctly and things run smoothly. When it is copied from a webpage and pasted into the text box, the translation does not happen correctly, even though it looks like the same character. I imagine there is some code or module I can use to scan for these characters and translate them to the ASCII character that SOAP can encode correctly, but I am not too familiar with these sorts of things. Has anyone run into this before and found a solution for it?


Replies are listed 'Best First'.

Re: character differences and SOAP encoding
by almut (Canon) on Apr 23, 2007 at 12:41 UTC

You might have become a victim of smart quotes, a feature of some programs (word processors in particular) that automatically replaces regular ASCII single and double quotes with their curly counterparts. Typographers and designers are fond of these because the opening (left) and closing (right) quotes look slightly different, as in professional typesetting. So if you type 'word', you get ‘word’ (or some other form, depending on the locale), and "word" becomes “word” (zoom in if you can't see the difference...). They are entirely different (non-ASCII) characters.

How to get rid of them depends on how they're encoded. In Unicode, they are the code points U+2018 - U+201B (single quotes) and U+201C - U+201F (double quotes), while in CP1252, for example, they are 0x91, 0x92, 0x82 (curly single quotes) and 0x93, 0x94, 0x84 (curly double quotes).
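If you aren't sure which of these encodings you are dealing with, it may help to look at the non-ASCII values in the string first. The following is only a rough sketch: `non_ascii_hex` is a made-up helper name, and the two sample strings are assumptions standing in for whatever your form actually receives.

```perl
use strict;
use warnings;

# Hypothetical helper: return the hex value of every element of the
# string above 0x7F, so you can tell code points (e.g. 2018) from raw
# bytes (e.g. e2 80 98 for UTF-8, or 91 for CP1252).
sub non_ascii_hex {
    my ($text) = @_;
    return map { sprintf('%x', ord($_)) }
           grep { ord($_) > 0x7F }
           split //, $text;
}

# Assumed sample inputs, not data from the original question.
my @chars = non_ascii_hex("\x{2018}word\x{2019}");   # decoded character string
my @bytes = non_ascii_hex("\xe2\x80\x98word");       # undecoded UTF-8 bytes

print "characters: @chars\n";
print "bytes:      @bytes\n";
```

Values such as 2018 or 201c suggest a properly decoded Unicode string, a run like e2 80 98 suggests undecoded UTF-8, and single values in the 0x80 - 0x9f range suggest CP1252.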

You can replace them using Perl's tr/// or s///, e.g.

# unicode
tr/\x{2018}-\x{201B}/'/;
tr/\x{201C}-\x{201F}/"/;

# CP1252
tr/\x91\x92\x82/'/;
tr/\x93\x94\x84/"/;

# or, if you have UTF-8 data which isn't properly flagged as such,
# you can try to directly replace the multi-byte sequence, as
# they're encoded in UTF-8, e.g.
s/\xe2\x80\x98/'/g;  # one of the single quotes,
# ...                # and similarly for the 7 others...
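As an alternative to spelling out every multi-byte sequence by hand, you could decode the bytes first and then reuse the Unicode `tr///` lines. This is a minimal sketch that assumes the pasted text arrives as UTF-8 bytes; `ascii_quotes` and the sample string are made up for illustration, and `decode` comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes raw UTF-8 bytes (an assumption about the
# input) and returns a character string with ASCII quotes.
sub ascii_quotes {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> characters
    $text =~ tr/\x{2018}-\x{201B}/'/;     # curly single quotes
    $text =~ tr/\x{201C}-\x{201F}/"/;     # curly double quotes
    return $text;
}

# Assumed sample: UTF-8 bytes for U+201C, "it", U+2019, "s", U+201D.
my $pasted = "\xe2\x80\x9cit\xe2\x80\x99s\xe2\x80\x9d";
my $clean  = ascii_quotes($pasted);
print "$clean\n";
```

If the data turns out to be CP1252 instead, the CP1252 `tr///` lines can be applied to the bytes directly, with no decoding step.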

(This is just a guess. I can't tell for sure whether that really is your problem, as you haven't said in detail how the quotes appear wrong...)
