Re: Why does Encode::Repair only correctly fix one of these two tandem characters?

Hello Jim.

I wonder whether this is an intentional emulation or whether you just haven't noticed it, but the second step, the decode, is really strange. I would like to use the terms "internal chars" and "bytes".
The decode expects Windows-1252 bytes, but it receives UTF-8 bytes.

$ldqm = encode('UTF-8',              # internal chars to UTF-8 bytes
    decode('Windows-1252',           # expects Windows-1252 bytes, but gets UTF-8 bytes from the inner encode
        encode('UTF-8', $ldqm)));    # here: internal chars to UTF-8 bytes

So how about using from_to, which converts bytes to bytes?

my $buff = encode('UTF-8', $ldqm);           # internal chars to UTF-8 bytes
from_to($buff, 'UTF-8', 'Windows-1252');     # $buff now holds Windows-1252 bytes
$buff = decode('Windows-1252', $buff);       # Windows-1252 bytes to internal chars
print 'ret=' . encode('UTF-8', $buff);       # encode into UTF-8 bytes and print
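As a quick check of what the from_to version does, here is a minimal sketch. It assumes the character under test is U+201C (LEFT DOUBLE QUOTATION MARK); that choice and the variable names are only for illustration. Because from_to converts the bytes properly, the character should come back unchanged, so this version does not reproduce the damage.

```perl
use strict;
use warnings;
use Encode qw(encode decode from_to);

# Assumption: the test character is U+201C LEFT DOUBLE QUOTATION MARK.
my $char = "\x{201C}";

my $buff = encode('UTF-8', $char);                # internal chars -> UTF-8 bytes (E2 80 9C)
from_to($buff, 'UTF-8', 'Windows-1252');          # in place: UTF-8 bytes -> Windows-1252 byte (93)
my $cp1252_bytes = $buff;                         # keep a copy for inspection
my $round_trip   = decode('Windows-1252', $buff); # Windows-1252 bytes -> internal chars
my $out          = encode('UTF-8', $round_trip);  # internal chars -> UTF-8 bytes again

# Show the intermediate bytes as hex and whether the character survived.
printf "cp1252 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $cp1252_bytes;
printf "same character: %s\n", ($round_trip eq $char ? 'yes' : 'no');
```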

regards

Re^2: Why does Encode::Repair only correctly fix one of these two tandem characters? by Jim (Curate) on Aug 10, 2014 at 16:24 UTC
TMTOWTDI. I think the right-to-left pipeline I used to damage the characters for the purpose of the demonstration…

#      <-- 3           <-- 2                  <-- 1
$foo = encode('UTF-8', decode('Windows-1252', encode('UTF-8', $foo)));

…more clearly emulates what actually happens in the wild: text is encoded in UTF-8, then wrongly decoded as if it were encoded in Windows-1252, then encoded again in UTF-8. I'm not sure what using the in-place convenience function `Encode::from_to()` lends to the clarity and effectiveness of the demonstration of the sequence of events. FWIW, Encode::Repair uses `Encode::encode()` and `Encode::decode()`, not `Encode::from_to()`.
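To make the three steps concrete, here is a minimal sketch of the damage and of undoing it by running the same steps backwards. It assumes the character is U+201C (LEFT DOUBLE QUOTATION MARK), whose three UTF-8 bytes all happen to be defined in Windows-1252; the variable names are illustrative, and the manual reversal is only a hand-rolled demonstration, not how Encode::Repair is implemented.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: the test character is U+201C LEFT DOUBLE QUOTATION MARK.
my $original = "\x{201C}";

# Damage, in the order 1, 2, 3 from the pipeline above.
my $utf8_bytes = encode('UTF-8', $original);            # 1: E2 80 9C
my $misread    = decode('Windows-1252', $utf8_bytes);   # 2: U+00E2 U+20AC U+0153
my $damaged    = encode('UTF-8', $misread);             # 3: C3 A2 E2 82 AC C5 93

# Repair by hand: undo step 3, then step 2, then step 1.
my $repaired = decode('UTF-8',
    encode('Windows-1252',
        decode('UTF-8', $damaged)));

# Show the damaged bytes as hex and whether the repair succeeded.
printf "damaged bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $damaged;
printf "repaired: %s\n", ($repaired eq $original ? 'yes' : 'no');
```

The reversal only works when every byte from step 1 has a defined Windows-1252 mapping, which is worth checking for each character you test.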
Re^3: Why does Encode::Repair only correctly fix one of these two tandem characters? by remiah (Hermit) on Aug 11, 2014 at 02:05 UTC
I am so slow to understand the situation... I thought a proper byte conversion from UTF-8 to cp1252 would solve the whole problem (between steps 1 and 2 of the pipeline). But the wrongly encoded text is already out there in the wild, and there is no way back, right? Then I have no good idea...
Re^4: Why does Encode::Repair only correctly fix one of these two tandem characters? by Jim (Curate) on Aug 11, 2014 at 02:26 UTC
Sorry, I don't understand you.