in reply to Re^4: keeping diacritical marks in a string
in thread keeping diacritical marks in a string

OK. That's good.

Actually, I don't have a copy of Perl to verify but I think you "should" be able to get that substitution to work as you intended.

I think the fact that it doesn't work indicates that Perl is not treating the string as UTF-8. Basically, each string carries an internal flag in its data structure. If the flag is not set, the string is treated as a plain sequence of bytes, and character operations like uc() and your substitution will work in ASCII mode (actually, ISO-8859-1 I've just learned - see below).

There are a number of ways to get the string to be treated as UTF-8, and I'm not sure which ones are "correct" in this situation. But try doing this after you get the $HTML and before you start doing operations on it:

use Encode;
# ...
$HTML = decode_utf8($HTML);

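You can also use a more brute-force approach, though note that utf8::upgrade only changes Perl's internal representation (it treats the existing bytes as ISO-8859-1 and does not decode UTF-8 input):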

utf8::upgrade($HTML);

I think the decode method is preferred, but perhaps someone else will correct / confirm.
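To make the difference between the two calls concrete, here is a minimal sketch. The sample string is made up for illustration (it is not the OP's $HTML), and it assumes the input really is UTF-8 encoded bytes, which has not been confirmed in this thread:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: "caf" followed by the two UTF-8 bytes for e-acute (U+00E9),
# as they would arrive from a file or socket before any decoding.
my $bytes = "caf\xC3\xA9";

# decode() interprets the bytes as UTF-8 and returns a character string.
my $chars = decode('UTF-8', $bytes);

# utf8::upgrade() only changes the internal storage; each original byte
# is still one character, so the two bytes of e-acute are NOT merged.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

printf "bytes:    length %d\n", length $bytes;      # counts bytes
printf "decoded:  length %d\n", length $chars;      # counts characters
printf "upgraded: length %d\n", length $upgraded;   # same count as the bytes

# uc() leaves the undecoded high bytes alone but uppercases the decoded character.
# Encode again on the way out so the terminal receives UTF-8 bytes.
print "uc(bytes):   ", uc($bytes), "\n";
print "uc(decoded): ", encode('UTF-8', uc($chars)), "\n";
```

If the input is actually ISO-8859-1 rather than UTF-8, the decode call would need that encoding name instead, so it is worth checking what the source really sends.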

It is a complex topic, but the following documents are a good place to start:

I hope this solves your problem. Keep us posted...

FVS

Re^6: keeping diacritical marks in a string
by Foxpond Hollow (Sexton) on Oct 10, 2009 at 00:47 UTC
    Unfortunately, both suggested methods seem to change the accented characters into question marks. Which (somewhat ironically) just raises more questions.

      Oh dear. I was sure that was going to work.

      Are you sure that it's changing them to question marks? That is, are you sure it's not just that your terminal can't display the Unicode characters? Sorry for stating the obvious, but you need to view the output in a browser or something else that can display Unicode, or check the hex values to see if they're correct.
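One way to check the hex values without relying on the terminal is sketched below. The sample data is hypothetical (a single e-acute encoded as UTF-8); substitute a short piece of the real string:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the two UTF-8 bytes for e-acute (U+00E9).
my $bytes = "\xC3\xA9";
my $chars = decode('UTF-8', $bytes);

# Hex value of every byte in the raw input.
my $byte_hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

# Code point of every character after decoding.
my $char_hex = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;

# A literal question mark is 0x3F; index() returns -1 if there is none.
my $has_real_qmark = index($chars, '?') >= 0 ? 'yes' : 'no';

print "raw bytes:      $byte_hex\n";
print "decoded chars:  $char_hex\n";
print "contains 0x3F:  $has_real_qmark\n";
```

If the dump shows U+00E9 and no 3F, the data is fine and only the display is at fault. A genuine 3F in place of the accented character would mean it was lost somewhere before this point.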

      Alternatively, you could work around the problem by doing the input cleanup in a different way, but I think your original approach was correct and it should work if we can get Perl to treat the strings as Unicode.

      The fact that it now does something different makes me think that the decoding might be working, but that it's exposing another problem somewhere else.

      I might not be online much for a couple of days (I'm actually in an internet cafe in Vientiane, Laos...) but I would suggest drawing attention to this thread in the chatterbox at a busy time to get someone else to look at it. It seems to have gone quiet. Or perhaps it would be justified to start a new thread with your problem more narrowed down.

      Good luck...