Re: keeping diacritical marks in a string

in reply to keeping diacritical marks in a string

Hi there

An interesting question, but I think you will have to provide some more information:

What do the special characters look like in the original data?
Are they encoded as numeric entities (&#xxx;) or in some character encoding scheme?
Is there a header indicating the encoding?
What do the characters look like when viewed as hex?
What is the bit of code that is removing the characters?
What version of perl are you using?
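
For the "viewed as hex" question, a minimal sketch along these lines may help. The helper name `hex_bytes` and the sample strings are made up for illustration and are not from the original post. The idea is just to print each byte of the raw data so you can tell a two-byte UTF-8 sequence from a single Latin-1 byte.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original thread): show each byte of a
# string as two hex digits. Intended for raw, undecoded byte strings.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

# Illustrative data: "cafe" with an accented final e, in two encodings.
my $utf8_bytes   = "caf\xc3\xa9";   # UTF-8: the accented e is two bytes
my $latin1_bytes = "caf\xe9";       # Latin-1: the accented e is one byte

print hex_bytes($utf8_bytes),   "\n";   # 63 61 66 c3 a9
print hex_bytes($latin1_bytes), "\n";   # 63 61 66 e9
```

Running the same dump on the data at each stage (straight from the request, after each module, after each regex) should show where the bytes change.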

I won't be online till tomorrow, but hopefully someone else will be able to help you if you provide that information.

Cheers, F.V.S.

Re^2: keeping diacritical marks in a string by Foxpond Hollow (Sexton) on Oct 09, 2009 at 02:07 UTC
I don't know whether the site alerts you when a post you've commented on is updated, so in case it doesn't: I've updated the post with the info you asked for. Note that the second update has the correct URL; please ignore the URL in the first update.
Re^3: keeping diacritical marks in a string by FalseVinylShrub (Chaplain) on Oct 09, 2009 at 06:06 UTC
Hmm, I can't see an obvious source of the problem. I think you need to dump out the result of the request before any processing and find out exactly where the special characters are being lost: is the data coming out of LWP correctly, is it the regex, could it be the MARC:: module, and so on. As graff said, it shouldn't be losing these characters, but there are a number of places where things can go wrong. It's all a bit complicated and I can't think of a good guide to it at the moment.

On the other hand, I've never heard of Perl completely stripping special characters because of an encoding problem. Normally, if the encoding is not set correctly, a multi-byte UTF-8 character gets treated as 2 or 3 separate characters. So I suspect an error in some code somewhere. Could something be validating input and stripping out characters it doesn't think are "safe"?

Sorry I can't be of more help. Try to narrow down where they disappear and it will be solved eventually.
Re^4: keeping diacritical marks in a string by Foxpond Hollow (Sexton) on Oct 09, 2009 at 06:51 UTC
Thank you! You were right. One of the things I forgot to mention was that I was doing a substitution later to remove any punctuation characters: `s/[^\w\s]//g`

I didn't realize that accented letters weren't being matched by \w; I figured they still counted as alphanumeric. Well, that's kind of annoying. I was using that substitution to normalize the string and remove things like commas and semicolons. Now I have to make a list of all the characters I want to remove, instead of just specifying the ones I want to keep. Oh well, at least we've found the problem.

Still, is there some way to convert accented letters by removing the accent and keeping the base letter? That would seem a better solution than listing out everything that is not a letter, number, space, or accented letter.
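
One possible approach to both questions, offered as a sketch rather than a definitive answer: if the data is UTF-8 bytes, decoding it into Perl characters first lets Unicode-aware classes treat accented letters as letters, so the "keep list" style of substitution still works. Accents can then be removed by decomposing each character with `NFD` from Unicode::Normalize and deleting the combining marks. The sample input is invented, and the assumption that the data arrives as UTF-8 is mine, not something confirmed in the thread. `\p{L}` and `\p{N}` are used here instead of `\w` so the match does not depend on how Perl happens to store the string internally.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical input: raw UTF-8 bytes, as they might arrive from a response.
my $bytes = "Caf\xc3\xa9, na\xc3\xafve; r\xc3\xa9sum\xc3\xa9!";

# Decode bytes into characters so each accented letter is one character.
# (Assumes the source really is UTF-8.)
my $text = decode('UTF-8', $bytes);

# Option 1: keep accented letters, drop anything that is not a letter,
# a number, or whitespace.
(my $kept = $text) =~ s/[^\p{L}\p{N}\s]//g;

# Option 2: strip the accents. NFD splits an accented letter into a base
# letter plus combining marks; removing the marks (\p{Mn}) leaves the base.
(my $plain = NFD($text)) =~ s/\p{Mn}//g;
$plain =~ s/[^\p{L}\p{N}\s]//g;

binmode STDOUT, ':encoding(UTF-8)';
print "$kept\n";    # Café naïve résumé
print "$plain\n";   # Cafe naive resume
```

Note that decomposition only helps for letters that actually decompose into base plus mark. Letters such as "ø" or "ł" have no such decomposition and pass through unchanged, so they would need an explicit mapping or a transliteration module such as Text::Unidecode from CPAN.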
Re^5: keeping diacritical marks in a string by FalseVinylShrub (Chaplain) on Oct 09, 2009 at 10:42 UTC
Re^6: keeping diacritical marks in a string by Foxpond Hollow (Sexton) on Oct 10, 2009 at 00:47 UTC