in reply to keeping diacritical marks in a string

Hi there

An interesting question, but I think you will have to provide some more information:

I won't be online till tomorrow, but hopefully someone else will be able to help you if you provide that information.

Cheers, F.V.S.

Re^2: keeping diacritical marks in a string
by Foxpond Hollow (Sexton) on Oct 09, 2009 at 02:07 UTC
    I don't know if the site alerts you when a post you've commented on is updated, so in case it doesn't: I've updated the post with the info you asked for. Note that the second update has the correct URL, and you should ignore the URL in the first update.

      Hmm

      I can't see an obvious source of the problem. I think you need to dump out the result of the request before any processing and work out exactly where the special characters are being lost: is it coming out of LWP correctly, is it the regex, could it be the MARC:: module, and so on.
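      For the dumping step, something along these lines might do. It is only a sketch: show_chars is a made-up helper name, and $sample stands in for whatever your request actually returns. It prints each character as a code point, so you can see what is really in the string no matter how your terminal is set up.

```perl
use strict;
use warnings;

# Hypothetical helper: list every character of a string as a code point,
# so non-ASCII characters stay visible whatever the terminal encoding is.
sub show_chars {
    my ($label, $str) = @_;
    my $hex = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
    return sprintf '%s (length %d): %s', $label, length($str), $hex;
}

my $sample = "caf\x{e9}";    # stand-in for the data returned by the request
print show_chars('raw', $sample), "\n";
```

      Call it after the fetch, after the MARC step, and after each regex. The stage where U+00E9 (or the pair U+00C3 U+00A9, if the data is still undecoded bytes) drops out is where to look.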

      As graff said, it shouldn't be losing these characters, but there are a number of places where things can go wrong.

      It's all a bit complicated and I can't think of a good guide to it at the moment. On the other hand, I've never heard of Perl completely stripping special characters because of an encoding problem. Normally, a multi-byte UTF-8 character gets treated as 2 or 3 separate characters if the encoding is not set correctly. So I suspect an error in some code somewhere. Could it be that something is validating input and stripping out characters it doesn't think are "safe"?
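      To illustrate the "2 or 3 characters" effect, here is a small sketch using the core Encode module. The variable names are just for illustration, and the two bytes are hard-coded rather than taken from your data.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";                 # the two UTF-8 bytes that encode U+00E9
my $chars = decode('UTF-8', $bytes);    # the same data as one decoded character

printf "undecoded: length %d\n", length $bytes;
printf "decoded:   length %d\n", length $chars;
```

      If length() reports more characters than you can see on screen, the string is probably still undecoded bytes.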

      Sorry I can't be of more help. Try to narrow it down to where they disappear and it will be solved eventually.

        Thank you! You were right, one of the things I forgot to mention was that I was doing a substitution later to remove any punctuation characters, and it was:

        s/[^\w\s]//g


        I didn't realize that accented letters didn't count in the \w match; I figured they were still alphanumeric. Well, that's kind of annoying. I was using that substitution to normalize the string and remove anything like commas and semicolons. Now I have to make a list of all the characters I want to remove, instead of being able to just specify the ones I want to keep. Oh well, at least we've found the problem. Still, is there some way to strip the accent from an accented letter and keep the base letter? That would seem a better solution than listing out everything that is not a letter, number, space, or letter with an accent.
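        One possible route, sketched under some assumptions: that the data arrives as UTF-8 bytes, and that decoding it with Encode is acceptable in this script. On a decoded string \w does match accented letters, so the original substitution keeps them; it is only on raw bytes that they get stripped. To drop the accents themselves, the core Unicode::Normalize module can decompose each letter so the combining marks can be removed. The helper names strip_punct and strip_accents and the sample text are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical helpers; the names are made up for this sketch.
sub strip_punct {
    my ($s) = @_;
    $s =~ s/[^\w\s]//g;    # the substitution from the post above
    return $s;
}

sub strip_accents {
    my ($s) = @_;
    $s = NFD($s);          # split accented letters into base letter + combining mark
    $s =~ s/\p{Mn}//g;     # remove the combining (nonspacing) marks
    return $s;
}

my $bytes = "Caf\xC3\xA9, cr\xC3\xA8me; br\xC3\xBBl\xC3\xA9e";    # UTF-8 bytes as fetched
my $text  = decode('UTF-8', $bytes);                              # decoded characters

my $from_bytes = strip_punct($bytes);    # accented letters vanish with the punctuation
my $from_text  = strip_punct($text);     # accented letters survive
my $plain      = strip_accents($from_text);

print "$plain\n";
```

        Note that this only handles letters that decompose into a base letter plus a mark. Characters such as ø or ß have no such decomposition and would pass through unchanged.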