Charset tornment

Mysjkin has asked for the wisdom of the Perl Monks concerning the following question:

Dear Brethren,

I am trying to convince the regex machine that, e.g., an 'å' is a letter.

I am processing reference lists to put all references in a common form. I have them stored in Postgres, but I need to tear apart the Author fields to get them into the same form for all references, i.e. they may be written as:
Larry Wall, Tom Christensen & Jon Orwant or
Wall, L., Christensen, T. and Orwant J. or
L. Wall, T. Christensen & J. Orwant etc.

I am using Parse::RecDescent to do this, and part of it is a regex picking out the names, which preferably should have been:
/[[:upper:]][[:lower:]]+/
(presently disregarding a few problems such as double names, Mac... names etc.) This works fine for some names, but then names like Ås, Lönsjö, Magnússon and so on start to mess things up... My present name match is:
/[A-ZÆØÅ][a-zæøåäöüúáóé]+/
which does not look nice...
I have tried `use utf8;` but it causes a lot of errors like: Malformed UTF-8 character (unexpected non-continuation byte 0xd8, immediately after start byte 0xc6) in index at /opt/perl5/lib/site_perl/5.8.0/Parse/RecDescent.pm line 2979.

I have looked into `use locale;` but have so far not understood what I should do to tell the regex machine to have a wider understanding of charsets than just plain ASCII.
Mysjkin


Replies are listed 'Best First'.
Re: Charset tornment by zentara (Cardinal) on Jan 27, 2003 at 14:44 UTC
Maybe this will give you some ideas.

```perl
#!/usr/bin/perl
# Solution for matching "é"
use locale ;
my $test = "é é é é " ;
$test =~ /[[:alpha:]]/ and print "YEAH\n" ;
#OR
$test =~ /[\w]/ and print "Thanx\n" ;
```
Re: Charset tornment by FamousLongAgo (Friar) on Jan 27, 2003 at 14:02 UTC
I think you're on the right track with `use utf8`, but the error message you get suggests that your input and/or program file may not be saved in that encoding. Try saving both the input file and the Perl script in UTF-8 and see what happens (most modern text editors let you set the encoding). You can check that the conversion worked by viewing the documents in a Web browser, for example, with UTF-8 selected as your charset.
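To make the suggestion above concrete, here is a minimal sketch, assuming both the script and the data really are UTF-8. It decodes raw bytes into a Perl character string with the core Encode module and then matches with Unicode properties instead of a hand-written letter list. The helper name `capitalised_words` and the sample string are made up for illustration; they are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # assumption: this script is saved as UTF-8
use Encode qw(decode encode);

# Hypothetical helper: return every "uppercase letter followed by
# lowercase letters" run found in an already-decoded character string.
sub capitalised_words {
    my ($text) = @_;
    return $text =~ /(\p{Lu}\p{Ll}+)/g;
}

# Simulate raw UTF-8 bytes as they might arrive from a file or database.
my $sample = 'Ås, Lönsjö & Magnússon';
my $bytes  = encode('UTF-8', $sample);

# Turn the bytes back into characters before any regex sees them.
my $chars = decode('UTF-8', $bytes);

my @names = capitalised_words($chars);

# Encode characters again on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print join('|', @names), "\n";
```

The point of the sketch is the order of operations: decode on input, match on characters, encode on output. If the data is actually Latin-1 rather than UTF-8, the same shape should work with `decode('ISO-8859-1', ...)` instead.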
Re: Charset tornment by diotalevi (Canon) on Jan 27, 2003 at 14:52 UTC
Your expression does not work for all names. I hardly consider myself unusual, but I'm "Joshua ben Jore". The middle part is specifically lower case. I'm not in your list of references since I haven't written any books; I'm just alerting you that your regex is not sufficient as a name recognizer. Also consider "Natalie Johnson-Lee". There you've got a hyphenated last name, which is pretty common. This is what I thought of on the spot; I'm sure you can come up with more. [Update: Apparently you address this elsewhere. My bad.] You probably need to set the locale. I don't really understand the locale system, having never used it, but I'd like to refer you to perllocale as a starting place.
Seeking Green geeks in Minnesota
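The two counterexamples above can be folded into the pattern. What follows is only a sketch: the particle list (`ben`, `van`, `von`, `de`, `der`) and the helper name `looks_like_name` are illustrative assumptions, and it still does not cover "Mac..." names or initials. It uses Unicode properties so that Ås or Lönsjö are handled without listing letters by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # assumption: this script is saved as UTF-8

# One capitalised part, optionally hyphenated: "Wall", "Johnson-Lee", "Ås".
my $cap = qr/\p{Lu}\p{Ll}+(?:-\p{Lu}\p{Ll}+)*/;

# A deliberately tiny, illustrative list of lower-case name particles.
my $particle = qr/(?:ben|van|von|de|der)/;

# A name: capitalised parts separated by whitespace, where any run of
# particles must be followed by another capitalised part.
my $name = qr/$cap(?:\s+(?:$particle\s+)*$cap)*/;

# Hypothetical helper: true if the whole string looks like one name.
sub looks_like_name {
    my ($candidate) = @_;
    return $candidate =~ /\A$name\z/ ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $candidate ('Joshua ben Jore', 'Natalie Johnson-Lee', 'larry wall') {
    printf "%-20s %s\n", $candidate,
        looks_like_name($candidate) ? 'name' : 'not a name';
}
```

Inside a Parse::RecDescent grammar the same `$cap` and `$particle` pieces could presumably become separate terminals, which may be easier to extend than one large regex.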
Re: Charset tornment by mattr (Curate) on Jan 27, 2003 at 15:57 UTC
I don't see cedillas or circumflexes in your list... presumably ibid. and op. cit. don't foul you up. Why not choose "word" characters and non-word characters? I may have missed something here, like punctuation being included in the class?
Re: Re: Charset tornment by Mysjkin (Initiate) on Jan 28, 2003 at 07:08 UTC
mattr wrote:
> Why not choose "word" characters and non-word characters?
That is exactly what I would like to do, but as I have things set up at the moment, `\w` only matches `[A-Za-z0-9_]`. The reason I do not have any cedillas or circumflexes is that they have not so far turned up in the names I am parsing...
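One possible explanation for `\w` staying ASCII-only is that the regex is being run against undecoded bytes rather than characters. The sketch below, which assumes UTF-8 input and uses a made-up helper `is_word`, compares the two cases on the same name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # assumption: this script is saved as UTF-8
use Encode qw(decode encode);

# Hypothetical helper: true if the string consists only of \w characters.
sub is_word {
    my ($string) = @_;
    return $string =~ /\A\w+\z/ ? 1 : 0;
}

my $name  = 'Lönsjö';
my $bytes = encode('UTF-8', $name);    # what a file or socket hands you
my $chars = decode('UTF-8', $bytes);   # what the regex should be given

print 'bytes: ', (is_word($bytes) ? 'all \w' : 'not all \w'), "\n";
print 'chars: ', (is_word($chars) ? 'all \w' : 'not all \w'), "\n";
```

On the byte string each accented letter is two separate bytes, neither of which is a letter to the regex; on the decoded string it is one character with the Unicode "word" property.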
Re: Re: Re: Charset tornment by mattr (Curate) on Jan 28, 2003 at 14:52 UTC
Ah. I can tell you I got the same errors (actually they kept printing until I ran out of memory, which wasn't fun) compiling Jcode.pm on Perl 5.8 (since 5.8 has jcode.pl, which to me is obsolete, I use the object-oriented module instead). I think the latest version of the module now works, so you could see what the difference between them is. Jcode.pm has an English man page so there should be no trouble there; just get the tgz, don't bother installing it. Caveat: I have a hazy memory, so maybe try to compile both. Anyway, it's really not a big module at all.

Also, I can say that in the past I have used ShiftJIS::Regexp to do regexes on Japanese (not using Unicode, just SJIS-encoded Japanese, which is 8-bit strings). Possibly the re or match functions might actually work for you as-is. Like someone said, check your locale and maybe run some tests.

For the record, when I got the same errors as you I was doing a vanilla install of RedHat 8.0 with their Perl from RPM, and I was also tearing things apart to wedge it into a very small old hard disk. So maybe we had the same problem...