PerlMonks
Re: A bit more complex resorting
by fizbin (Chaplain)
on Aug 21, 2005 at 16:20 UTC ( [id://485545]=note )
I have to disagree with the previous poster - there's nothing really database-y about what you want to do. I'm not going to do everything, just the bits that I found interesting. Note that the following works reliably only on perl 5.8 and above.

The interesting part of what you ask is removing all those accents. The easiest way to do that is with the Unicode::Normalize module, which has shipped with perl since 5.8; on older perls you'll need to install it via CPAN. This module gives you access to the various Unicode normalization forms; the one we'll use is called NFKD, which splits each accented letter that has a decomposition into a multi-character sequence of base letter + combining accent mark. Then you can use a regular expression, relying on perl's support for unicode properties, to remove any character that has the "mark" property. (That's what \pM is doing below.)

So here's code that'll do what you want, except for splitting the lines into categories and setting FOO, BAR and BAZ from the command line, both of which should be easy changes to make.
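The listing from the original post isn't reproduced in this copy, so what follows is only a minimal sketch of the technique described above, not the code as posted. The helper name strip_accents and the sample word list are made up for illustration; NFKD and \pM are the pieces named in the text. It assumes the input is already a character string (decoded, not raw bytes).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: decompose the string with NFKD, then delete
# every character that carries the Unicode "mark" property.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFKD($text);   # e-acute becomes "e" + U+0301
    $decomposed =~ s/\pM//g;        # drop the combining marks
    return $decomposed;
}

# Illustrative data only: sort on a lowercased, accent-free key while
# keeping the original spelling for output.
my @words = ("\x{e9}clair", "zebra", "Eclipse", "\x{e0} la carte");

my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ lc strip_accents($_), $_ ] } @words;

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @sorted;
```

Letters that have no decomposition (a slashed o, for instance) pass through untouched, so a lookup table may still be needed for those.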
And that's it. For older perl versions, you'd probably have to build a lookup table by hand that maps each accented letter to its non-accented equivalent.

Update: Changed the code to something that'll work in perl 5.6 and higher, though this code is highly fragile on perls that old, and the slightest change is liable to cause your output to spring back to utf-8.