Re: A bit more complex resorting

by fizbin (Chaplain)
on Aug 21, 2005 at 16:20 UTC

in reply to A bit more complex resorting

I have to disagree with the previous poster - there's nothing really database-y about what you want to do.

I'm not going to do everything, just the bits that I found interesting.

Note that the following works reliably only on perl 5.8 and above.
Although you may get the following to work in an earlier perl, the outcome will likely not be what you want. Specifically, the output is likely to be in utf8, which I strongly suspect is not what you wanted.

The interesting thing in what you ask is to remove all those accents. The easiest way to do that is with the Unicode::Normalize module. It ships with perl 5.8 and later; on an older perl you'll need to install it from CPAN. This module gives you access to the various Unicode normalization forms; the one we'll use is called NFKD, which splits accented letters into multi-character sequences of letter + combining accent mark. Then you can use a regular expression, relying on perl's support for Unicode properties, to remove any character that has the "mark" property. (That's what \pM is doing below.)
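To see the two steps in isolation, here is a minimal sketch. The helper name strip_accents and the sample name are mine, purely for illustration; only NFKD and \pM come from the approach described above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# strip_accents is a hypothetical helper name, used only for this illustration.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFKD($text);   # e.g. U+00E7 becomes "c" + U+0327 COMBINING CEDILLA
    $decomposed =~ s/\pM//g;        # delete every character with the Mark property
    return $decomposed;
}

# \x{E7} is c-cedilla, \x{151} is o with double acute
print strip_accents("Fran\x{E7}ois Erd\x{151}s"), "\n";
```

One caveat: this only helps for letters that actually decompose. A letter such as the slashed o (U+00F8) has no decomposition, so it passes through unchanged and would need separate handling.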

So here's code that'll do what you want, except for splitting the lines into categories and setting FOO, BAR and BAZ from the command line, both of which should be easy changes to make.

    #! perl
    use Unicode::Normalize;   # for the NFKD function
    use strict;
    use warnings;

    my ($FOO, $BAR, $BAZ) = qw(FOO BAR BAZ);

    # uses system default encoding for INFILE; say
    # '<:encoding(iso-8859-1)' to explicitly use iso-latin-1
    open(INFILE, '<', 'test1.txt') or die "Can't open test1.txt: $!";
    while (<INFILE>) {
        chomp;
        my ($category, $fornom, $surnom, $pass, @rest) = split;
        die "Extra crud at the end of the line: @rest" if (@rest);

        my ($squashed_fornom) = NFKD($fornom);     # NFKD separates accented letters
                                                   # into letters + combining mark
        $squashed_fornom =~ s/\pM//g;              # remove marks
        $squashed_fornom = lc($squashed_fornom);   # lowercase

        my ($squashed_surnom) = NFKD($surnom);
        $squashed_surnom =~ s/\pM//g;
        $squashed_surnom = lc($squashed_surnom);

        print "$squashed_fornom $squashed_surnom";
        print "|$pass|$fornom $surnom|$FOO|$BAR|$BAZ|$FOO $fornom $surnom\n";
    }
    close(INFILE);

And that's it. For older perl versions, you'd probably have to go through and manually create a lookup table to convert from an accented letter to a non-accented letter.
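If you do end up going the lookup-table route, it might look roughly like the sketch below. This assumes the input is iso-latin-1 bytes; the table is deliberately incomplete, and the names %unaccent and unaccent_latin1 are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical, deliberately incomplete table: iso-8859-1 byte => unaccented letter.
# Upper-case accented letters would need their own entries.
my %unaccent = (
    "\xE0" => 'a', "\xE1" => 'a', "\xE2" => 'a', "\xE4" => 'a',
    "\xE7" => 'c',
    "\xE8" => 'e', "\xE9" => 'e', "\xEA" => 'e', "\xEB" => 'e',
    "\xF1" => 'n',
    "\xF6" => 'o', "\xFC" => 'u',
);

sub unaccent_latin1 {
    my ($bytes) = @_;
    # Replace each high-bit byte that has a table entry; leave the rest alone.
    $bytes =~ s/([\x80-\xFF])/exists $unaccent{$1} ? $unaccent{$1} : $1/ge;
    return $bytes;
}

print unaccent_latin1("Fran\xE7ois Pe\xF1a"), "\n";
```

The obvious downside is that you have to enumerate every accented letter you expect to meet, which is exactly the work NFKD saves you.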

Update: Changed the code to something that'll work in perl 5.6 and higher, though this code is highly fragile on perls that old, and the slightest change is liable to cause your output to spring back to utf-8.
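On 5.8 and above, one way to make the output encoding less of a guessing game is to put an explicit encoding layer on the output handle. The following is only a sketch under that assumption: it writes to an in-memory buffer so the bytes are easy to inspect, and the sample name is invented. For real output, binmode(STDOUT, ':encoding(iso-8859-1)') would be the analogue.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# In-memory filehandle with an explicit iso-latin-1 output layer (needs perl 5.8+).
my $buffer = '';
open(my $out, '>:encoding(iso-8859-1)', \$buffer) or die "Cannot open buffer: $!";

my $fornom = "Ren\x{E9}e";                     # e-acute as a character
(my $squashed = NFKD($fornom)) =~ s/\pM//g;    # decompose, then drop the marks

print {$out} lc($squashed), "|$fornom\n";
close($out);

# $buffer now holds iso-latin-1 bytes: the e-acute is the single byte 0xE9,
# not the two-byte utf-8 sequence.
```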

-- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
