Melly has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I'm running the following script against some CSV files prior to doing anything else. It basically checks for non-ASCII characters, and either makes the specified conversion or shows the value that it didn't have a conversion for.
However, others will need to make use of the script, which means that adding new regexes to it would be inconvenient. We use SVN, but that's still a kludge.
Can anyone see a good way to load the regexes from an external file?
Also, and unrelated, do I need to use binmode? It seems to be advised, but I thought it only affected line endings, and it means that I have to do my own line-ending translation, which seems a bit redundant. And can I check: do I need "use utf8", given that I'm just searching for octal sequences? (IIRC, yes.)

    use strict;
    use warnings;
    use utf8;
    use File::Basename;

    my @files  = glob($ARGV[0]);
    my $outdir = $ARGV[1];
    my $debug  = $ARGV[2];

    die "No output directory given\n" unless defined $outdir && -d $outdir;

    $outdir =~ s/\\/\//g;          # backslash to forward
    $outdir =~ s/([^\/])$/$1\//;   # add final slash if missing

    foreach my $file (@files) {
        my $outfile = $outdir . basename($file);   # $outdir already ends in a slash

        open(my $csv, '<', $file)    or die "Cannot open $file for read: $!\n";
        binmode $csv;
        open(my $out, '>', $outfile) or die "Cannot open $outfile for write: $!\n";

        while (my $line = <$csv>) {
            # binmode, so we're possibly still stuck with \r\n DOS endings - why are we using binmode?
            $line =~ s/\x0D\x0A/\n/g;

            if ($line =~ /[^[:ascii:]]/) {
                print "Before: $line\n" if $debug;

                # translations from octal (UTF-8 byte) sequence to ascii char
                $line =~ s/\302\267/./g;            # odd utf 'floating' point (middle dot) to ascii .
                $line =~ s/\342\200\230/'/g;        # left single curly quote to ascii '
                $line =~ s/\342\200\231/'/g;        # right single curly quote to ascii '
                $line =~ s/\342\200\223/-/g;        # en-dash (U+2013) to ascii -
                $line =~ s/\303\257/i/g;            # double-dot i to ascii i
                $line =~ s/\302\243/GBP/g;          # pound sign to GBP
                $line =~ s/\342\200\246/.../g;      # ellipsis to ascii ...
                $line =~ s/\302\256/(R)/g;          # registered sign (U+00AE) to (R)
                $line =~ s/\303\250/e/g;            # grave e to e
                $line =~ s/\303\251/e/g;            # acute e to e
                $line =~ s/\342\211\244/<=/g;       # utf <= (U+2264) to ascii <=
                $line =~ s/\342\211\245/>=/g;       # utf >= (U+2265) to ascii >=
                $line =~ s/\303\264/o/g;            # circumflex o (?!?) to ascii o
                $line =~ s/\302\240/ /g;            # nbsp to sp
                $line =~ s/\302\263/^3/g;           # superscript 3 to ^3
                $line =~ s/\302\262/^2/g;           # superscript 2 to ^2
                $line =~ s/\302\260/ degrees/g;     # degrees symbol to word ' degrees'
                $line =~ s/\342\200\235/""/g;       # right double curly quote to ascii " (escaped for csv)
                $line =~ s/\342\200\234/""/g;       # left double curly quote to ascii " (escaped for csv)
                $line =~ s/\302\275/1\/2/g;         # utf 1/2 to ascii plain 1/2

                # anything still non-ascii is shown as [decimal/hex/octal]
                if ($line =~ /[^[:ascii:]]/) {
                    $line =~ s{([^[:ascii:]])}{sprintf('[%d/0x%X/%o]', ord $1, ord $1, ord $1)}ge;
                    print "Unhandled sequence: $line\n";
                }
                print "After: $line\n" if $debug;
            }
            print $out $line;
        }

        close $csv;
        close $out or die "Cannot close $outfile: $!\n";
    }

Tom Melly, pm (at) cursingmaggot (stop) co (stop) uk

    map{$a=1-$_/10;map{$d=$a;$e=$b=$_/20-2;map{($d,$e)=(2*$d*$e+$a,$e**2 -$d**2+$b);$c=$d**2+$e**2>4?$d=8:_}1..50;print$c}0..59;print$/}0..20
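
One possible way to keep the substitutions outside the script is a plain text rules file that is parsed at start-up. The sketch below is only an illustration, not a method taken from the thread: the file layout (one `PATTERN<TAB>REPLACEMENT` pair per line, `#` for comments) and the names `parse_rules`, `load_rules` and `apply_rules` are all assumptions. Patterns are written with `\xHH` escapes so they cannot be mistaken for backreferences, and replacements are treated as literal text, so nothing in the file is evaluated as Perl code.

```perl
use strict;
use warnings;

# Parse rule lines of the assumed form "PATTERN<TAB>REPLACEMENT".
# Blank lines and lines starting with '#' are skipped.
sub parse_rules {
    my @lines = @_;
    my @rules;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;
        my ($pattern, $replacement) = split /\t/, $line, 2;
        $replacement = '' unless defined $replacement;
        # Escapes such as \xC2 in $pattern are interpreted by the
        # regex compiler when the pattern is interpolated into qr//.
        push @rules, [ qr/$pattern/, $replacement ];
    }
    return @rules;
}

# Read rules from a file (the path is whatever the caller supplies).
sub load_rules {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path for read: $!\n";
    my @rules = parse_rules(<$fh>);
    close $fh;
    return @rules;
}

# Apply every rule in order; the replacement is inserted literally.
sub apply_rules {
    my ($line, @rules) = @_;
    for my $rule (@rules) {
        my ($re, $replacement) = @$rule;
        $line =~ s/$re/$replacement/g;
    }
    return $line;
}

# Demo with in-memory rule lines instead of a real file.
my @demo_rules = parse_rules(
    "# middle dot to ascii .\n",
    "\\xC2\\xB7\t.\n",
    "\\xC2\\xA3\tGBP\n",
);
print apply_rules("3\xC2\xB7" . "5 costs \xC2\xA3" . "2\n", @demo_rules);
# prints "3.5 costs GBP2"
```

In the script itself, something like `my @rules = load_rules($rules_file);` before the file loop and `$line = apply_rules($line, @rules);` in place of the hard-coded `s///` lines would be enough (`$rules_file` being a hypothetical extra argument). Because rules are applied in file order, colleagues can add a conversion by appending one line to the rules file.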
Re: Read RegEx from file
by 1nickt (Canon) on Jan 14, 2016 at 16:27 UTC
by Melly (Chaplain) on Jan 14, 2016 at 16:42 UTC
by 1nickt (Canon) on Jan 14, 2016 at 16:55 UTC

Re: Read RegEx from file
by AnomalousMonk (Archbishop) on Jan 14, 2016 at 23:12 UTC
by Melly (Chaplain) on Jan 15, 2016 at 12:44 UTC
by CountZero (Bishop) on Jan 16, 2016 at 09:50 UTC
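
On the side question in the original post about binmode and `use utf8`: as far as the core documentation goes, `use utf8` only declares that the script's own source text is UTF-8, so it should make no difference to patterns written as octal escapes, and `binmode` with no layer makes the handle deliver raw bytes (which is why CRLF has to be handled by hand on Windows). The following sketch, which is an illustration rather than anything proposed in the thread, contrasts matching raw UTF-8 bytes with matching decoded characters via the core `Encode` module; the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "it" + RIGHT SINGLE QUOTATION MARK (U+2019) + "s",
# as they would arrive from a filehandle in binmode.
my $bytes = "it\xE2\x80\x99s";

# Decoding turns the three-byte sequence into one character.
my $chars = decode('UTF-8', $bytes);

printf "%d bytes, %d characters\n", length $bytes, length $chars;
# prints "6 bytes, 4 characters"

# Byte-oriented match, as in the original script (octal escapes).
(my $by_bytes = $bytes) =~ s/\342\200\231/'/g;

# Character-oriented match on the decoded string (code point escape).
(my $by_chars = $chars) =~ s/\x{2019}/'/g;

print "$by_bytes\n$by_chars\n";   # both lines read: it's
```

Either approach works as long as the pattern style matches how the data was read: byte escapes for raw handles, code points for decoded text. Mixing the two is the usual source of rules that silently never match.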