Re: Cleaning Data Between Specified Columns

Using substr for this is kinda tricky. If one of the earlier ranges contains 's, and you delete them, that screws up the indexes for later ranges. One solution is to substitute a known-not-present char (I used \x7F) for ' whilst processing the ranges and then remove these from the resultant line.

Update:A modified version to deal with replacing 's with spaces at the end of the field rather than deleting them entirely. Makes use of Fletch's neat trick and Aristotle's enhancement to it, now that is possible.

#! perl -sw
use warnings;
use strict;

open( FILE, '<', shift) or die "Couldn't open $::FILE; $!";
@ARGV = map{ [ split(/-/, $_, 2) ] } @ARGV
    or die 'usage $0: file c1-c2 [ c1-c2 [ ... ] ] >modified_file';
while (my $line = <FILE>) {
    for (  @ARGV ) {
        next if $_->[0] > length $line;
        $_->[1] = length $line if $_->[1] > length $line;
        local *_ = \substr($line, $_->[0], $_->[1]-$_->[0] + 1);
        tr[a-zA-Z0-9\n\|\-'][ ]c;
        $_ .= ' ' x tr['][]d;
    }
    print $line;
}
close FILE;
[download]

Original version

#! perl -sw
use warnings;
use strict;

open( FILE, '<', shift) or die "Couldn't open $::FILE; $!";
@ARGV = map{ [ split(/-/, $_, 2) ] } @ARGV
    or die 'usage $0: file c1-c2 [ c1-c2 [ ... ] ] >modified_file';
while (my $line = <FILE>) {
    for (  @ARGV ) {
        next if $_->[0] > length $line;
        $_->[1] = length $line if $_->[1] > length $line;
        substr($line, $_->[0], $_->[1]-$_->[0] + 1) =~
            s[(^.*$)]
            {
                local $a = $1;
                $a =~ tr[a-zA-Z0-9\n\|\-'][ ]c;
                $a =~ tr['][\x7f];
                $a;
            }e;
        $line =~ tr[\x7F][]d;
    }
    print $line;
}
close FILE;
[download]

Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Comment on Re: Cleaning Data Between Specified Columns Select or Download Code

Replies are listed 'Best First'.
Re: Re: Cleaning Data Between Specified Columns by enoch (Chaplain) on Jan 27, 2003 at 22:17 UTC
++BrowserUK! I was bitten by my squashing of apostrophes. Because it was a fixed width file, squashing the apostrophes caused the width's to change. I changed the regex to: `s/(.)'(.\B*)/$1$2 /g` [download] So, that spacing was added for each apostrophe I pulled from any field. enoch	[reply] [d/l]
Re: Re: Re: Cleaning Data Between Specified Columns by BrowserUk (Patriarch) on Jan 27, 2003 at 22:26 UTC
In that case, there's no need for this line `$tmpString =~ s/(.)'(.)/$1$2/g; # squash apostrophes` [download] As the next line `$tmpString =~ tr/a-zA-Z0-9\n\\|\-/ /c; # remove bad characters` [download] will convert them to spaces anyway? Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.	[reply] [d/l] [select]
Re: Re: Re: Re: Cleaning Data Between Specified Columns by enoch (Chaplain) on Jan 28, 2003 at 02:31 UTC
No, could't do that because I need apostrophe's squashed. For example, I need `O'Connor` to become `OConnor` and not `O Connor`. However, I did need something like `you@you.com` to become `you you.com`. Weird requirements, I know. I am going to play around with what Fletch did here and see what I can come up with. Thanks, everyone, enoch	[reply] [d/l] [select]
Re^2: Cleaning Data Between Specified Columns by Aristotle (Chancellor) on Jan 28, 2003 at 01:54 UTC
Using "known nonexistant" characters is just asking for trouble.. it's a practice I've come to regard as a huge red flag. In this particular case and with Perl being Perl, the proper solution is surprising but very neat. Fletch++ Makeshifts last the longest.	[reply]
Re: Re^2: Cleaning Data Between Specified Columns by BrowserUk (Patriarch) on Jan 28, 2003 at 03:05 UTC
Sorry Aristotle. Fletch's (partial) solution, neat as the technique is, falls foul of the fact that deleting the apostrophies in a one range, causes all the subsequent columns to shift. Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.	[reply]
Re^4: Cleaning Data Between Specified Columns by Aristotle (Chancellor) on Jan 28, 2003 at 09:35 UTC
I should have tested. Anyway, in this case, it's a simple matter of changing the order of operations: `{ local _ = \substr $source, $start, $len; y/a-zA-Z0-9\n\\|-/ /c; y/'//d; }` [download] However, that obviously only works if there's only one operation affecting length. For a more general case, I'd do something like this (untested): #!/usr/bin/perl -w use strict; my @range = map /^(\d+)-(\d+)$/, sort { $a <=> $b } splice @ARGV, 1; unshift @range, 0; $range[$_] = 1 + $range[$_+1] - $range[$_] for 0 .. $#range-1; $range[-1] = ''; die "Negative length field specified" if grep $_ < 0, @range[0 .. $#range-1]; my $fmt = join " ", map "A$_", @range; # pick odd numbered elements my @selected = map 1 + $_ * 2, 0..$range_/2; while(<>) { my @field = unpack $fmt, $_; for (@field[@selected]) { tr/a-zA-Z0-9\n\\|\-'/ /c; tr/'//d; } print join '', @field; } [download] The point is to structure your data whenever possible. An array element end is never ambiguous, a `\x7F` can happen to be, and in my case, whatever my mark character, I've always been bitten by it. Makeshifts last the longest.	[reply] [d/l] [select]