in reply to Cleaning Data Between Specified Columns

Using substr for this is kinda tricky. If one of the earlier ranges contains 's, and you delete them, that screws up the indexes for later ranges. One solution is to substitute a known-not-present char (I used \x7F) for ' whilst processing the ranges and then remove these from the resultant line.

Update:A modified version to deal with replacing 's with spaces at the end of the field rather than deleting them entirely. Makes use of Fletch's neat trick and Aristotle's enhancement to it, now that is possible.

#! perl -sw use warnings; use strict; open( FILE, '<', shift) or die "Couldn't open $::FILE; $!"; @ARGV = map{ [ split(/-/, $_, 2) ] } @ARGV or die 'usage $0: file c1-c2 [ c1-c2 [ ... ] ] >modified_file'; while (my $line = <FILE>) { for ( @ARGV ) { next if $_->[0] > length $line; $_->[1] = length $line if $_->[1] > length $line; local *_ = \substr($line, $_->[0], $_->[1]-$_->[0] + 1); tr[a-zA-Z0-9\n\|\-'][ ]c; $_ .= ' ' x tr['][]d; } print $line; } close FILE;

Original version

#! perl -sw use warnings; use strict; open( FILE, '<', shift) or die "Couldn't open $::FILE; $!"; @ARGV = map{ [ split(/-/, $_, 2) ] } @ARGV or die 'usage $0: file c1-c2 [ c1-c2 [ ... ] ] >modified_file'; while (my $line = <FILE>) { for ( @ARGV ) { next if $_->[0] > length $line; $_->[1] = length $line if $_->[1] > length $line; substr($line, $_->[0], $_->[1]-$_->[0] + 1) =~ s[(^.*$)] { local $a = $1; $a =~ tr[a-zA-Z0-9\n\|\-'][ ]c; $a =~ tr['][\x7f]; $a; }e; $line =~ tr[\x7F][]d; } print $line; } close FILE;

Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Replies are listed 'Best First'.
Re: Re: Cleaning Data Between Specified Columns
by enoch (Chaplain) on Jan 27, 2003 at 22:17 UTC
    ++BrowserUK!

    I was bitten by my squashing of apostrophes. Because it was a fixed width file, squashing the apostrophes caused the width's to change. I changed the regex to:
    s/(.)'(.\B*)/$1$2 /g
    So, that spacing was added for each apostrophe I pulled from any field.

    enoch

      In that case, there's no need for this line

      $tmpString =~ s/(.)'(.)/$1$2/g; # squash apostrophes

      As the next line

      $tmpString =~ tr/a-zA-Z0-9\n\|\-/ /c; # remove bad characters

      will convert them to spaces anyway?


      Examine what is said, not who speaks.

      The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

        No, could't do that because I need apostrophe's squashed. For example, I need O'Connor to become OConnor and not O Connor. However, I did need something like you@you.com to become you you.com.

        Weird requirements, I know. I am going to play around with what Fletch did here and see what I can come up with.

        Thanks, everyone,
        enoch
Re^2: Cleaning Data Between Specified Columns
by Aristotle (Chancellor) on Jan 28, 2003 at 01:54 UTC
    Using "known nonexistant" characters is just asking for trouble.. it's a practice I've come to regard as a huge red flag. In this particular case and with Perl being Perl, the proper solution is surprising but very neat. Fletch++

    Makeshifts last the longest.

      Sorry Aristotle. Fletch's (partial) solution, neat as the technique is, falls foul of the fact that deleting the apostrophies in a one range, causes all the subsequent columns to shift.


      Examine what is said, not who speaks.

      The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

        I should have tested. Anyway, in this case, it's a simple matter of changing the order of operations:
        { local *_ = \substr $source, $start, $len; y/a-zA-Z0-9\n\|-/ /c; y/'//d; }
        However, that obviously only works if there's only one operation affecting length. For a more general case, I'd do something like this (untested):
        #!/usr/bin/perl -w use strict; my @range = map /^(\d+)-(\d+)$/, sort { $a <=> $b } splice @ARGV, 1; unshift @range, 0; $range[$_] = 1 + $range[$_+1] - $range[$_] for 0 .. $#range-1; $range[-1] = '*'; die "Negative length field specified" if grep $_ < 0, @range[0 .. $#range-1]; my $fmt = join " ", map "A$_", @range; # pick odd numbered elements my @selected = map 1 + $_ * 2, 0..$range_/2; while(<>) { my @field = unpack $fmt, $_; for (@field[@selected]) { tr/a-zA-Z0-9\n\|\-'/ /c; tr/'//d; } print join '', @field; }
        The point is to structure your data whenever possible. An array element end is never ambiguous, a \x7F can happen to be, and in my case, whatever my mark character, I've always been bitten by it.

        Makeshifts last the longest.