Remove empty column(s) from unpack template

daseme has asked for the wisdom of the Perl Monks concerning the following question:

Wise Monks,

BrowserUk graciously wrote code to determine column boundaries for files for which the column positions are not known before hand. However, the code occasionally identifies "false" columns.

I am trying to implement a slightly modified version of the heuristic suggested by BrowserUk here.

The steps

identify cols for which over 50% of the fields are empty
remove those "mostly" empty cols from the unpack template
add the col width from the "mostly" empty cols to the appropriate col in the template

The appropriate col is the col to the left if it is left-aligned, or the col to the right if it is right-aligned

If it helps, I think that I will only have problems with false columns being created from either of the first two cols.

This has been my biggest code challenge to date. And while the code appears to accomplish these goals, it seems cumbersome and perhaps flawed. I am asking for your help to identify improvements.

#!/usr/bin/perl
use strict;
use warnings;

my @lines = <DATA>;
## Pass 1. OR the records with a mask of spaces
my $mask = chr(32) x length $lines[0];

$mask |= $_ for @lines;

## Detect the spaces that remain and build the template
my $templ = '';
$templ .= 'a' . length($1) . 'x' . length($2) . ' '
  while $mask =~ m[([^ ]+)( +|$)]g;
$templ        =~ s[x\d+\s+$][];       ## Strip redundant last 'xN'
print "original template: " . $templ . "\n\n";

########################
## BEGIN false cols code
########################

## setup the variables
my $count_empty = -1;
my (@fields,@empty,%empty_hash,@AoAfields,@right_align,@left_align);
## create array of unpacked lines
foreach my $line (@lines) {
    push @fields, join '|', unpack($templ, $line);
}
## create AoA so we can process cols
for my $i ( 0 .. $#fields ) {
    $AoAfields[$i] = [ split /\|/, $fields[$i] ];
    }
## loop through AoA finding alignment and empty cols
for my $i ( 0 .. $#AoAfields ) { # for every row in AoA
    my $aref = $AoAfields[$i];
    my $col_numbers = @$aref - 1;
    for my $j ( 0 .. $col_numbers ) { # for every col in AoA
            if ($AoAfields[$i][$j]=~ /^\s+\S/) { # find right-aligned
                push @right_align, $j; 
            }
            if ($AoAfields[$i][$j]=~ /\S\s+$/) { # find left-aligned
                push @left_align, $j; 
            }
            if ($AoAfields[$i][$j]=~ /^\s+$/) { # find fields w/ only 
+spaces
                $count_empty++;
                $empty_hash{$j} = $count_empty;
            }
    }
}
## first remove duplicates in arrays 
&remove_duplicates(\@left_align);
&remove_duplicates(\@right_align);
my ($key, $val);
while (($key, $val) = each(%empty_hash)){
    if ($val/($#AoAfields+1)>.5) { #if column more than 50% empty
        push @empty, $key;
    }
}
## create array from template string
my @templs = split(/\s+/,$templ);
# create hashes of left-aligned & right-aligned for grep
my %left_temp;
my %right_temp;
@left_temp{@left_align} = @left_align;
@right_temp{@right_align} = @right_align;
## find out if col to the left/right of empty col is left/ right-align
+ed, rewrite template
foreach my $empty (@empty) {
    if ( grep {  exists $left_temp{$_} } $empty-1) {
        #add column width of empty to the column width of col to left 
+of empty, splice out empty
        my $prev_col =  $templs[$empty-1];
        my $empty_col_width = $templs[$empty]; #get empty col value
        $empty_col_width =~ s/(a)(\d{1,2})(x\d)/$2/; #extract width fr
+om col value
        $prev_col =~ m/(a)(\d{1,2})(x\d)/; #match width into $2
        my $newwidth = $2+$empty_col_width+1;
        $prev_col =~ s/(a)(\d{1,2})(x\d)/$1$newwidth$3/; #replacement 
+for the previous column
        splice(@templs,$empty-1,1,$prev_col);
        splice(@templs,$empty,1);
    }
    if ( grep {  exists $right_temp{$_} } $empty+1) {
        #add column width of empty to the column width of col to right
+ of empty, splice out empty
        my $post_col =  $templs[$empty+1];
        my $empty_col_width = $templs[$empty]; 
        $empty_col_width =~ s/(a)(\d{1,2})(x\d)/$2/; 
        $post_col =~ m/(a)(\d{1,2})(x\d)/;
        my $newwidth = $2+$empty_col_width+1;
        $post_col =~ s/(a)(\d{1,2})(x\d)/$1$newwidth$3/; 
        splice(@templs,$empty+1,1,$post_col);
        splice(@templs,$empty,1);
    }
    
}    
$templ = join ' ', @templs;
print "new template:      " . $templ . "\n\n";

## PM jdporter
sub remove_duplicates(\@)
{
    my $ar = shift;
    my %seen;
    for ( my $i = 0; $i <= $#{$ar} ; )
    {
        splice @$ar, --$i, 1
            if $seen{$ar->[$i++]}++;
    }
}
#####################
# END false cols code
#####################

## Split the records and output delimited by '|'
print join '|', unpack $templ, $_ for @lines;

__DATA__
The First One Here Is Longer. Collie SN      262287630  77312    93871
+  MVP
A  Second (PART) here         First In 20 MT 169287655  506666   61066
+  RTD
3rd Person "Something"        X&Y No SH      564287705  34529    52443
+  RTE
The Fourth Person 20          MLP 4000       360505504  2237     72201
+  VRE
The Fifth Name OR Something   Twin 200 SH    469505179  3530     72201
+  VR
The Sixth Person OR Item      MLP            260505174  3,530   72,201
+  VRE
70 The Seventh Record         MLP            764205122  3530     72201
+  VRE
The Eighth Person MLP         MLP            160545154  3530      7220
+  VRE
[download]

Comment on Remove empty column(s) from unpack template Download Code

Replies are listed 'Best First'.
Re: Remove empty column(s) from unpack template by BrowserUk (Patriarch) on Jul 31, 2007 at 04:42 UTC
Goddamn that's fiddly. So many nasty edge cases. However, I'm fairly sure this covers them all, and it's a little simpler I think: #! perl -slw use strict; my @lines = <DATA>; my $mask = chr(32) x length $lines[ 0 ]; $mask \|= $_ for @lines; my $templ = ''; $templ .= 'a' . length( $1 ) . 'x' . length( $2 ) . ' ' while $mask =~ m[([^ ]+)( +\|$)]g; print $templ; ## Count the false fields and determin the left/right alignments: my( @left, @right, @blanks, $nFields ); for ( @lines ) { my @fields = unpack $templ, $_; $nFields = $#fields; for my $field ( 0 .. $#fields ) { $right[ $field ] = 1 if substr( $fields[ $field ], 0, 1 ) eq +' '; $left[ $field ] = 1 if substr( $fields[ $field ], -1 ) eq +' '; $blanks[ $field ]++ unless $fields[ $field ] =~ +m[\S]; } } my $reField = qr[a\d+x\d+\s]; ## Simplifies template adjustment regex. ## adjusted template fields backwards ## to ensure that we don't screw up the indexing. for my $field ( reverse 0 .. $nFields ) { ## Skip unless a false field next unless defined $blanks[ $field ] and $blanks[ $field ] > ( @lines / 2 ); my $keep; ## Number of template fields to keep ## If the preceding field is left aligned ## and the following is not right aligned if( $field and $left[ $field -1 ] and $field < $nFields and not $right[ $field + 1 ] ) { warn "Amalgamating field $field with the prevous field\n"; $keep = $field - 1; } ## If the preceding field is right aligned ## and the following is not left aligned elsif( $field < $nFields and $right[ $field + 1 ] and $field and not $left[ $field - 1 ] ) { warn "Amalgamating field $field with the next field\n"; $keep = $field; } ## If preceding is left and following is right aligned ## bellyache and do nothing. else { warn "Field $field is probably a false field, but it is not po +ssible\n" . "to determine which adjacent field to amalgamate it with? +"; next; ## Update. } ## Amalgamate the appropriate template fields $templ =~ s[ ( ${reField}{$keep} ) a (\d+) x (\d+) \s a (\d+) x (\d+) \s ][ $1 . 'a' . ($2 + $3 + $4) . 'x' . $5 . ' ' ]xe or warn 'No match'; ## Belache if the regex fails. } print $templ; { local $\; ## Split the records and output delimited by '\|' print join '\|', unpack $templ, $_ for @lines; } __DATA__ The First One Here Is Longer. Collie SN 262287630 77312 93871 + MVP abc de gh A Second (PART) here First In 20 MT 169287655 506666 61066 + RTD abc fgh 3rd Person "Something" X&Y No SH 564287705 45423 52443 + RTE abc gh The Fourth Person 20 MLP 4000 360505504 3530 72201 + VRE abc gh The Fifth Name OR Something Twin 200 SH 469505179 3530 72201 + VRE abc fgh The Sixth Person OR Item MLP 260505174 3,530 72,201 + VRE abc fgh 70 The Seventh Record MLP 764205122 3530 72201 + VRE abc gh The Eighth Person MLP MLP 160545154 3530 7220 + VRE abc gh [download] Produces (Note: the additional test fields): c:\test>628055 a29x1 a11x1 a2x1 a9x2 a6x2 a6x2 a3x2 a3x1 a2x1 a4x0 Amalgamating field 8 with the next field Amalgamating field 2 with the prevous field a29x1 a14x1 a9x2 a6x2 a6x2 a3x2 a3x1 a7x0 The First One Here Is Longer.\|Collie SN \|262287630\|77312 \| 93871\|M +VP\|abc\|de gh A Second (PART) here \|First In 20 MT\|169287655\|506666\| 61066\|R +TD\|abc\| fgh 3rd Person "Something" \|X&Y No SH \|564287705\|45423 \| 52443\|R +TE\|abc\| gh The Fourth Person 20 \|MLP 4000 \|360505504\|3530 \| 72201\|V +RE\|abc\| gh The Fifth Name OR Something \|Twin 200 SH \|469505179\|3530 \| 72201\|V +RE\|abc\| fgh The Sixth Person OR Item \|MLP \|260505174\|3,530 \|72,201\|V +RE\|abc\| fgh 70 The Seventh Record \|MLP \|764205122\|3530 \| 72201\|V +RE\|abc\| gh The Eighth Person MLP \|MLP \|160545154\|3530 \| 7220\|V +RE\|abc\| gh [download] I've also tested the "I don't know what to do" scenario, though that test is not incorporated here. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re: Remove empty column(s) from unpack template by duff (Parson) on Jul 30, 2007 at 22:14 UTC
I haven't looked at your code in depth, but I just wanted to plant an idea in your head. One of the things that used to bug me about Text::Autoformat is that it doesn't handle tabular data too well (I don't know if this is still the case). I've created code similar to BrowerUK's in the past but was unhappy about how often it didn't work quite right. With your additions (assuming they work), it seems like you could parameterize the column finding code such that it can work in a variety of situations. Perhaps even enough that it could be patched into Text::Autoformat :-) duff	[reply]
Re^2: Remove empty column(s) from unpack template by TheDamian (Vicar) on Aug 01, 2007 at 20:21 UTC
I do have plans to add table recognition to Text::Autoformat. Specifically, to port the table recognition code already used in Perl6::Perldoc::Parser. Those following this thread might find that code interesting (search for `/Build entire table/`). Damian	[reply] [d/l]
Re^3: Remove empty column(s) from unpack template by duff (Parson) on Aug 01, 2007 at 21:34 UTC
That's excellent Damian! May the universe align such that all of your plans come to fruition. duff	[reply]
Re^2: Remove empty column(s) from unpack template by BrowserUk (Patriarch) on Jul 30, 2007 at 23:04 UTC
How does inferring field positions in fixed record format files relate to outputting data in tabular format? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^3: Remove empty column(s) from unpack template by duff (Parson) on Jul 31, 2007 at 19:18 UTC
It's not the output so much as it is just recognizing that the data is in a table. Text::Autoformat first has to parse the paragraphs that it's dealing with before it can decide what to do about them. A heuristic could be developed that says "this chunk of data is a table". Once you've got that, if you're going to reformat it, you've got to know where the columns are. The OP's code may be able to serve both purposes. Maybe. :-) duff	[reply]
Re^4: Remove empty column(s) from unpack template by BrowserUk (Patriarch) on Jul 31, 2007 at 20:30 UTC
Re^5: Remove empty column(s) from unpack template by duff (Parson) on Aug 01, 2007 at 14:12 UTC