Re: how to identify a fixed width file

A record's type may be signaled by the characters it starts with, or by the length of the record.

Some of the brethren have boggled at fixed-width files that mix different record lengths, but I think we can make some sense of this, especially if the record type is signaled by the initial character.

use strict;
use warnings;

my %histogram;
my %records_of_length;
while (<DATA>) {
    my $record_length = length;    # length of $_, trailing newline included
    my $initial_char  = substr($_, 0, 1);
    $records_of_length{$record_length}++;
    $histogram{$record_length}{$initial_char}++;
}

# Review how many distinct record lengths were seen.
# If all records of given length start with same char,
# rejoice!
for my $rec_len (sort {$a <=> $b} keys %histogram) {
    print "Saw $records_of_length{$rec_len} records";
    print " with length $rec_len:\n";
    for my $char (sort keys %{$histogram{$rec_len}}) {
        print "\t$char: ";
        print $histogram{$rec_len}{$char}, "\n";
    }
}
__DATA__
C4498 John__ Smith___
I0023 widget 004   4.95
I0869 foozle 001  29.50
I7765 gadget 002 340.00
C5678 Mary__  Doe____
I9999 misc__ 003   6.25

prints

Saw 2 records with length 22:
        C: 2
Saw 4 records with length 24:
        I: 4

and now you can work on heuristics to decide if the number of different record types is small enough to usefully classify the file as "mixed fixed width".
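One possible shape for such a heuristic, as a rough sketch only. The helper name classify_histogram, the default cutoff of 5 record types, and the returned labels are all my own assumptions; the rule is deliberately strict: every record length must go with exactly one initial character, and no initial character may turn up under two lengths.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a histogram shaped like the one built above,
# $histogram{$record_length}{$initial_char} = count, and returns a label.
# $max_types is an assumed cutoff for "small enough"; tune it to taste.
sub classify_histogram {
    my ($histogram, $max_types) = @_;
    $max_types = 5 unless defined $max_types;

    my @lengths = keys %$histogram;
    return 'empty'           if @lengths == 0;
    return 'fixed width'     if @lengths == 1;
    return 'not fixed width' if @lengths > $max_types;

    # Strict rule (an assumption): one initial char per length,
    # and each initial char belongs to only one length.
    my %length_for_char;
    for my $len (@lengths) {
        my @chars = keys %{ $histogram->{$len} };
        return 'not fixed width' if @chars != 1;
        return 'not fixed width' if exists $length_for_char{ $chars[0] };
        $length_for_char{ $chars[0] } = $len;
    }
    return 'mixed fixed width';
}

# The counts reported by the run above.
my %histogram = (
    22 => { C => 2 },
    24 => { I => 4 },
);
print classify_histogram(\%histogram), "\n";    # mixed fixed width
```

A looser rule, say tolerating a small percentage of stray records per length, may suit real-world data better; the strict version is just the easiest one to reason about.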
