A record's type may be signaled either by its initial character or by its length.
Some of the brethren have boggled at fixed-width files that mix different record lengths, but I think we can make some sense of this, especially if the record type is signaled by the initial character.
use strict;
use warnings;

my %histogram;
my %records_of_length;

while (<DATA>) {
    # No chomp, so length counts the trailing newline too.
    my $record_length = length;
    my $initial_char  = substr($_, 0, 1);
    $records_of_length{$record_length}++;
    $histogram{$record_length}{$initial_char}++;
}

# Review how many distinct record lengths were seen.
# If all records of given length start with same char,
# rejoice!
for my $rec_len (sort {$a <=> $b} keys %histogram) {
    print "Saw $records_of_length{$rec_len} records";
    print " with length $rec_len:\n";
    for my $char (sort keys %{$histogram{$rec_len}}) {
        print "\t$char: ";
        print $histogram{$rec_len}{$char}, "\n";
    }
}

__DATA__
C4498 John__ Smith___
I0023 widget 004   4.95
I0869 foozle 001  29.50
I7765 gadget 002 340.00
C5678 Mary__ Doe_____
I9999 misc__ 003   6.25
prints
Saw 2 records with length 22:
	C: 2
Saw 4 records with length 24:
	I: 4
and now you can work on heuristics to decide if the number of different record types is small enough to usefully classify the file as "mixed fixed width".
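One such heuristic might look like the sketch below. It is only a starting point: the classify sub, its return strings, and the $max_types threshold (defaulting to 5) are all my own assumptions for illustration. It calls a file "mixed fixed width" when there are only a few distinct record lengths and every length goes with exactly one initial character.

```perl
use strict;
use warnings;

# Hypothetical heuristic: classify a file from a histogram shaped like
# $histogram{$record_length}{$initial_char} = count.
# $max_types is an assumed threshold; tune it for your own data.
sub classify {
    my ($histogram, $max_types) = @_;
    $max_types //= 5;

    my @lengths = keys %$histogram;
    return 'empty'           unless @lengths;
    return 'fixed width'     if @lengths == 1;
    return 'not fixed width' if @lengths > $max_types;

    # Each record length should be tied to exactly one initial character.
    for my $len (@lengths) {
        my @chars = keys %{ $histogram->{$len} };
        return 'not fixed width' if @chars != 1;
    }
    return 'mixed fixed width';
}

# The histogram from the sample data: two C records, four I records.
my %histogram = (
    22 => { C => 2 },
    24 => { I => 4 },
);
print classify(\%histogram), "\n";    # mixed fixed width
```

A stricter version could also insist that no initial character shows up under two different lengths, or tolerate a small percentage of stray records.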
In reply to Re: how to identify a fixed width file - do a histogram!
by Narveson
in thread how to identify a fixed width file
by ftumsh