in reply to how to identify a fixed width file

In a fixed-width file, a record's type may be signaled either by the character(s) it starts with or by its length.

Some of the brethren have boggled at fixed-width files that mix different record lengths, but I think we can make some sense of this, especially if the record type is signaled by the initial character.

use strict;
use warnings;

my %histogram;
my %records_of_length;

while (<DATA>) {
    my $record_length = length;    # note: counts the trailing newline too
    my $initial_char  = substr($_, 0, 1);
    $records_of_length{$record_length}++;
    $histogram{$record_length}{$initial_char}++;
}

# Review how many distinct record lengths were seen.
# If all records of given length start with same char,
# rejoice!
for my $rec_len (sort {$a <=> $b} keys %histogram) {
    print "Saw $records_of_length{$rec_len} records";
    print " with length $rec_len:\n";
    for my $char (sort keys %{$histogram{$rec_len}}) {
        print "\t$char: ";
        print $histogram{$rec_len}{$char}, "\n";
    }
}

__DATA__
C4498 John__ Smith___
I0023 widget 004   4.95
I0869 foozle 001  29.50
I7765 gadget 002 340.00
C5678 Mary__ Doe_____
I9999 misc__ 003   6.25

prints

Saw 2 records with length 22:
	C: 2
Saw 4 records with length 24:
	I: 4

and now you can work on heuristics to decide if the number of different record types is small enough to usefully classify the file as "mixed fixed width".
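
As a starting point for such a heuristic, here is a rough sketch, not a definitive test. It assumes a histogram shaped like the one built above (length, then initial character, then count). The sub name looks_mixed_fixed_width and the default cap of 5 record types are my own inventions for illustration; tune both to taste. It says "yes" only if there are few distinct lengths and each length pairs off with exactly one initial character.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether a length/initial-char histogram
# looks like a "mixed fixed width" file. The $max_types cap is an
# assumption made for illustration only.
sub looks_mixed_fixed_width {
    my ($histogram, $max_types) = @_;
    $max_types = 5 unless defined $max_types;

    my @lengths = keys %$histogram;
    return 0 if @lengths == 0 || @lengths > $max_types;

    my %seen_char;
    for my $len (@lengths) {
        my @chars = keys %{ $histogram->{$len} };
        # Each length should be tied to exactly one initial character...
        return 0 if @chars != 1;
        # ...and each initial character to exactly one length.
        return 0 if $seen_char{ $chars[0] }++;
    }
    return 1;
}

# The histogram from the sample data above.
my %histogram = (
    22 => { C => 2 },
    24 => { I => 4 },
);

print looks_mixed_fixed_width(\%histogram)
    ? "mixed fixed width\n"
    : "unclear\n";
```

Real files will be messier (a stray short line, a header record), so you may prefer to tolerate a small fraction of misfits rather than demand a perfect one-to-one match.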