A record's type may be signaled by its initial character or by its length.

Some of the brethren have boggled at fixed-width files that mix different record lengths, but I think we can make some sense of this, especially if the record type is signaled by the initial character.

use strict;
use warnings;

my %histogram;
my %records_of_length;

while (<DATA>) {
    # length() counts the trailing newline, so each reported
    # length is one more than the visible record width.
    my $record_length = length;
    my $initial_char  = substr($_, 0, 1);
    $records_of_length{$record_length}++;
    $histogram{$record_length}{$initial_char}++;
}

# Review how many distinct record lengths were seen.
# If all records of given length start with same char,
# rejoice!
for my $rec_len (sort {$a <=> $b} keys %histogram) {
    print "Saw $records_of_length{$rec_len} records";
    print " with length $rec_len:\n";
    for my $char (sort keys %{$histogram{$rec_len}}) {
        print "\t$char: ";
        print $histogram{$rec_len}{$char}, "\n";
    }
}

__DATA__
C4498 John__ Smith___
I0023 widget 004   4.95
I0869 foozle 001  29.50
I7765 gadget 002 340.00
C5678 Mary__ Doe_____
I9999 misc__ 003   6.25

prints

Saw 2 records with length 22:
	C: 2
Saw 4 records with length 24:
	I: 4

and now you can work on heuristics to decide if the number of different record types is small enough to usefully classify the file as "mixed fixed width".
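One such heuristic might look like the sketch below. It is only a starting point, not a tested classifier: the sub name looks_mixed_fixed_width and both thresholds are made up for illustration, and it assumes a histogram shaped like %histogram above (record length, then initial character, then count).

```perl
use strict;
use warnings;

# Hypothetical thresholds; tune them against real files.
my $MAX_RECORD_TYPES = 5;      # allow at most this many distinct record lengths
my $MIN_PURITY       = 0.95;   # share of a length's records led by its commonest char

# Takes a hashref shaped like { record_length => { initial_char => count } }.
# Returns 1 if the file looks like "mixed fixed width", 0 otherwise.
sub looks_mixed_fixed_width {
    my ($histogram) = @_;

    my @lengths = keys %$histogram;
    return 0 if @lengths == 0 || @lengths > $MAX_RECORD_TYPES;

    for my $rec_len (@lengths) {
        # Counts for this length, largest first.
        my @counts = sort { $b <=> $a } values %{ $histogram->{$rec_len} };
        my $total  = 0;
        $total += $_ for @counts;
        return 0 unless $total;

        # Reject if no single initial char dominates this length.
        return 0 if $counts[0] / $total < $MIN_PURITY;
    }
    return 1;
}

# The histogram produced by the sample data above.
my %sample = (22 => { C => 2 }, 24 => { I => 4 });
print looks_mixed_fixed_width(\%sample)
    ? "mixed fixed width\n"
    : "not obviously fixed width\n";
```

With the sample histogram this reports "mixed fixed width", since there are only two lengths and each is led by a single character. A real file with stray blank or truncated lines would probably need a looser purity threshold, or a pass that discards very rare lengths first.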


In reply to Re: how to identify a fixed width file - do a histogram! by Narveson
in thread how to identify a fixed width file by ftumsh
