Lo,

I'm trying to identify various types of text file: XML, CSV, etc.

The idea is that the module is presented with a text file and works out what type it is.

The one file format I am having trouble with is fixed width. My definition of a fixed width file is:

1) A text file made up of records (i.e. LF or CRLF delimited)
2) Different records may be of different lengths
3) Records of a particular type may be denoted by starting with particular characters or by the length of the record

As you know, variants of the above are legion, so I only expect (hope) to catch a largish percentage.

The only test I have at the moment is whether every record has the same length and the file has already failed the tests for the other file types, i.e. I'm testing for fixed width after all else.
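A minimal sketch of that same-length test, written in Python purely as a language-neutral illustration. The function names `records` and `all_same_length` are hypothetical, and refusing to decide on fewer than two records is an assumption rather than part of the test described above.

```python
def records(text):
    """Split text into records on LF or CRLF, dropping one trailing empty record."""
    lines = text.split("\n")
    # Strip the CR left behind by CRLF line endings.
    lines = [ln[:-1] if ln.endswith("\r") else ln for ln in lines]
    if lines and lines[-1] == "":
        lines.pop()
    return lines


def all_same_length(text):
    """Return True when there are at least two records and all share one length."""
    recs = records(text)
    if len(recs) < 2:
        return False  # assumption: one record is too little evidence
    return len({len(r) for r in recs}) == 1
```

This only covers the simplest variant; files whose record types differ in length will fail it.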

Typically, in a simple case, a file will contain a header record followed by line records, and this pattern repeats down the file, e.g.

Hfoobar
L123456field2
L...
H...
L...
L...
etc
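One heuristic that might catch this header/line layout is to group records by their leading character and check that each group has a single length. The sketch below is in Python as a language-neutral illustration; the function name `looks_prefix_typed` and the parameters `prefix_len`, `max_types` and `min_share` are hypothetical, and their default values are guesses that would need tuning against real files.

```python
from collections import defaultdict


def looks_prefix_typed(lines, prefix_len=1, max_types=10, min_share=0.9):
    """Heuristic: records that share a leading prefix should share a length."""
    lengths_by_prefix = defaultdict(list)
    for line in lines:
        if line:  # ignore empty records
            lengths_by_prefix[line[:prefix_len]].append(len(line))

    # Assumption: a real file uses only a handful of record types.
    if len(lengths_by_prefix) > max_types:
        return False

    # A prefix seen once is trivially consistent, so only judge repeated ones.
    repeated = [v for v in lengths_by_prefix.values() if len(v) >= 2]
    total = sum(len(v) for v in repeated)
    if total == 0:
        return False

    consistent = sum(len(v) for v in repeated if len(set(v)) == 1)
    return consistent / total >= min_share
```

It would not help with files where the record type is signalled by length alone, and a record type that legitimately varies in length would count against the file.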

In a more complicated file, the header and line will be split across multiple records, e.g.

Hfield1field2
Ffield1field2 part of header still
Afield3 field4 still part of header

Now I can look at a file by eye and say yes it's fixed width, so it should be possible to do so programmatically.

The options I have so far:

1) Try to work out if it's fixed width
2) Say hey, we got this far so it's fixed width (will give false positives on random text files)
3) Work out if it's a text file containing prose; if it's not, it's fixed width

The text files my module will be presented with should be computer generated, so prose text is a mistake and should not happen too often. The whole point of this is to cut out humans having to identify a file. In other words, I don't expect it to catch every fixed width file.
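For option 3, one rough way to guess at prose is to measure what share of whitespace-separated tokens look like ordinary words. This is only a sketch in Python, as a language-neutral illustration; the name `looks_like_prose`, the `WORD` pattern and both thresholds are assumptions, and it is biased towards English text.

```python
import re

# A "word": letters, an optional apostrophe suffix, optional trailing punctuation.
WORD = re.compile(r"[A-Za-z]+(?:'[a-z]+)?[.,;:!?]?")


def looks_like_prose(text, min_word_share=0.8, min_tokens=20):
    """Heuristic: prose is mostly space-separated alphabetic words."""
    tokens = text.split()
    if len(tokens) < min_tokens:
        return False  # assumption: too little text to judge
    wordlike = sum(1 for t in tokens if WORD.fullmatch(t))
    return wordlike / len(tokens) >= min_word_share
```

A fixed width file made up mostly of free-text fields such as names and addresses could still be mistaken for prose, so this is better treated as one signal among several.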

So, all and any suggestions gratefully received.

John


In reply to how to identify a fixed width file by ftumsh
