PerlMonks  

how to identify a fixed width file

by ftumsh (Scribe)
on May 14, 2008 at 11:51 UTC ( [id://686498]=perlquestion )

ftumsh has asked for the wisdom of the Perl Monks concerning the following question:

Lo,

I'm trying to identify various types of text file, xml, csv etc.

The idea being that it is presented with a text file and it works out what type it is.

The one file format I am having trouble with is fixed width. The definition of a fixed width file being:

1) Text file made up of records (ie LF or CRLF delimited)
2) Different records may be of different lengths
3) Records of a particular length may be denoted by starting with particular characters or by the length of the record.

As you know, variants of the above are legion, so I only expect (hope) to get a largish percentage.

The only test I have at the moment is if the length of every record is the same and it's failed the tests for other file types, ie I'm testing for fixed width after all else.

Typically in a simple case a file will contain a header record followed by line records. This will repeat down the file. eg

Hfoobar
L123456field2
L...
H...
L...
L...
etc

In a more complicated file, the header and line will be split across multiple records eg

Hfield1field2
Ffield1field2 part of header still
Afield3 field4 still part of header

Now I can look at a file by eye and say yes it's fixed width, so it should be possible to do so programmatically.

The options I have up to press:

1) Try and work out if it's fixed width
2) Say hey, we got this far so it's fixed width (will give false positive on random text files)
3) Work out if it's a text file containing prose; if it's not, it's fixed width

The text files my module will be presented with should be computer generated, so prose text would be a mistake and should not happen too often. The whole point of this is to try and cut out humans trying to identify a file. In other words, I don't expect it to catch every fixed width file.

So, all and any suggestions gratefully received.

John

Replies are listed 'Best First'.
Re: how to identify a fixed width file
by Cody Pendant (Prior) on May 14, 2008 at 12:41 UTC
    The definition of a fixed width file being [...] records may be of different lengths
    There's your problem.


    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...
      Also...
      Now I can look at a file by eye and say yes it's fixed width, so it should be possible to do so programmatically.
      Didn't the natural language recognition folks start out saying something very similar?
Re: how to identify a fixed width file
by moritz (Cardinal) on May 14, 2008 at 12:04 UTC
    The unix utility file is great for generally identifying file types.

    But as for your description of the "fixed width" file format: I just don't understand it, and the part that you showed in the example doesn't look very fixed width to me.

    Maybe you could show us a few samples of that file? (Real samples, where you can see patterns)

    There's a nice trick to determine if something is fixed-width with delimiters: take a long string that consists of the delimiting character, and binary-AND it with many records. If the delimiting character is still there at some places, that is very likely a delimiter within a fixed-width record.

    (But since I don't understand your file format I can't say if that trick is applicable here).
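
    A minimal sketch of that trick for space-delimited columns. The helper name blank_columns and the sample records are made up for illustration. ANDing raw text with a space would let any byte that has the 0x20 bit set survive (lowercase letters and digits included), so the sketch first turns each record into a flag string, \xFF where the record has a space and \x00 elsewhere, and ANDs those flags instead.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based columns that contain a space
# in every record. Short records are padded with spaces on the right.
sub blank_columns {
    my @records = @_;
    return () unless @records;

    my ($width) = sort { $b <=> $a } map { length($_) } @records;
    my $mask = "\xFF" x $width;    # start with every column flagged

    for my $rec (@records) {
        my $flags = sprintf '%-*s', $width, $rec;
        # space -> \xFF, every other byte -> \x00
        $flags =~ tr/\x20\x00-\x1F\x21-\xFF/\xFF\x00/;
        $mask &= $flags;           # string-wise binary AND
    }

    my @columns;
    push @columns, pos($mask) - 1 while $mask =~ /\xFF/g;
    return @columns;
}

my @sample = ('AB 123 xyz', 'CD 456 pqr', 'EF 7   stu');
print join(',', blank_columns(@sample)), "\n";    # prints 2,6
```

    Columns that survive across many records are candidates for field boundaries; an empty result suggests the records do not share a column layout.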

      Here is an implementation of the binary AND algorithm, Re^2: Fixed Position Column Records, as originally implemented by BrowserUk and modified by me. If @templ is empty or only contains one element then the file is probably not fixed width.

        Excellent. It's similar to moritz' suggestion only with an example which is always better for eejits like me. Thanks for that.
      I think it may be easier to work out if it's a prose file, ie plenty of words; if it is prose then it isn't "fixed width". Fwiw, I won't know if the file has delimiters. I'd rather not think about the comma separated fixed width fields format files I have come across ... Here's the most awkward fixed file I can find. It looks fixed width practically straight away to my eye. The more trained observer will notice it's a weird variation of a tradacoms edi message. This is an example; I must point out that any computer generated text file will be passed to my module and it should have a good go of working out what it is.
      STX 8888888888888 dfdfdf dfdf dfdfdfdfdfs sdfdff +d STXA TYP 0700 dfderf SRT 2323232323235 sdertryh aswedrfg gfrfgtgs fgt SRTAHigh Cross CRRtrR dfdeereeR dsdd SRTBLoRdoR d34 dfr SRTC 232323232 CRT 8888888888888 RUNELM RuRRlm sdsd sdsdsdsdsds sdsdsdd CRTAsdsdsdss sdsdsdR sdsdR sdy CRTBSystoR sdsdsdsdsdsdsR CRTCLE7 2NF RNA 0000 RNAA RNAB RNAC RNAR RNAE RNAF RNAG FIL 0002 0002 045450 000000 FRT 074550 070520 ACR 0000000000000 CLO 4545454545459 0750 CLOARuRRlm (BFllymRRF) (0750) CLOBURit2, rtrtrt rtrk trtril rtrk rtrRR rtRk rtFd CLOCBFllymRRF rtt2 rtA IRF wewee8 070508 070508 PYT wewewees wewewewewe wewewewe 034438 002500 000 002500 +000 RNAH0000 RNAI RNAJ RNAK RNAL RNAM RNAN RNAO ORR 5656566820 256562 070508 070508 266528 + 070508 ORRA000000000000002 0000000000000 0000000000000 0705 +08 ORRB 0000000000000 ORRC0000000000000 ORRR ILR 0000000000000 20922 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000022 0000000022000 RFch 00000000025000 RFch ILRC00000000300000 S 027500 0 URimFt - WhitR ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 22294 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000003 0000000003000 RFch 00000000025000 RFch ILRC00000000075000 S 027500 0 URimFt - CrRFm ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 22270 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000003 0000000003000 RFch 00000000025000 RFch ILRC00000000075000 S 027500 0 URimFt - PiRk ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 22393 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000003 
0000000003000 RFch 00000000025000 RFch ILRC00000000075000 S 027500 0 URimFt - Blughghgh (ghghghghr) (0390) CLOBURit 3, ghghhg hgFd ghghil ghgk Oghgh gh CLOCRoRcFstRr gth ghE IRF 565629 070508 070508 PYT tytytyys tytytytyFl tytytyRs 070508 002500 000 002500 +000 RNAH0000 RNAI RNAJ RNAK RNAL RNAM RNAN RNAO ORR 3434343426 242342 070508 070508 266529 + 070508 ORRA000000000000002 0000000000000 0000000000000 0705 +08 ORRB 0000000000000 ORRC0000000000000 ORRR ILR 0000000000000 53652 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000002 0000000002000 RFch 00000000029900 RFch ILRC00000000059800 S 027500 0 ClFssic ShRll ShFpRd BFth Pillow Cr +RFm ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000029900 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 20922 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000006 0000000006000 RFch 00000000025000 RFch ILRC00000000250000 S 027500 0 URimFt - WhitR ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 22270 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000002 0000000002000 RFch 00000000025000 RFch ILRC00000000050000 S 027500 0 URimFt - PiRk
        There are a few criteria in this file that you can test for easily and fast:
        • Every line starts with three upper case ASCII letters (m/^[A-Z]{3}/)
        • Lines beginning with R always follow the pattern (m/^R[A-Z]{2}(?:[A-Z]|\s*\d+)$/)
        • limited character set: The sample file contains only word characters (\w), whitespace and [()-]

        I don't know if these characteristics are binding, but if they are they could be used to identify such a file after reading some 20 or 50 lines.
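
        As a rough sketch, the first and third checks could be applied to the first few dozen lines like this. The function name passes_sample_checks and the default of 50 lines are assumptions, and the character set is exactly the one listed above, so it would need widening if real files carry other punctuation. The rule for lines beginning with R could be added in the same way.

```perl
use strict;
use warnings;

# Hypothetical helper: check up to $limit lines (default 50) and report
# whether every one starts with a three-letter tag and stays inside the
# character set given in the reply.
sub passes_sample_checks {
    my ($lines, $limit) = @_;
    $limit = 50 unless defined $limit;

    my $checked = 0;
    for my $line (@$lines) {
        last if $checked >= $limit;
        $checked++;
        (my $text = $line) =~ s/\r?\n\z//;          # drop LF or CRLF
        return 0 unless $text =~ /^[A-Z]{3}/;       # three upper case letters
        return 0 if     $text =~ /[^\w\s()-]/;      # character outside the set
    }
    return $checked > 0 ? 1 : 0;                    # empty input never passes
}
```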

Re: how to identify a fixed width file
by Pancho (Pilgrim) on May 14, 2008 at 12:46 UTC

    I think the key is figuring out the criteria by which you can test whether a file is fixed width, and that depends on your requirements. If the criteria are too broad then the validity of the test will decrease to the point where the test is useless.

    A different approach would be to look for a certain pattern in the record identifier and record length, again depending on your requirements. For example:

    First record starts with H, second with D, third with T. The pattern repeats and all H records, D records and T records are the same length.
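
    A small sketch of the length half of that test, assuming the record type is the first character of each record (chomped beforehand). Both function names are invented for the example, and it does not check that the H, D, T sequence repeats.

```perl
use strict;
use warnings;

# Hypothetical helper: group records by their first character and
# collect the distinct lengths seen for each group.
sub lengths_by_type {
    my @records = @_;
    my %lengths;
    for my $rec (@records) {
        next unless length $rec;                      # skip empty records
        $lengths{ substr($rec, 0, 1) }{ length($rec) } = 1;
    }
    return \%lengths;
}

# True when at least one record type exists and every type has exactly
# one record length.
sub types_have_fixed_length {
    my $lengths = lengths_by_type(@_);
    return 0 unless %$lengths;
    for my $type (keys %$lengths) {
        return 0 if scalar(keys %{ $lengths->{$type} }) > 1;
    }
    return 1;
}
```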

    Good Luck

    Pancho
Re: how to identify a fixed width file - do a histogram!
by Narveson (Chaplain) on May 14, 2008 at 14:39 UTC
    Records of a particular length may be denoted by starting with particular characters or by the length of the record.

    Some of the brethren have boggled at fixed-width files that mix different record lengths, but I think we can make some sense of this, especially if the record type is signaled by the initial character.

    use strict;
    use warnings;

    my %histogram;
    my %records_of_length;

    while (<DATA>) {
        my $record_length = length;
        my $initial_char  = substr($_, 0, 1);
        $records_of_length{$record_length}++;
        $histogram{$record_length}{$initial_char}++;
    }

    # Review how many distinct record lengths were seen.
    # If all records of given length start with same char,
    # rejoice!
    for my $rec_len (sort {$a <=> $b} keys %histogram) {
        print "Saw $records_of_length{$rec_len} records";
        print " with length $rec_len:\n";
        for my $char (sort keys %{$histogram{$rec_len}}) {
            print "\t$char: ";
            print $histogram{$rec_len}{$char}, "\n";
        }
    }

    __DATA__
    C4498 John__ Smith___
    I0023 widget 004   4.95
    I0869 foozle 001  29.50
    I7765 gadget 002 340.00
    C5678 Mary__ Doe____
    I9999 misc__ 003   6.25

    prints

    Saw 2 records with length 22:
            C: 2
    Saw 4 records with length 24:
            I: 4

    and now you can work on heuristics to decide if the number of different record types is small enough to usefully classify the file as "mixed fixed width".

Re: how to identify a fixed width file
by dragonchild (Archbishop) on May 14, 2008 at 13:56 UTC
    The reason why XML, CSV, and other similar file formats were created was to address the inherent problems with fixed width formats. The first formats were fixed width because they are very simple to work with. In essence, they are the serialization of an array of structs in C. So, marshalling one of those in C is really simple. Finding a given record when you know its index (10th, 1024th, etc) is very simple. Overwriting a given record is very simple. It's the ultimate in RAM-backed-to-disk. The only problem is that you have to know the mapping. If you don't know what a fixed-width format means, you're out of luck.

    And, furthermore, many fixed-width files have a header and, possibly, a footer. DBM::Deep's file format is a record-based format with two headers (first is fixed, second is variable). Good luck detecting that it's a DBM::Deep file without recognizing the first four bytes.

    Frankly, I'd do the following:

    1. Is it XML, CSV, HTML, etc?
    2. Is it a fixed-width format I recognize (PNG, JPG, DOC, XLS, etc)?
    3. Punt.
    Which, essentially, is what the file utility does.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      1) I do recognise XML, CSV etc already, the problem is with fixed width.
      
      2) The formats mentioned are not text files so are of no relevance.
      
      3) I'd rather not punt if possible, tho it seems I may have to.
      
      My code atm uses File::MMagic to get the mime type. If it's a text file I then work out what sort of text file it is ie
      
      1) XML - uses mmagic and XML::LibXML
      2) SAIFFE - regex
      3) EDIFACT - regex
      4) Tradacoms - regex
      5) CSV - Text::CSV_XS
      6) Fixed width - foobar
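
      The ordering matters because a looser test would claim files that an earlier, stricter test should have caught. A toy sketch of such a cascade follows; the three tests are crude stand-ins written for this example, not the File::MMagic, XML::LibXML or Text::CSV_XS checks listed above, and every name in it is hypothetical.

```perl
use strict;
use warnings;

# Placeholder test: text starts with a '<' after optional whitespace.
sub looks_like_xml { return $_[0] =~ /\A\s*</ ? 1 : 0 }

# Placeholder test: every line contains at least one comma.
sub looks_like_csv {
    my @lines = split /\n/, $_[0];
    return 0 unless @lines;
    return (grep { index($_, ',') < 0 } @lines) ? 0 : 1;
}

# Placeholder test: every line has the same length.
sub all_same_length {
    my %seen;
    $seen{ length($_) } = 1 for split /\n/, $_[0];
    return scalar(keys %seen) == 1 ? 1 : 0;
}

# Ordered list of [label, test]; fixed width is deliberately last.
my @detectors = (
    [ 'XML'         => \&looks_like_xml  ],
    [ 'CSV'         => \&looks_like_csv  ],
    [ 'Fixed width' => \&all_same_length ],
);

# Return the label of the first test that accepts the text.
sub identify {
    my ($text) = @_;
    for my $detector (@detectors) {
        my ($label, $test) = @$detector;
        return $label if $test->($text);
    }
    return 'unknown';
}
```

      With this ordering "a,b\nc,d\n" is reported as CSV even though its lines are all the same length.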
      
Re: how to identify a fixed width file
by jhourcle (Prior) on May 14, 2008 at 16:41 UTC

    First off, I don't know if I'd specifically call your format 'fixed width', as it doesn't match what I'm used to dealing with -- simple tabular data with lots of whitespace. I haven't had to deal with the formatting you're dealing with, but I could probably deal with whitespace padded tabular data in a consistent manner.

    Although this probably will have some false negatives for the odd files that I deal with, I'd probably take some subset of the middle of the file (ie, try to remove headers and footers), and then use something like BrowserUk's unpack mask generator to see if there are columns of consistent whitespace among columns of non-whitespace.

    Obviously, this is going to fail if you include the header or footer, and there's a good chance of it not matching multiline records (but still fixed width) or if there are sub-headings of substantial length. Many of the fixed-width files I deal with have various formatting quirks, but if yours are more consistent, it might be worthwhile.

    For the case where you don't have whitespace padding, but you do have data other than strings, you might be able to create masks of where there are numeric vs. alpha columns, and make your decision based on that. (Still wouldn't deal with the multi-line record issue, though.)
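
    One way such a mask might be built, as a sketch only: the helper name column_class_mask and the four mask characters are choices made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: build one mask character per column.
#   9 = only digits seen     A = only letters seen
#   ' ' = only spaces seen   ? = anything else, or a mix of classes
sub column_class_mask {
    my @records = @_;
    my @mask;
    for my $rec (@records) {
        my @chars = split //, $rec;
        for my $i (0 .. $#chars) {
            my $class = $chars[$i] =~ /[0-9]/    ? '9'
                      : $chars[$i] =~ /[A-Za-z]/ ? 'A'
                      : $chars[$i] eq ' '        ? ' '
                      :                            '?';
            $mask[$i] = $class unless defined $mask[$i];
            $mask[$i] = '?' if $mask[$i] ne $class;   # classes disagree
        }
    }
    return join '', @mask;
}

print column_class_mask('AB123 x', 'CD456 y', 'EF78Z z'), "\n";   # prints AA99? A
```

    A mask with few ? columns over a decent number of records hints at a positional layout; how few is few enough is left open here.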

Re: how to identify a fixed width file
by reasonablekeith (Deacon) on May 14, 2008 at 15:33 UTC
    Why don't you try running through the file counting up the number of times a line of a given length is seen...
    my %line_count_by_length;
    while (<DATA>) {
        my $line_length = length($_);
        $line_count_by_length{$line_length}++;
    }
    If any (or a sufficiently large portion of) those line counts represent a big percentage of the total line count, you could make a guess that the file was fixed width. Perhaps also giving a weighting on how many different line lengths are represented in the file, compared to how many you might expect given the file's length?
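
    To turn those counts into a number you can threshold, something along these lines might do. The name dominant_length_share and the idea of summing the top few lengths are assumptions for the sketch, and any cut-off (say 0.9) would have to be tuned against real files.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of lines accounted for by the $top most
# common line lengths (default 1). Returns 0 for empty input.
sub dominant_length_share {
    my ($lines, $top) = @_;
    return 0 unless @$lines;
    $top = 1 unless defined $top;

    my %count;
    $count{ length($_) }++ for @$lines;

    my @by_count = sort { $b <=> $a } values %count;
    $top = @by_count if $top > @by_count;

    my $covered = 0;
    $covered += $_ for @by_count[ 0 .. $top - 1 ];
    return $covered / @$lines;
}
```

    For a file with header and line records of two different lengths, dominant_length_share($lines, 2) close to 1 would point towards fixed width, while prose tends to spread over many lengths.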
    ---
    my name's not Keith, and I'm not reasonable.

      My initial stab was a count of record lengths which was fine until the different length files cropped up.

      I think bringing that back along with some analysis of the counts, along with tachyon/moritz' text OR, should go a long way to solving this.

Node Type: perlquestion [id://686498]
Approved by citromatik