how to identify a fixed width file

ftumsh has asked for the wisdom of the Perl Monks concerning the following question:

Re: how to identify a fixed width file by Cody Pendant (Prior) on May 14, 2008 at 12:41 UTC
"The definition of a fixed width file being [...] records may be of different lengths" There's your problem.
Nobody says perl looks like line-noise any more; kids today don't know what line-noise IS ...
Re^2: how to identify a fixed width file by Anonymous Monk on May 14, 2008 at 14:29 UTC
Also... "Now I can look at a file by eye and say yes it's fixed width, so it should be possible to do so programmatically." Didn't the natural language recognition folks start out saying something very similar?
Re: how to identify a fixed width file by moritz (Cardinal) on May 14, 2008 at 12:04 UTC
The unix utility `file` is great for generally identifying file types. But as for your description of the "fixed width" file format: I just don't understand it, and the part that you showed in the example doesn't look very fixed width to me. Maybe you could show us a few samples of that file? (Real samples, where you can see patterns.) There's a nice trick to determine if something is fixed-width with delimiters: take a long string that consists of the delimiting character, and binary-AND it with many records. If the delimiting character is still there in some places, that is very likely a delimiter within a fixed-width record. (But since I don't understand your file format, I can't say if that trick is applicable here.)
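A minimal sketch of the AND trick described above, under a few assumptions: records are already chomped, the delimiter is a single character, and the sub name `delimiter_columns` is made up for illustration. One caveat worth knowing: ANDing raw record bytes against a run of spaces also leaves 0x20 behind for any byte that has that bit set (digits and lowercase letters, for instance), so this sketch first turns each record into per-column flag bytes and ANDs those instead.

```perl
use strict;
use warnings;

# Return the 0-based columns in which every record holds $delim.
# Records are assumed to be chomped; short records are padded with the
# delimiter so that missing trailing columns do not veto a column.
sub delimiter_columns {
    my ($delim, @records) = @_;

    my $width = 0;
    for my $rec (@records) {
        $width = length($rec) if length($rec) > $width;
    }

    my $mask = "\x01" x $width;    # start by assuming every column qualifies
    for my $rec (@records) {
        my $padded = $rec . ($delim x ($width - length($rec)));
        # One flag byte per column: 0x01 for the delimiter, 0x00 otherwise.
        my $flags = join '',
            map { $_ eq $delim ? "\x01" : "\x00" } split //, $padded;
        $mask &= $flags;           # string-wise bitwise AND
    }

    return grep { substr($mask, $_, 1) eq "\x01" } 0 .. $width - 1;
}

my @cols = delimiter_columns(' ', 'AAA 123 xyz', 'BB  45  q', 'CCC 6   zzz');
print "delimiter survives in columns: @cols\n";
```

For the three made-up records, the space survives in columns 3 and 7 only. On prose, the surviving set is usually empty once enough lines have been ANDed together.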
Re^2: how to identify a fixed width file by tachyon-II (Chaplain) on May 14, 2008 at 14:08 UTC
Here is an implementation of the binary AND algorithm (see Re^2: Fixed Position Column Records), as originally implemented by BrowserUk and modified by me. If @templ is empty or contains only one element, then the file is probably not fixed width.
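The code from that post is not reproduced in this thread, so what follows is only a sketch of the same idea, not BrowserUk's or tachyon-II's implementation. It assumes space-padded, chomped records; the sub name `guess_unpack_template` is hypothetical, and `@templ` is borrowed from the post purely as a variable name. Columns that are blank in every record are treated as field padding, and each run of data columns plus its trailing padding becomes one `A<n>` entry for `unpack`.

```perl
use strict;
use warnings;

# Build an unpack template from the columns that are blank in every record.
sub guess_unpack_template {
    my @records = @_;

    my $width = 0;
    for my $rec (@records) {
        $width = length($rec) if length($rec) > $width;
    }

    my $mask = "\x01" x $width;
    for my $rec (@records) {
        my $flags = sprintf '%-*s', $width, $rec;   # pad to full width
        $flags =~ tr/ /\x00/c;    # every non-space becomes 0x00
        $flags =~ tr/ /\x01/;     # every space becomes 0x01
        $mask &= $flags;          # 0x01 survives only in all-blank columns
    }

    my @templ;
    # Leading all-blank columns are skipped rather than made into a field.
    push @templ, 'x' . length($1) if $mask =~ /^(\x01+)/;
    # Each field is a run of data columns plus the blank columns after it.
    push @templ, 'A' . length($1) while $mask =~ /(\x00+\x01*)/g;
    return @templ;
}

my @templ = guess_unpack_template('AAA 123 xyz', 'BB  45  q', 'CCC 6   zzz');
print "template: @templ\n";
print join('|', unpack("@templ", 'BB  45  q')), "\n";
```

With only one entry in `@templ` (or none), no column was blank in every record, which matches the rule of thumb in the post above.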
Re^3: how to identify a fixed width file by ftumsh (Scribe) on May 14, 2008 at 15:26 UTC
Excellent. It's similar to moritz' suggestion, only with an example, which is always better for eejits like me. Thanks for that.
Re^2: how to identify a fixed width file by ftumsh (Scribe) on May 14, 2008 at 13:25 UTC
I think it may be easier to work out whether it's a prose file, i.e. one with plenty of words; if it is prose, then it isn't "fixed width". FWIW, I won't know if the file has delimiters. I'd rather not think about the comma-separated fixed-width-field files I have come across ... Here's the most awkward fixed file I can find. It looks fixed width practically straight away to my eye. The more trained observer will notice it's a weird variation of a Tradacoms EDI message. This is an example; I must point out that any computer-generated text file will be passed to my module, and it should have a good go at working out what it is.
STX 8888888888888 dfdfdf dfdf dfdfdfdfdfs sdfdff +d STXA TYP 0700 dfderf SRT 2323232323235 sdertryh aswedrfg gfrfgtgs fgt SRTAHigh Cross CRRtrR dfdeereeR dsdd SRTBLoRdoR d34 dfr SRTC 232323232 CRT 8888888888888 RUNELM RuRRlm sdsd sdsdsdsdsds sdsdsdd CRTAsdsdsdss sdsdsdR sdsdR sdy CRTBSystoR sdsdsdsdsdsdsR CRTCLE7 2NF RNA 0000 RNAA RNAB RNAC RNAR RNAE RNAF RNAG FIL 0002 0002 045450 000000 FRT 074550 070520 ACR 0000000000000 CLO 4545454545459 0750 CLOARuRRlm (BFllymRRF) (0750) CLOBURit2, rtrtrt rtrk trtril rtrk rtrRR rtRk rtFd CLOCBFllymRRF rtt2 rtA IRF wewee8 070508 070508 PYT wewewees wewewewewe wewewewe 034438 002500 000 002500 +000 RNAH0000 RNAI RNAJ RNAK RNAL RNAM RNAN RNAO ORR 5656566820 256562 070508 070508 266528 + 070508 ORRA000000000000002 0000000000000 0000000000000 0705 +08 ORRB 0000000000000 ORRC0000000000000 ORRR ILR 0000000000000 20922 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000022 0000000022000 RFch 00000000025000 RFch ILRC00000000300000 S 027500 0 URimFt - WhitR ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 22294 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000003 0000000003000 RFch 00000000025000 RFch ILRC00000000075000 S 027500 0 
URimFt - CrRFm ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 22270 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000003 0000000003000 RFch 00000000025000 RFch ILRC00000000075000 S 027500 0 URimFt - PiRk ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 22393 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000003 0000000003000 RFch 00000000025000 RFch ILRC00000000075000 S 027500 0 URimFt - BluR ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 CIA 000000 00000000000000 RNC 0000 RNCA RNCB RNCC RNCR RNCE RNCF RNCG STL S 027500 0000000004 000000005250 000000000000 000000000000 0000000 +00000 STLA000000000000 000000005250 000000000232 000000005228 000000000896 STLB000000006246 000000006024 TLR 0000000002 000000005250 000000000000 000000000000 000000000000 000 +000000000 TLRA000000005250 000000000232 000000005228 000000000896 000000006246 TLRB000000006024 CLO 5656565656567 0390 CLOAghghgh (ghghghghr) (0390) CLOBURit 3, ghghhg hgFd ghghil ghgk Oghgh gh CLOCRoRcFstRr gth ghE IRF 565629 070508 070508 PYT tytytyys tytytytyFl tytytyRs 070508 002500 000 002500 +000 RNAH0000 RNAI RNAJ RNAK RNAL RNAM RNAN RNAO ORR 3434343426 242342 070508 070508 266529 + 070508 ORRA000000000000002 0000000000000 0000000000000 0705 +08 ORRB 0000000000000 ORRC0000000000000 ORRR ILR 0000000000000 53652 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000002 0000000002000 RFch 00000000029900 RFch ILRC00000000059800 S 027500 0 ClFssic ShRll ShFpRd BFth Pillow Cr +RFm ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000029900 00000000000000 000000 
ILRF00000000000000 ILR 0000000000000 20922 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000006 0000000006000 RFch 00000000025000 RFch ILRC00000000250000 S 027500 0 URimFt - WhitR ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 22270 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000002 0000000002000 RFch 00000000025000 RFch ILRC00000000050000 S 027500 0 URimFt - PiRk ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 CIA 000000 00000000000000 RNC 0000 RNCA RNCB RNCC RNCR RNCE RNCF RNCG STL S 027500 0000000003 000000002598 000000000000 000000000000 0000000 +00000 STLA000000000000 000000002598 000000000066 000000002532 000000000443 STLB000000003042 000000002975 TLR 0000000002 000000002598 000000000000 000000000000 000000000000 000 +000000000 TLRA000000002598 000000000066 000000002532 000000000443 000000003042 TLRB000000002975
Re^3: how to identify a fixed width file by moritz (Cardinal) on May 14, 2008 at 15:49 UTC
There are a few criteria in this file that you can test for easily and fast:
1) Every line starts with three upper case ASCII letters (`m/^[A-Z]{3}/`)
2) Lines beginning with `R` always follow the pattern `m/^R[A-Z]{2}(?:[A-Z]|\s*\d+)$/`
3) Limited character set: the sample file contains only word characters (`\w`), whitespace and `[()-]`
I don't know if these characteristics are binding, but if they are, they could be used to identify such a file after reading some 20 or 50 lines.
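The three checks above can be strung together roughly as follows. This is a sketch only: the sub name `matches_sample_criteria` and the default of 50 lines are assumptions, and the patterns are copied from the post as written. A record such as `RNAH0000` from the posted sample would not match the second pattern, so expect to loosen the patterns against real files.

```perl
use strict;
use warnings;

# Return 1 if the first $limit lines all satisfy the three criteria, else 0.
sub matches_sample_criteria {
    my ($lines, $limit) = @_;
    $limit = 50 unless defined $limit;

    my @head = @$lines;
    splice(@head, $limit) if @head > $limit;   # keep only the first $limit
    return 0 unless @head;

    for my $line (@head) {
        (my $rec = $line) =~ s/\r?\n\z//;      # drop the line ending
        # 1) three upper case ASCII letters at the start
        return 0 unless $rec =~ m/^[A-Z]{3}/;
        # 2) lines starting with R follow the stricter pattern
        return 0 if $rec =~ m/^R/
                 && $rec !~ m/^R[A-Z]{2}(?:[A-Z]|\s*\d+)$/;
        # 3) limited character set
        return 0 unless $rec =~ m/^[\w\s()-]*$/;
    }
    return 1;
}

my @sample = ("STX 8888888888888 dfdfdf\n", "RNA 0000\n", "RNAA\n",
              "CLOAghghgh (ghghghghr) (0390)\n");
print matches_sample_criteria(\@sample) ? "looks like it\n" : "no match\n";
```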
Re: how to identify a fixed width file by Pancho (Pilgrim) on May 14, 2008 at 12:46 UTC
I think the key is figuring out the criteria by which you can test whether a file is fixed width, and that depends on your requirements. If the criteria are too broad, then the validity of the test will decrease to the point where the test is useless. A different approach would be to look for a certain pattern in the record identifier and record length, again depending on your requirements. For example: the first record starts with H, the second with D and the third with T. The pattern repeats, and all H records, D records and T records are the same length. Good Luck Pancho
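The record-identifier idea can be sketched like this, assuming the identifier is a fixed number of leading characters and the records are chomped. The sub names and the H/D/T test data are invented for illustration.

```perl
use strict;
use warnings;

# Map each record identifier to the set of record lengths seen for it.
sub lengths_by_record_type {
    my ($id_len, @records) = @_;
    my %lengths;    # identifier => { length => count }
    for my $rec (@records) {
        next if $rec eq '';                  # ignore blank lines
        my $id  = substr($rec, 0, $id_len);
        my $len = length($rec);
        $lengths{$id}{$len}++;
    }
    return \%lengths;
}

# True if at least one type was seen and every type has exactly one length.
sub every_type_has_one_length {
    my ($lengths) = @_;
    return 0 unless %$lengths;
    for my $id (keys %$lengths) {
        return 0 if scalar(keys %{ $lengths->{$id} }) != 1;
    }
    return 1;
}

my $seen = lengths_by_record_type(1,
    'H2008-05-14', 'D0001widget', 'D0002gadget', 'T0002');
print every_type_has_one_length($seen)
    ? "each record type has a single length\n"
    : "record types have mixed lengths\n";
```

A stricter test could also check that the H, D, T sequence repeats in order, but whether that applies depends on the format in question.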
Re: how to identify a fixed width file - do a histogram! by Narveson (Chaplain) on May 14, 2008 at 14:39 UTC
"Records of a particular length may be denoted by starting with particular characters or by the length of the record."
Some of the brethren have boggled at fixed-width files that mix different record lengths, but I think we can make some sense of this, especially if the record type is signaled by the initial character.
```perl
use strict;
use warnings;

my %histogram;
my %records_of_length;

while (<DATA>) {
    my $record_length = length;    # includes the trailing newline
    my $initial_char  = substr($_, 0, 1);
    $records_of_length{$record_length}++;
    $histogram{$record_length}{$initial_char}++;
}

# Review how many distinct record lengths were seen.
# If all records of given length start with same char,
# rejoice!
for my $rec_len (sort {$a <=> $b} keys %histogram) {
    print "Saw $records_of_length{$rec_len} records";
    print " with length $rec_len:\n";
    for my $char (sort keys %{$histogram{$rec_len}}) {
        print "\t$char: ";
        print $histogram{$rec_len}{$char}, "\n";
    }
}

__DATA__
C4498 John__ Smith___
I0023 widget 004   4.95
I0869 foozle 001  29.50
I7765 gadget 002 340.00
C5678 Mary__ Doe_____
I9999 misc__ 003   6.25
```
prints
```
Saw 2 records with length 22:
	C: 2
Saw 4 records with length 24:
	I: 4
```
and now you can work on heuristics to decide if the number of different record types is small enough to usefully classify the file as "mixed fixed width".
Re: how to identify a fixed width file by dragonchild (Archbishop) on May 14, 2008 at 13:56 UTC
The reason why XML, CSV, and other similar file formats were created was to address the inherent problems with fixed-width formats. The first formats were fixed width because they are very simple to work with. In essence, they are the serialization of an array of structs in C. So, marshalling one of those in C is really simple. Finding a given record when you know its index (10th, 1024th, etc) is very simple. Overwriting a given record is very simple. It's the ultimate in RAM-backed-to-disk. The only problem is that you have to know the mapping. If you don't know what a fixed-width format means, you're out of luck. And, furthermore, many fixed-width files have a header and, possibly, a footer. DBM::Deep's file format is a record-based format with two headers (the first is fixed, the second is variable). Good luck detecting that it's a DBM::Deep file without recognizing the first four bytes. Frankly, I'd do the following:
1) Is it XML, CSV, HTML, etc?
2) Is it a fixed-width format I recognize (PNG, JPG, DOC, XLS, etc)?
3) Punt.
Which, essentially, is what the file utility does.
My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
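To illustrate why finding a record by index is so simple once the mapping is known, here is a small sketch. The layout (5-character id, 10-character name, 6-character quantity, newline) is entirely made up, as are `$TEMPLATE`, `$RECORD_LEN` and `read_record`; an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical layout: 5-char id, 10-char name, 6-char quantity, newline.
my $TEMPLATE   = 'A5 A10 A6';
my $RECORD_LEN = 22;    # 5 + 10 + 6 data bytes plus "\n"

# Fetch record number $index (0-based) without reading anything before it.
sub read_record {
    my ($fh, $index) = @_;
    seek($fh, $index * $RECORD_LEN, 0) or return;
    my $buf;
    my $got = read($fh, $buf, $RECORD_LEN);
    return unless defined $got && $got == $RECORD_LEN;   # past the end
    return unpack($TEMPLATE, $buf);
}

my $data = '';
$data .= sprintf("%-5s%-10s%6d\n", @$_)
    for ['A0001', 'widget', 4], ['A0002', 'foozle', 29], ['A0003', 'gadget', 340];

open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my ($id, $name, $qty) = read_record($fh, 2);
print "id=$id name=$name qty=", $qty + 0, "\n";
```

Without `$TEMPLATE` and `$RECORD_LEN` the same bytes are just an opaque run of characters, which is the detection problem this thread is about.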
Re^2: how to identify a fixed width file by ftumsh (Scribe) on May 14, 2008 at 14:17 UTC
1) I do recognise XML, CSV etc already; the problem is with fixed width.
2) The formats mentioned are not text files so are of no relevance.
3) I'd rather not punt if possible, tho it seems I may have to.
My code atm uses File::MMagic to get the mime type. If it's a text file I then work out what sort of text file it is, i.e.
1) XML - uses mmagic and XML::LibXML
2) SAIFFE - regex
3) EDIFACT - regex
4) Tradacoms - regex
5) CSV - Text::CSV_XS
6) Fixed width - foobar
Re: how to identify a fixed width file by jhourcle (Prior) on May 14, 2008 at 16:41 UTC
First off, I don't know if I'd specifically call your format 'fixed width', as it doesn't match what I'm used to dealing with -- simple tabular data with lots of whitespace. I haven't had to deal with the formatting you're dealing with, but I could probably deal with whitespace-padded tabular data in a consistent manner. Although this will probably have some false negatives for the odd files that I deal with, I'd take some subset of the middle of the file (i.e., try to remove headers and footers), and then use something like BrowserUk's unpack mask generator to see if there are columns of consistent whitespace among columns of non-whitespace. Obviously, this is going to fail if you include the header or footer, and there's a good chance of it not matching multi-line records (but still fixed width) or files with sub-headings of substantial length. Many of the fixed-width files I deal with have various formatting quirks, but if yours are more consistent, it might be worthwhile. For the case where you don't have whitespace padding, but you do have data other than strings, you might be able to create masks of where there are numeric vs. alpha columns, and make your decision based on that. (It still wouldn't deal with the multi-line record issue, though.)
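The numeric-versus-alpha mask idea might look something like the sketch below. The class symbols (`9`, `a`, space, `?`) and the sub name `column_classes` are arbitrary choices for illustration, and the records are assumed to be chomped and already trimmed of headers and footers.

```perl
use strict;
use warnings;

# Classify every column across all records:
#   '9' = only digits, 'a' = only ASCII letters,
#   ' ' = only spaces, '?' = mixed or anything else.
sub column_classes {
    my @records = @_;

    my $width = 0;
    for my $rec (@records) {
        $width = length($rec) if length($rec) > $width;
    }

    my @class;
    for my $rec (@records) {
        my $padded = sprintf '%-*s', $width, $rec;   # pad short records
        for my $col (0 .. $width - 1) {
            my $ch   = substr($padded, $col, 1);
            my $kind = $ch =~ /[0-9]/    ? '9'
                     : $ch =~ /[A-Za-z]/ ? 'a'
                     : $ch eq ' '        ? ' '
                     :                     '?';
            if    (!defined $class[$col])  { $class[$col] = $kind }
            elsif ($class[$col] ne $kind)  { $class[$col] = '?'   }
        }
    }
    return join '', @class;
}

print column_classes('AB0012XY', 'CD0345ZZ', 'EF9999QQ'), "\n";
```

A mask made mostly of long `9` and `a` runs with few `?` columns hints at fixed fields; a mask that is nearly all `?` hints at free text. How many `?` columns to tolerate is a judgment call this sketch leaves open.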
Re: how to identify a fixed width file by reasonablekeith (Deacon) on May 14, 2008 at 15:33 UTC
Why don't you try running through the file counting up the number of times a line of a given length is seen...
```perl
my %line_count_by_length;
while (<DATA>) {
    my $line_length = length($_);
    $line_count_by_length{$line_length}++;
}
```
If any (or a sufficiently large portion of) those line counts represent a big percentage of the total line count, you could guess that the file is fixed width. Perhaps also weight that by how many different line lengths are represented in the file, compared to how many you might expect given the file's length?
--- my name's not Keith, and I'm not reasonable.
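One way to turn those counts into a number is to ask what fraction of the lines is covered by the few most common lengths. This is a sketch: the sub name `top_length_coverage` and the choice of `$k` are assumptions, and any cut-off applied to the result would need tuning against real files.

```perl
use strict;
use warnings;

# Fraction of lines whose length is among the $k most common lengths.
sub top_length_coverage {
    my ($k, @lines) = @_;
    return 0 unless @lines;

    my %line_count_by_length;
    $line_count_by_length{ length($_) }++ for @lines;

    # Most frequent lengths first.
    my @counts = sort { $b <=> $a } values %line_count_by_length;
    $k = @counts if $k > @counts;

    my $covered = 0;
    $covered += $_ for @counts[0 .. $k - 1];
    return $covered / @lines;
}

my @lines = (('x' x 10) x 6, ('y' x 20) x 3, 'z');
printf "top 1 length covers %.0f%% of lines\n", 100 * top_length_coverage(1, @lines);
printf "top 2 lengths cover %.0f%% of lines\n", 100 * top_length_coverage(2, @lines);
```

A file with mixed record types would need a larger `$k`, which is where comparing the number of distinct lengths against the file's size comes in.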
Re^2: how to identify a fixed width file by ftumsh (Scribe) on May 14, 2008 at 15:48 UTC
My initial stab was a count of record lengths, which was fine until the different-length files cropped up. I think bringing that back along with some analysis of the counts, along with tachyon-II/moritz' text OR, should go a long way to solving this.

