in reply to Re: how to identify a fixed width file
in thread how to identify a fixed width file

I think it may be easier to work out whether it's a prose file, i.e. one with plenty of words; if it is prose, then it isn't "fixed width". FWIW, I won't know in advance whether the file has delimiters. I'd rather not think about the comma-separated, fixed-width-field files I have come across ...

Here's the most awkward fixed-width file I can find. It looks fixed width practically straight away to my eye. The more trained observer will notice it's a weird variation of a TRADACOMS EDI message. This is only an example: any computer-generated text file may be passed to my module, and it should make a good attempt at working out what it is.
STX 8888888888888 dfdfdf dfdf dfdfdfdfdfs sdfdff +d STXA TYP 0700 dfderf SRT 2323232323235 sdertryh aswedrfg gfrfgtgs fgt SRTAHigh Cross CRRtrR dfdeereeR dsdd SRTBLoRdoR d34 dfr SRTC 232323232 CRT 8888888888888 RUNELM RuRRlm sdsd sdsdsdsdsds sdsdsdd CRTAsdsdsdss sdsdsdR sdsdR sdy CRTBSystoR sdsdsdsdsdsdsR CRTCLE7 2NF RNA 0000 RNAA RNAB RNAC RNAR RNAE RNAF RNAG FIL 0002 0002 045450 000000 FRT 074550 070520 ACR 0000000000000 CLO 4545454545459 0750 CLOARuRRlm (BFllymRRF) (0750) CLOBURit2, rtrtrt rtrk trtril rtrk rtrRR rtRk rtFd CLOCBFllymRRF rtt2 rtA IRF wewee8 070508 070508 PYT wewewees wewewewewe wewewewe 034438 002500 000 002500 +000 RNAH0000 RNAI RNAJ RNAK RNAL RNAM RNAN RNAO ORR 5656566820 256562 070508 070508 266528 + 070508 ORRA000000000000002 0000000000000 0000000000000 0705 +08 ORRB 0000000000000 ORRC0000000000000 ORRR ILR 0000000000000 20922 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000022 0000000022000 RFch 00000000025000 RFch ILRC00000000300000 S 027500 0 URimFt - WhitR ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 22294 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000003 0000000003000 RFch 00000000025000 RFch ILRC00000000075000 S 027500 0 URimFt - CrRFm ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 22270 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000003 0000000003000 RFch 00000000025000 RFch ILRC00000000075000 S 027500 0 URimFt - PiRk ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 22393 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000003 0000000003000 RFch 
00000000025000 RFch ILRC00000000075000 S 027500 0 URimFt - BluR ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 CIA 000000 00000000000000 RNC 0000 RNCA RNCB RNCC RNCR RNCE RNCF RNCG STL S 027500 0000000004 000000005250 000000000000 000000000000 0000000 +00000 STLA000000000000 000000005250 000000000232 000000005228 000000000896 STLB000000006246 000000006024 TLR 0000000002 000000005250 000000000000 000000000000 000000000000 000 +000000000 TLRA000000005250 000000000232 000000005228 000000000896 000000006246 TLRB000000006024 CLO 5656565656567 0390 CLOAghghgh (ghghghghr) (0390) CLOBURit 3, ghghhg hgFd ghghil ghgk Oghgh gh CLOCRoRcFstRr gth ghE IRF 565629 070508 070508 PYT tytytyys tytytytyFl tytytyRs 070508 002500 000 002500 +000 RNAH0000 RNAI RNAJ RNAK RNAL RNAM RNAN RNAO ORR 3434343426 242342 070508 070508 266529 + 070508 ORRA000000000000002 0000000000000 0000000000000 0705 +08 ORRB 0000000000000 ORRC0000000000000 ORRR ILR 0000000000000 53652 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000002 0000000002000 RFch 00000000029900 RFch ILRC00000000059800 S 027500 0 ClFssic ShRll ShFpRd BFth Pillow Cr +RFm ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000029900 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 20922 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000006 0000000006000 RFch 00000000025000 RFch ILRC00000000250000 S 027500 0 URimFt - WhitR ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 00000000025000 00000000000000 000000 ILRF00000000000000 ILR 0000000000000 22270 00000000000000 000000 +0000000 ILRA000000000000000 000000000000002 000 +0000000000 ILRB 000000000000002 0000000002000 RFch 00000000025000 RFch ILRC00000000050000 S 027500 0 URimFt - PiRk ILRR 00000000000000 0000000000 +0000 ILRE00000000000000 00000000000000 
00000000025000 00000000000000 000000 ILRF00000000000000 CIA 000000 00000000000000 RNC 0000 RNCA RNCB RNCC RNCR RNCE RNCF RNCG STL S 027500 0000000003 000000002598 000000000000 000000000000 0000000 +00000 STLA000000000000 000000002598 000000000066 000000002532 000000000443 STLB000000003042 000000002975 TLR 0000000002 000000002598 000000000000 000000000000 000000000000 000 +000000000 TLRA000000002598 000000000066 000000002532 000000000443 000000003042 TLRB000000002975

Re^3: how to identify a fixed width file
by moritz (Cardinal) on May 14, 2008 at 15:49 UTC
    There are a few criteria in this file that you can test for easily and quickly:
    • Every line starts with three upper case ASCII letters (m/^[A-Z]{3}/)
    • Lines beginning with R always follow the pattern m/^R[A-Z]{2}(?:[A-Z]|\s*\d+)$/
    • Limited character set: the sample file contains only word characters (\w), whitespace and the characters [()-]

    I don't know if these characteristics are binding, but if they are, they could be used to identify such a file after reading some 20 to 50 lines.
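
As a rough sketch, the three checks above could be combined along these lines. The sub name `looks_like_sample_format` and the 50-line default are made up for illustration. The regexes are taken as listed, so they would need loosening if real files contain other punctuation or other R-line layouts.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): returns 1 if the first
# $max_lines non-blank lines of $text pass all three checks, else 0.
sub looks_like_sample_format {
    my ($text, $max_lines) = @_;
    $max_lines = 50 unless defined $max_lines;    # assumed default

    # Split into lines and ignore blank ones.
    my @lines = grep { /\S/ } split /\r?\n/, $text;
    return 0 unless @lines;

    # Only inspect the first $max_lines lines.
    splice @lines, $max_lines if @lines > $max_lines;

    for my $line (@lines) {
        $line =~ s/\s+$//;    # trailing padding should not break the $ anchor

        # 1. Every line starts with three upper-case ASCII letters.
        return 0 unless $line =~ /^[A-Z]{3}/;

        # 2. Lines beginning with R: a four-letter tag, or a
        #    three-letter tag followed by a number.
        return 0
            if $line =~ /^R/
            && $line !~ /^R[A-Z]{2}(?:[A-Z]|\s*\d+)$/;

        # 3. Limited character set: word characters, whitespace, ( ) -
        return 0 unless $line =~ /^[\w\s()-]*$/;
    }
    return 1;
}

my $sample = join "\n",
    'STX 8888888888888 dfdfdf',
    'RNA 0000',
    'RNAA',
    'CLOARuRRlm (BFllymRRF) (0750)';

print looks_like_sample_format($sample) ? "candidate\n" : "no match\n";
```

Returning at the first failing line keeps the check cheap, which matters if it is only one of several format probes run on every incoming file.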