Here are a couple more useful tips:
- Use hexdump or a similar tool to examine the contents of the file byte by byte. Don’t assume anything: a “blank space” could be tabs, spaces, or even characters that your tool treats as unprintable under its internationalization (I18N) settings. When you show us excerpts of such files, enclose them in <code> tags. Once you know exactly which bytes are there, you can write a program to split on any sort of bright-line rule.
- Once you think you have a bright-line rule, write a script to prove it. Take every assumption that you believe holds for the entire catalog of such files that you have, then write scripts that succeed only if those assumptions are correct and otherwise die with a meaningful message. Run those scripts against a broad cross-section of the files, and run them automatically against new files as they come in. (Sometimes you find that you are debugging not only the programs that you wrote to consume the files, but also the programs that other people wrote to produce them.)
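
To make the second tip concrete, here is a minimal sketch of such a checker in Python. The rules it enforces are invented for illustration and are not taken from your files: records end in a bare newline, fields are separated by a single tab, every record has three fields, and everything else is printable ASCII. The names `AssumptionError`, `check_record`, `check_file`, and `main` are hypothetical too. Substitute whatever rules you actually believe hold.

```python
import sys


class AssumptionError(Exception):
    """Raised when a file breaks a rule we believed to hold (hypothetical name)."""


def check_record(raw: bytes, line_no: int, expected_fields: int = 3) -> list[bytes]:
    """Return the fields of one record, or raise AssumptionError naming the broken rule.

    The rules here are illustrative assumptions, not facts about any real format.
    """
    if not raw.endswith(b"\n"):
        raise AssumptionError(f"line {line_no}: missing newline terminator: {raw!r}")
    body = raw[:-1]
    if b"\r" in body:
        raise AssumptionError(f"line {line_no}: carriage return found: {raw!r}")
    # Anything other than a tab or printable ASCII is reported by its hex value,
    # so "blank" characters that are not really spaces cannot hide.
    bad = [b for b in body if b != 0x09 and not 0x20 <= b <= 0x7E]
    if bad:
        raise AssumptionError(
            f"line {line_no}: unexpected byte(s) {[hex(b) for b in bad]}: {raw!r}"
        )
    fields = body.split(b"\t")
    if len(fields) != expected_fields:
        raise AssumptionError(
            f"line {line_no}: expected {expected_fields} fields, "
            f"got {len(fields)}: {raw!r}"
        )
    return fields


def check_file(path: str) -> int:
    """Check every record in the file and return how many were seen."""
    count = 0
    # Binary mode: no decoding and no newline translation, so we see the raw bytes.
    with open(path, "rb") as handle:
        for line_no, raw in enumerate(handle, start=1):
            check_record(raw, line_no)
            count += 1
    return count


def main(paths: list[str]) -> int:
    """Check each file; return a non-zero exit status if any assumption failed."""
    failed = False
    for path in paths:
        try:
            print(f"{path}: {check_file(path)} records OK")
        except AssumptionError as err:
            failed = True
            print(f"{path}: {err}", file=sys.stderr)
    return 1 if failed else 0


if __name__ == "__main__" and sys.argv[1:]:
    raise SystemExit(main(sys.argv[1:]))
```

Run it with the file names as arguments. Because it exits non-zero on the first broken assumption in any file, it is easy to hook into a cron job or an intake script so that new files are checked as they arrive. The error message prints the offending line as a Python bytes literal, which shows tabs and non-ASCII bytes as escapes rather than as invisible whitespace.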