Parsing irregularly delimited data

space_cadet has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have some very large ugly data files with irregularly delimited (or non-delimited) elements. The files look like this:

Bob's Company   VA chickens and cows April 23, 2003 2365
Elizabeth P. Jones Inc.WY   widgets  February 4, 2003 4
Big Huge CorporationUSAserversworkstationsrouters April 17, 2003 99999

After some manipulation, I've been able to parse the lines by splitting on runs of two or more whitespace characters:

while (<READFILE>) {
    chomp;
    # split takes no /g modifier; the limit of 7 caps the number of fields
    my @columns = split /\s{2,}/, $_, 7;
}

I was wondering if anyone can think of a better way to do this. It took a great deal of work to reliably insert two or more whitespaces between valid data elements. Also, the files are quite large, and what if I wasn't assured of having a fixed number of elements?

Edit by tye, remove BR tags, use CODE tags so extra spaces are visible


Re: Parsing irregularly delimited data by Aragorn (Curate) on Apr 29, 2003 at 20:31 UTC
With (mostly) irregularly formatted data, there is always a portion that can't be chopped up neatly. Try to match most of the lines with a `split` or a regular expression, and redirect the non-matching lines to another file so you can examine them by hand or run some other program over them.

Arjen
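A minimal sketch of that match-and-redirect idea, assuming a hypothetical record layout of name, two-letter state code, product text, "Month D, YYYY" date, and a trailing count. The pattern `$record_re` and the helpers `parse_line` and `triage` are illustrative names, not anything from the thread, and the rejects are printed here where a real run would write them to a separate file.

```perl
use strict;
use warnings;

# Hypothetical pattern (an assumption about the layout, adjust to the real data):
# name, two-letter state code, product text, date, trailing integer.
my $record_re = qr/^(.*?)\s*([A-Z]{2})\s+(.*?)\s+(\w+\s+\d{1,2},\s+\d{4})\s+(\d+)\s*$/;

# Returns the captured fields, or an empty list if the line does not fit.
sub parse_line {
    my ($line) = @_;
    return $line =~ $record_re;
}

# Sort lines into parsed records and rejects kept for later inspection.
sub triage {
    my (@lines) = @_;
    my (@parsed, @rejects);
    for my $line (@lines) {
        my @fields = parse_line($line);
        if (@fields) { push @parsed,  \@fields }
        else         { push @rejects, $line }
    }
    return (\@parsed, \@rejects);
}

# The three sample lines from the original question.
my @sample = (
    "Bob's Company   VA chickens and cows April 23, 2003 2365",
    "Elizabeth P. Jones Inc.WY   widgets  February 4, 2003 4",
    "Big Huge CorporationUSAserversworkstationsrouters April 17, 2003 99999",
);

my ($parsed, $rejects) = triage(@sample);
print join(' | ', @$_), "\n" for @$parsed;
print "REJECT: $_\n" for @$rejects;
```

With this particular pattern the first two sample lines parse and the third, which has no whitespace after its state code, lands in the reject list for a closer look.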
Re: Parsing irregularly delimited data by cLive ;-) (Prior) on Apr 29, 2003 at 20:22 UTC
It might help if you give an example of what you want the data to look like when extracted! A regular expression might be a better idea. Hint: `[A-Z]{2} \w+\s+\d{1,2},\s+\d{4}\s+\d+`

.02 cLive ;-)
Re: Re: Parsing irregularly delimited data by space_cadet (Initiate) on Apr 29, 2003 at 20:54 UTC
Thanks. A regular expression is probably the way to go when one doesn't know the number of elements. For eventual processing, all I need is for the data to be nicely delimited, like this: "Bob's Company", "VA", "chickens and cows", "April 23 2003", 2365
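Once the fields have been separated by whatever means, producing that delimited form is the easy part. A rough sketch, assuming the fields are already in a list; `quote_field` and `to_delimited` are made-up helper names, and for real output a module such as Text::CSV from CPAN is likely the sturdier choice.

```perl
use strict;
use warnings;

# Quote a field unless it is a plain integer; double any embedded quotes.
sub quote_field {
    my ($field) = @_;
    return $field if $field =~ /^\d+$/;
    (my $escaped = $field) =~ s/"/""/g;
    return qq{"$escaped"};
}

# Join already-separated fields into one comma-delimited line.
sub to_delimited {
    my (@fields) = @_;
    return join ', ', map { quote_field($_) } @fields;
}

# Fields as they might come out of an earlier split or regex match.
my @fields = ("Bob's Company", 'VA', 'chickens and cows', 'April 23, 2003', 2365);
print to_delimited(@fields), "\n";
```

Because the date is quoted, its comma no longer clashes with the delimiter. Note that strict CSV readers may not accept the space after each comma, so drop it from the `join` if the output is headed for one.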
Re: Parsing irregularly delimited data by artist (Parson) on Apr 29, 2003 at 21:02 UTC
Going by your data, here is a script that uses cLive ;-)'s regex, with the surrounding details filled in.

while (<DATA>) {
    # Everything before the date stays in one chunk; the date and the
    # trailing number are anchored to the end of the line.
    my ($companyinfo, $date, $number) =
        m[^(.*?)\s+(\w+\s+\d{1,2},\s+\d{4})\s+(\d+)$]
        or next;    # skip lines that do not fit the pattern
    print "$companyinfo\n$date\n$number\n\n";
}
__DATA__
Bob's Company   VA chickens and cows April 23, 2003 2365
Elizabeth P. Jones Inc.WY   widgets  February 4, 2003 4
Big Huge CorporationUSAserversworkstationsrouters April 17, 2003 99999

artist
Re: Parsing irregularly delimited data by graff (Chancellor) on Apr 30, 2003 at 04:00 UTC
You may want to watch out for "unprintable" characters -- there may be delimiters that are not visible (e.g. null bytes, invisible control codes, etc). It would be worthwhile to follow the earlier advice -- yank out the easy parts first, push the hard parts to a separate listing -- and then do a more careful diagnosis of the hard parts. A simple hex dump of the character data can help, or printing a tabulation of the byte values that occur in the data (which might make it easier to spot unexpected bytes), e.g.:

#!/usr/bin/perl
# chist.perl -- print histogram of byte values
# (useful for seeing if text contains invisible characters)

my @chist;
while (<>) {
    # count every character of the line by its byte value
    for my $c ( split // ) {
        $chist[ ord($c) ]++;
    }
}
# print count and hex value for each byte that occurred
for my $i ( 0 .. 255 ) {
    printf( "%d\tx%0.2x\n", $chist[$i], $i ) if $chist[$i];
}

# there are ways to reduce that to a fairly short one-liner,
# but you may want to add options...
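As a companion to the histogram, a small sketch that makes invisible bytes visible in place, so you can see where in a line they sit. The helper name `show_invisible` is made up, and the sample lines with an embedded null byte and tab are invented for illustration; nothing in the thread confirms the real files contain them.

```perl
use strict;
use warnings;

# Replace every character outside printable ASCII (space through tilde)
# with a visible \xNN escape. Tabs are escaped too.
sub show_invisible {
    my ($line) = @_;
    (my $visible = $line) =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;
    return $visible;
}

# Invented sample lines: the control characters here are an assumption.
my @sample = (
    "Big Huge Corporation\x00USA\x01servers April 17, 2003 99999",
    "Bob's Company\tVA chickens and cows April 23, 2003 2365",
    "Elizabeth P. Jones Inc.WY   widgets  February 4, 2003 4",
);

# Print only the lines that contained something invisible.
for my $line (@sample) {
    my $visible = show_invisible($line);
    print "$visible\n" if $visible ne $line;
}
```

If hidden delimiters do turn up, they can be used directly in a `split` pattern instead of inserting extra whitespace by hand.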