space_cadet has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have some very large ugly data files with irregularly delimited (or non-delimited) elements. The files look like this:

    Bob's Company VA chickens and cows April 23, 2003 2365
    Elizabeth P. Jones Inc.WY widgets February 4, 2003 4
    Big Huge CorporationUSAserversworkstationsrouters April 17, 2003 99999

After some manipulations I've been able to parse based on two or more whitespace characters:

    while (<READFILE>) {
        chomp;
        # split takes no /g modifier; it already splits on every match
        @columns = split /\s{2,}/, $_, 7;
    }

I was wondering if anyone can think of a better way to do this. It took a great deal of work to reliably insert two or more whitespace characters between valid data elements. Also, the files are quite large, and what if I weren't assured of having a fixed number of elements?

Edit by tye, remove BR tags, use CODE tags so extra spaces are visible

Re: Parsing irregularly delimited data
by Aragorn (Curate) on Apr 29, 2003 at 20:31 UTC
    With (mostly) irregularly formatted data, there is always a portion that can't be chopped up neatly. Try to match most of the lines with a split or a regular expression, and "redirect" the non-matching lines to another file, to be examined by hand or run through some other program.
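
A minimal sketch of that triage, for illustration only. The pattern `$record_re` and the sub name `triage` are assumptions, not something from the thread, and the sample lines are kept in memory so the snippet runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical pattern for a "well-formed" record: any leading text, a date
# such as "April 23, 2003", then a trailing integer. Adjust to your data.
my $record_re = qr/^(.*?)\s+(\w+\s+\d{1,2},\s+\d{4})\s+(\d+)\s*$/;

# Split a list of lines into records that parse and lines that do not.
sub triage {
    my @lines = @_;
    my (@parsed, @rejects);
    for my $line (@lines) {
        if (my @fields = $line =~ $record_re) {
            push @parsed, \@fields;     # [leading text, date, count]
        }
        else {
            push @rejects, $line;       # keep for manual inspection
        }
    }
    return (\@parsed, \@rejects);
}

my @sample = (
    "Bob's Company VA chickens and cows April 23, 2003 2365",
    "this line has no date or count",
);
my ($parsed, $rejects) = triage(@sample);
printf "%d parsed, %d set aside for review\n",
    scalar @$parsed, scalar @$rejects;
```

For large files you would more likely read line by line from a filehandle and print each reject straight to a second file instead of collecting everything in arrays.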

    Arjen

Re: Parsing irregularly delimited data
by cLive ;-) (Prior) on Apr 29, 2003 at 20:22 UTC
    It might help if you give an example of what you want the data to look like when extracted!

    A regular expression might be a better idea. Hint:

    [A-Z]{2} \w+\s+\d{1,2},\s+\d{4}\s+\d+

    .02

    cLive ;-)

      Thanks. A regular expression is probably the way to go when one doesn't know the number of elements. For eventual processing, all I need is for the data to be nicely delimited.

      "Bob's Company", "VA", "chickens and cows", "April 23 2003", 2365
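
One possible way to get from the raw lines to that delimited form, as a rough sketch. It assumes every usable record is laid out as company, two-letter upper-case state, product, "Month D, YYYY", integer count, and that the state is followed by whitespace; the sub name `to_delimited` is made up for the example. A company name that itself contains two capitals followed by a space (an acronym, say) would be split in the wrong place, so treat the pattern as a starting point.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed layout (not confirmed by the thread): company, two-letter state,
# product, "Month D, YYYY", integer count. State and date anchor the match.
my $re = qr/^(.*?)\s*([A-Z]{2})\s+(.+?)\s+(\w+\s+\d{1,2}),\s+(\d{4})\s+(\d+)\s*$/;

# Return one delimited line, or nothing when the record does not fit.
sub to_delimited {
    my ($line) = @_;
    my ($company, $state, $product, $month_day, $year, $count) = $line =~ $re
        or return;
    my @text = ($company, $state, $product, "$month_day $year");
    for (@text) {
        s/\s+/ /g;        # collapse runs of whitespace
        s/"/""/g;         # double any embedded quotes
        $_ = qq("$_");    # wrap the field in quotes
    }
    return join(', ', @text, $count);
}

my @lines = (
    "Bob's Company VA chickens and cows April 23, 2003 2365",
    "Elizabeth P. Jones Inc.WY widgets February 4, 2003 4",
    "Big Huge CorporationUSAserversworkstationsrouters April 17, 2003 99999",
);
for my $line (@lines) {
    my $out = to_delimited($line);
    print defined $out ? "$out\n" : "UNPARSED: $line\n";
}
```

The first two sample lines come out delimited. The third has no whitespace after its state-like token, so it is reported as unparsed instead of being guessed at.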
Re: Parsing irregularly delimited data
by artist (Parson) on Apr 29, 2003 at 21:02 UTC
    Going by your data, here is a script that uses cLive ;-)'s regex, with the surrounding details filled in.

    while (<DATA>) {
        # everything before the date, then the date, then the trailing number;
        # skip lines that do not fit instead of printing empty fields
        my ($companyinfo, $date, $number) =
            m[^(.*?)\s+(\w+\s+\d{1,2},\s+\d{4})\s+(\d+)$]
            or next;
        print "$companyinfo\n$date\n$number\n\n";
    }
    __DATA__
    Bob's Company VA chickens and cows April 23, 2003 2365
    Elizabeth P. Jones Inc.WY widgets February 4, 2003 4
    Big Huge CorporationUSAserversworkstationsrouters April 17, 2003 99999

    artist

Re: Parsing irregularly delimited data
by graff (Chancellor) on Apr 30, 2003 at 04:00 UTC
    You may want to watch out for "unprintable" characters -- there may be delimiters that are not visible (e.g. null bytes, invisible control codes, etc). It would be worthwhile to follow earlier advice -- yank out the easy parts first, push the hard parts to a separate listing -- and then do a more careful diagnosis of the hard parts. A simple hex-dump of the character data can help, or printing some tabulation of byte values that occur in the data (which might make it easier to spot unexpected bytes) -- e.g.:

    #!/usr/bin/perl
    # chist.perl -- print histogram of byte values
    # (useful for seeing if text contains invisible characters)

    while (<>) {
        for $c ( split // ) {
            $chist[ord($c)]++;
        }
    }
    for ( $i=0; $i<256; $i++ ) {
        printf("%d\tx%0.2x\n", $chist[$i], $i) if ( $chist[$i] );
    }
    # there are ways to reduce that to a fairly short one-liner,
    # but you may want to add options...
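
Once a histogram shows that odd bytes are present, it can help to see where they sit in a given line. The following is a small sketch along those lines, not part of the script above: the sub name `reveal` and the sample string are invented for the example, and it assumes the data is handled as raw bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace every byte outside printable ASCII (0x20-0x7e) with a \xHH escape,
# so that tabs, NULs and other invisible delimiters show up in the output.
sub reveal {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7e])/sprintf '\\x%02x', ord $1/ge;
    return $text;
}

# Hypothetical sample: a tab and a NUL byte hiding between the fields.
my $sample = "Jones Inc.\tWY\0widgets";
print reveal($sample), "\n";    # prints: Jones Inc.\x09WY\x00widgets
```

When feeding it lines read from a file, chomp first; otherwise every line ends in a visible \x0a.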