parsing sloppy text from columns

A friend was bemoaning a few Linux quota tools that gave hard-to-parse output. They would print a header line, followed by a number of lines of data. An example:

Filesystem  blocks   quota   limit   grace   files ...
 /dev/hdd3   26320   46080   51200            1521 ...
 /dev/hdd4   26320           51200            1060 ...

My friend's problem was trying to deal with the "empty" columns in the data lines. When just using a naive split(), the empty columns would disappear, throwing off the count.
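
To make the failure concrete, here is a small sketch that runs a whitespace split over the two sample data lines (with the trailing "..." left off). The variable names are made up for illustration only.

```perl
use strict;
use warnings;

# The two sample data lines from above, without the trailing "...".
my $full    = " /dev/hdd3   26320   46080   51200            1521";
my $partial = " /dev/hdd4   26320           51200            1060";

# split ' ' splits on runs of whitespace and skips leading whitespace,
# so blank columns leave no trace in the result.
my @full_fields    = split ' ', $full;
my @partial_fields = split ' ', $partial;

print scalar(@full_fields),    " fields: @full_fields\n";
print scalar(@partial_fields), " fields: @partial_fields\n";
```

The header names six columns, but the first line yields five fields and the second only four, so values end up under the wrong heading (51200 lands in the "quota" slot of the second line).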

# controlled search-and-replace to insert column markers
# using the header line as a guide to where the columns lie

my $headers = <>;
$headers =~ s/(\S)\s/$1\#/g;   # put a '#' just past the end of each header word
while (<>)
{
  # $-[0] is the start offset of the whitespace character just matched
  s/\s/ (substr($headers, $-[0], 1) eq '#')? '#' : ' ' /eg;
  my @fields = split /\s*\#\s*/;
  # @fields now has true columns ready for whitespace trimming
}
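
For anyone who wants to try the idea without a quota tool at hand, here is a self-contained sketch of the same marker technique. The helper name `split_by_header` and the `sprintf` format used to rebuild the sample lines are my own assumptions, and it relies on the values being right-aligned under their header words as in the sample.

```perl
use strict;
use warnings;

# Hypothetical helper: split one data line, using the header as a column guide.
sub split_by_header {
    my ($header, $line) = @_;
    chomp(my $guide = $header);
    chomp $line;

    # Put a '#' just past the end of each header word.
    $guide =~ s/(\S)\s/$1\#/g;

    # Turn a whitespace character into '#' wherever the guide has a marker.
    # $-[0] is the start offset of the current match; anything past the end
    # of the guide is treated as an ordinary space.
    $line =~ s/\s/ $-[0] < length($guide) && substr($guide, $-[0], 1) eq '#' ? '#' : ' ' /eg;

    # Split on the markers (keeping trailing empty fields), then trim.
    my @fields = split /\#/, $line, -1;
    s/^\s+|\s+$//g for @fields;
    return @fields;
}

# Rebuild the sample output; the column widths are an assumption
# read off the sample, not something the quota tools document.
my $fmt    = "%10s%8s%8s%8s%8s%8s\n";
my $header = sprintf $fmt, qw(Filesystem blocks quota limit grace files);
my @lines  = (
    sprintf($fmt, '/dev/hdd3', 26320, 46080, 51200, '', 1521),
    sprintf($fmt, '/dev/hdd4', 26320, '',    51200, '', 1060),
);

print join('|', split_by_header($header, $_)), "\n" for @lines;
```

Each line comes back as six fields, with an empty string wherever the tool printed nothing, so the values stay under the right heading.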


Replies are listed 'Best First'.
Re: parsing sloppy text from columns by rob_au (Abbot) on May 08, 2003 at 03:29 UTC
An alternate approach would be to make use of the `unpack` function ... `my @line = unpack( 'A10A8A8A8A8A8', $line );` Although I believe both approaches could be broken by overlong fields or wrapped text. But then again, there are undoubtedly better tools for that :-)
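
A minimal sketch of the `unpack` idea applied to the sample data, assuming the fixed widths in the template above really match the tool's output. The helper name `unpack_columns` and the `sprintf` format used to rebuild the sample line are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a line into fixed-width fields and trim them.
sub unpack_columns {
    my ($template, $line) = @_;
    chomp $line;
    my @fields = unpack($template, $line);
    # 'A' fields drop trailing spaces only, so strip leading ones by hand.
    s/^\s+// for @fields;
    return @fields;
}

my $template = 'A10A8A8A8A8A8';   # widths taken from the template above

# Rebuild the second sample line (blank quota and grace columns).
my $line = sprintf "%10s%8s%8s%8s%8s%8s\n",
    '/dev/hdd4', 26320, '', 51200, '', 1060;

my @fields = unpack_columns($template, $line);
print join('|', @fields), "\n";
```

The blank columns come back as empty strings rather than vanishing, but the widths are hard-coded, so a value wider than its column would spill into its neighbour.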
Re: parsing sloppy text from columns by draconis (Scribe) on May 08, 2003 at 17:57 UTC
Another simple yet effective way to eliminate the 'extra' whitespace would be to hit the line with something like this: `$line =~ s/ +/ /g;` This collapses each run of spaces into a single space so that you can then use a naive split().
Re^2: parsing sloppy text from columns (not the solution) by Aristotle (Chancellor) on May 10, 2003 at 19:29 UTC
That will break when the columns are not consistently filled with values, for example if a value of 0 for some column results in nothing getting printed at all.
Re: Re^2: parsing sloppy text from columns (not the solution) by draconis (Scribe) on May 12, 2003 at 13:27 UTC
You are absolutely correct. My solution will ONLY work for a data set where one has confidence in the data and what is presented (i.e. you know you have 6 columns and always get 6 columns). I appreciate you correcting this. My apologies.