New module -- Fixed width data solution

Re: New module -- Fixed width data solution
by talexb (Chancellor) on May 05, 2008 at 13:54 UTC

I note that this module requires Perl 5.10 -- is that noted anywhere other than the CPAN page or in your original node, reproduced below?

Today is a great day for perl. The most significant module to ever grace cpan has been uploaded. It is my pleasure to announce DataExtract::FixedWidth. Finally a sexy way to deal with "databases" that were coughed out of applications that suck. This module does not suck. It rocks. It pwns. It is l33t.

It was written using Moose by the coolest mofo around. He rocks. Check it out and give me feedback plz. Thanks.

UPDATE: This module is an attempt at a full-fledged, out-of-the-box solution to fixed-column-width tables. These types of tables are often output by ghostscript, or come from a tab-separated file that went through some sort of unexpand. Humans deal with this better because we can simply look at the table as a whole. This module makes this form of data infinitely more user-friendly. etc.

Evan Carroll
I hack for the ladies.
www.EvanCarroll.com

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: New module -- Fixed width data solution
by runrig (Abbot) on May 05, 2008 at 23:38 UTC

Two times, it's been pointed out before

Re^2: New module -- Fixed width data solution

by stvn (Monsignor) on May 07, 2008 at 14:44 UTC

but Moose is an awful lot to load to parse fixed-width data.

Load time for Moose has been going down lately, so that complaint is starting to become more FUD than Fact (a few lines of XS recently made Moose 45% less slow at startup). Also, I would argue that when parsing large data files, your application's performance will be completely IO-bound, and the extra couple hundred milliseconds that Moose requires at startup (assuming the class is made immutable) will not be an issue.

-stvn

Re^3: New module -- Fixed width data solution

by runrig (Abbot) on May 07, 2008 at 18:30 UTC

Load time for Moose has been going down lately...

time

The most significant module to ever grace cpan

Re: New module -- Fixed width data solution
by Jenda (Abbot) on May 05, 2008 at 16:10 UTC

Too automated for my tastes. For example, why do you think the column names never have spaces? What if the header with the column names spans several lines (to accommodate longer multiword column headings)? What about the line of dashes separating the headers from the data in quite a few fixed-width reports? How do I get rid of page numbers and form feeds? How do you know whether the column names are left- or right-aligned?

Jenda
Support Denmark!
Defend the free world!

Re: New module -- Fixed width data solution
by jhourcle (Prior) on May 05, 2008 at 20:17 UTC

I have to agree with Jenda -- although I could see use for this in some cases (although I don't have 5.10 installed, so can't test it right now), I often have need for:

Support for lack of headers (the default when dumping MS Access ... and I think Excel, too) ... possibly just give array but no hash access. (I've also had situations w/ incorrect headers ... two (four) columns were inserted between 'Accel' and 'MPA')
Stripping header / footer lines
Variable width final columns (possibly missing)

Less likely for a generic tool, but other issues I run into include:

Subheadings (so you have two column formats, and subheading data trickles down to the records under it)
processing pre-formatted HTML (you have to strip the HTML, but keep track of which column it was in so you can assign color / links to a consistent field)

I currently use a script which can handle the first two cases (well, the second one, which handles the first one indirectly) and then passes the remainder to an unpack mask generator from BrowserUK, modified to handle the third issue.

Re^2: New module -- Fixed width data solution

by EvanCarroll (Chaplain) on May 05, 2008 at 22:34 UTC

Evan Carroll
I hack for the ladies.
www.EvanCarroll.com

Re^3: New module -- Fixed width data solution

by jhourcle (Prior) on May 06, 2008 at 02:20 UTC

The nature of unpack means it supports variable width final columns; they will show up as an empty string by default, or as undef if you enable the option.

Sorry -- I couldn't test the code (don't have 5.10, nor Moose installed), and didn't notice that you correctly handled that logic in your code.
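For readers who cannot run the module either, here is a minimal sketch of the plain unpack behaviour being referred to. It is not the module's code: the template, the parse_line helper and its $want_undef flag are all made up for illustration, and the flag only imitates the "undef instead of empty string" option mentioned in the quote.

```perl
use strict;
use warnings;

# Hypothetical layout: 6-char name, 4-char quantity, free-form note to end of line.
my $template = 'A6 A4 A*';

# Illustrative helper (not part of any module): split one line into fields.
sub parse_line {
    my ($line, $want_undef) = @_;
    # 'A' strips trailing spaces; a field that lies past the end of the
    # line comes back as an empty string rather than being dropped.
    my @fields = unpack $template, $line;
    # Optionally turn empty fields into undef.
    @fields = map { length($_) ? $_ : undef } @fields if $want_undef;
    return @fields;
}

my @full  = parse_line('widget12  blue');   # all three columns present
my @short = parse_line('bolt  3');          # final column missing
my @undef = parse_line('bolt  3', 1);       # same line, empty field as undef
```

The short line still yields three fields, so code that indexes by column position keeps working when the last column is absent.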

The lack of headers is an issue; the biggest time saver of using the module is not having to count the chars for the unpack template by hand -- I might be able to make this work with heuristics if you send in the whole table though -- I'll have to look at it.

See BrowserUK's code that does just that -- it determines where in the data there are consistently spaces, and then uses that as the column boundaries. Obviously, that's not true in all cases, as I often have to process lines such as:

Read more... (311 Bytes)

... where the only thing significant is that the character changes ... you could use specific characters to signal different data types (i.e., is it handled as a string, numeric, boolean?). Of course, you'd need two characters for each one, to handle those times when two of the same field abut without whitespace between them.
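The "consistently spaces" heuristic mentioned above can be sketched roughly as follows. This is not BrowserUK's code, which is not reproduced in this thread; guess_template is a hypothetical name, and the sketch assumes every field is separated from its neighbour by at least one character position that is blank in every line.

```perl
use strict;
use warnings;

# Hypothetical helper: derive an unpack template from sample lines by
# treating character positions that are blank in every line as separators.
sub guess_template {
    my @lines = @_;
    my $width = 0;
    for my $line (@lines) {
        $width = length($line) if length($line) > $width;
    }

    # $blank[$i] stays true only if position $i is a space (or past the
    # end of the line) in all of the sample lines.
    my @blank = (1) x $width;
    for my $line (@lines) {
        my $padded = $line . ' ' x ($width - length($line));
        for my $i (0 .. $width - 1) {
            $blank[$i] = 0 if substr($padded, $i, 1) ne ' ';
        }
    }

    # A field starts at a non-blank position that follows a blank one.
    my @starts = grep { !$blank[$_] && ($_ == 0 || $blank[$_ - 1]) } 0 .. $width - 1;
    return '' unless @starts;
    $starts[0] = 0;    # the first field always begins at the start of the line

    # Every field but the last gets a fixed width; the last takes the rest.
    my @parts = map { 'A' . ($starts[$_ + 1] - $starts[$_]) } 0 .. $#starts - 1;
    return join ' ', @parts, 'A*';
}

# Sample table with a right-justified middle column (made-up data).
my @sample = map { sprintf '%-6s %3s %s', @$_ }
    ['NAME', 'QTY', 'COLOR'], ['bolt', 3, 'zinc'], ['washer', 12, 'black'];

my $guessed = guess_template(@sample);
my @record  = unpack $guessed, $sample[2];
```

Note that 'A' only strips trailing spaces, so right-justified values can keep leading blanks and may need trimming. As the post says, the heuristic breaks down when fields abut without whitespace.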

The stripping piece is all done by the user; we don't touch the file handle. You send me the line to process.

That's one approach -- but trust me on this -- with the type of data I process, this happens so often that I just want to pass in the number of lines of header/footers, or a regex to denote where to start/stop. If I have to go to the trouble of wrapping your code to handle this rather elementary task, I'm just not going to use it -- I'm going to use my own, as I don't see any real advantage otherwise -- it's not worth forcing people to update Perl and install Moose just to do this sort of work. If you're going to bill your module as 'The most significant module to ever grace cpan', I'd have expected a little bit more.
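To make the request concrete, the kind of interface meant here could look something like the sketch below. The data_lines helper and its option names (head, foot, start, stop) are hypothetical; nothing like this is claimed to exist in the module.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the data lines of a report.
# Options (all names are illustrative):
#   head  => N      drop the first N lines
#   foot  => N      drop the last N lines
#   start => qr//   data begins on the line after the first match
#   stop  => qr//   data ends on the line before the first match
sub data_lines {
    my ($lines, %opt) = @_;
    my @body = @$lines;

    splice @body, 0, $opt{head} if $opt{head};
    if (my $n = $opt{foot}) {
        $n = @body if $n > @body;    # never ask for more lines than exist
        splice @body, -$n if $n;
    }
    if ($opt{start}) {
        my ($i) = grep { $body[$_] =~ $opt{start} } 0 .. $#body;
        splice @body, 0, $i + 1 if defined $i;
    }
    if ($opt{stop}) {
        my ($i) = grep { $body[$_] =~ $opt{stop} } 0 .. $#body;
        splice @body, $i if defined $i;
    }
    return @body;
}

# Made-up report: title, column names, a rule of dashes, data, a footer.
my @report = (
    'Catalog dump',
    'NAME QTY',
    '---- ---',
    'bolt   3',
    'nut   12',
    '(2 rows)',
);

# Start after the rule of dashes and drop the one-line footer.
my @rows = data_lines(\@report, start => qr/^-+(?:\s+-+)*\s*$/, foot => 1);
```

The head and foot counts are applied before the regexes, so the patterns only ever see what is left after the fixed-size trimming.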

(The parsers I'm writing are for people to parse scientific catalogs, to keep SQL databases synced up with the authoritative records, and I try to keep the necessary install to a minimum ... I don't even require DBD -- I generate CSV files and the necessary load routines for the database ... but that's also a performance tuning issue.)

talk

Re^4: New module -- Fixed width data solution

by EvanCarroll (Chaplain) on May 06, 2008 at 03:11 UTC

Re: New module -- Fixed width data solution
by runrig (Abbot) on May 07, 2008 at 00:32 UTC

It does not, e.g., seem to correctly parse the output of "ps -f" or "ps -l" (update: though I will give it credit for correctly parsing plain ol' "ps"). Probably because the column headers are sometimes right-justified and sometimes left-justified. And there don't seem to be any options to tweak the results to be correct.

Re^2: New module -- Fixed width data solution

by EvanCarroll (Chaplain) on May 17, 2008 at 23:30 UTC

Evan Carroll
I hack for the ladies.
www.EvanCarroll.com

Re: New module -- Fixed width data solution
by apl (Monsignor) on May 07, 2008 at 18:50 UTC

You seem to be incredibly sensitive to perceived criticism, so please don't take this the wrong way: What's the net advantage of using this module as opposed to doing an unpack?