Re: New module -- Fixed width data solution
by talexb (Chancellor) on May 05, 2008 at 13:54 UTC

    I note that this module requires Perl 5.10 -- is that noted anywhere other than the CPAN page or in your original node, reproduced below?

      Today is a great day for perl. The most significant module to ever grace cpan has been uploaded. It is my pleasure to announce DataExtract::FixedWidth. Finally a sexy way to deal with "databases" that were coughed out of applications that suck. This module does not suck. It rocks. It pwns. It is l33t.

      It was written using Moose by the coolest mofo around. He rocks. Check it out and give me feedback plz. Thanks.

      UPDATE: This module is an attempt at a full-fledged, out-of-the-box solution to fixed-column-width tables. These types of tables are often output by ghostscript, or come from a tab-separated file that went through some sort of unexpand. Humans deal with this better because we can simply look at the table as a whole. This module makes this form of data infinitely more user-friendly. etc.

      Evan Carroll
      I hack for the ladies.
      www.EvanCarroll.com

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: New module -- Fixed width data solution
by runrig (Abbot) on May 05, 2008 at 23:38 UTC
    Two times now you have rated your own module 5 stars. It's been pointed out before that this is considered poor taste. The module may be useful, but Moose is an awful lot to load to parse fixed-width data.
      but Moose is an awful lot to load to parse fixed-width data.

      Load time for Moose has been going down lately, so that complaint is starting to become more FUD than Fact (a few lines of XS recently made Moose 45% less slow at startup). I would also argue that when parsing large data files, your application's performance will be completely IO-bound, and the extra couple hundred milliseconds that Moose requires at startup (assuming the class is made immutable) will not be an issue.

      -stvn
        Load time for Moose has been going down lately...
        I wasn't specifically talking about load time, but point taken. I went ahead and installed Moose to test this module, and although I'd like to try out Moose some more, I doubt I'll be using this module much, even though it's "The most significant module to ever grace cpan."
Re: New module -- Fixed width data solution
by Jenda (Abbot) on May 05, 2008 at 16:10 UTC

    Too automated for my tastes. For example, why do you think the column names never have spaces? What if the header with the column names spans several lines (to accommodate longer multiword column headings)? What about the line of dashes separating the headers from the data in quite a few fixed-width reports? How do I get rid of page numbers and form feeds? How do you know whether the column names are left- or right-aligned?

Re: New module -- Fixed width data solution
by jhourcle (Prior) on May 05, 2008 at 20:17 UTC

    I have to agree with Jenda -- although I could see use for this in some cases (although I don't have 5.10 installed, so can't test it right now), I often have need for:

    Less likely for a generic tool, but other issues I run into include:

    • Subheadings (so you have two column formats, and subheading data trickles down to the records under it)
    • processing pre-formatted HTML (you have to strip the HTML, but keep track of which column it was in so you can assign color / links to a consistent field)

    I currently use a script which can handle the first two cases (well, the second one, which handles the first one indirectly), and then passes the remainder to an unpack mask generator from BrowserUK, modified to handle the third issue.

      The nature of unpack means it supports variable-width final columns; they will show up as an empty string by default, or undef if you enable the option.

      The lack of headers is an issue: the biggest time saver of using the module is not having to count the chars for the unpack template by hand. I might be able to make this work with heuristics if you send in the whole table, though -- I'll have to look at it.

      The stripping piece is all done by the user; we don't touch the file handle. You send me the line to process.


      Evan Carroll
      I hack for the ladies.
      www.EvanCarroll.com
        The nature of unpack means it supports variable-width final columns; they will show up as an empty string by default, or undef if you enable the option.

        Sorry -- I couldn't test the code (don't have 5.10, nor Moose installed), and didn't notice that you correctly handled that logic in your code.

        The lack of headers is an issue: the biggest time saver of using the module is not having to count the chars for the unpack template by hand. I might be able to make this work with heuristics if you send in the whole table, though -- I'll have to look at it.

        See BrowserUK's code that does just that -- it determines where in the data there are consistently spaces, and then uses that as the column boundaries. Obviously, that's not true in all cases, as I often have to process lines such as:

        ... where there are actually 7 columns, 3 of which are 1 character wide and hold boolean information (each is actually a quality flag for the preceding column). The only good solution I've found for processing this in an automated way (without directly counting columns) is to create some sort of an input mask and parse that, so I might generate something such as:

        ... where the only thing significant is that the character changes. You could use specific characters to signal different data types (i.e., is it handled as a string, numeric, or boolean?). Of course, you'd need two characters for each type, for those times when two fields of the same type abut without whitespace between them.
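A minimal sketch of the input-mask idea above, written in Python purely for illustration (it is not the poster's script, and the mask, sample line, and function names are all made up). It assumes only what the post states: a new column starts wherever the mask character changes.

```python
from itertools import groupby


def mask_to_widths(mask):
    """Return the width of each run of identical characters in the mask."""
    return [len(list(run)) for _char, run in groupby(mask)]


def split_by_mask(line, mask):
    """Cut a fixed-width line into fields, one per run of the mask."""
    fields = []
    pos = 0
    for width in mask_to_widths(mask):
        # Strip padding so right- and left-justified values compare equal.
        fields.append(line[pos:pos + width].strip())
        pos += width
    return fields


# Hypothetical 7-column record: an id, then three (value, 1-char flag) pairs
# that abut with no whitespace between value and flag.
mask = "aaaaabbbbbbcddddddeffffffg"
line = "AB12  12.50Y  3.75N 10.00Y"
fields = split_by_mask(line, mask)
```

Here `fields` comes out as seven strings, with each one-character flag separated from the value in front of it even though no whitespace divides them. Typed masks (string vs. numeric vs. boolean) are left out of the sketch.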

        The stripping piece is all done by the user; we don't touch the file handle. You send me the line to process.

        That's one approach -- but trust me on this -- in the type of data I process, this happens so often that I just want to pass in the number of header/footer lines, or a regex to denote where to start/stop. If I have to go to the trouble of wrapping your code to handle this rather elementary task, I'm just not going to use it. I'm going to use my own, as I don't see any real advantage otherwise; it's not worth forcing people to update Perl and install Moose just to do this sort of work. If you're going to bill your module as 'The most significant module to ever grace cpan', I'd have expected a little bit more.
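The kind of interface asked for above (a count of header/footer lines, or a regex marking where data starts and stops) could look roughly like this. It is a hedged Python sketch, not part of DataExtract::FixedWidth; the function name, parameter names, and sample report are all hypothetical.

```python
import re


def body_lines(lines, skip_head=0, skip_foot=0, start=None, stop=None):
    """Return only the data lines of a report.

    skip_head / skip_foot: number of leading / trailing lines to drop.
    start / stop: optional regexes; data begins on the line after the
    first match of `start` and ends just before the first match of `stop`.
    """
    lines = list(lines)
    lines = lines[skip_head:len(lines) - skip_foot]
    kept = []
    started = start is None
    for line in lines:
        if not started:
            if re.search(start, line):
                started = True  # the marker line itself is not data
            continue
        if stop is not None and re.search(stop, line):
            break
        kept.append(line)
    return kept


report = [
    "Catalog X",
    "NAME  VAL",
    "----- ---",
    "a     1",
    "b     2",
    "Page 1",
]
by_count = body_lines(report, skip_head=3, skip_foot=1)
by_regex = body_lines(report, start=r"^-+", stop=r"^Page \d+")
```

Both calls keep only the two data rows, which can then be handed line by line to whatever does the column splitting.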

        (the parsers I'm writing are for people to parse scientific catalogs, to keep SQL databases synced up with the authoritative records, and I try to keep the necessary install to a minimum ... I don't even require DBD -- I generate CSV files and the necessary load routines for the database ... but that's also a performance tuning issue.

        For anyone who's going to actually stay through the end of the SPD / AGU joint session, I'm giving a talk late on Friday on the work, and although I'll touch on the issues with parsing, my bigger issue is assigning semantic meanings to the columns, so that catalogs can be cross-correlated in a meaningful way.)

Re: New module -- Fixed width data solution
by runrig (Abbot) on May 07, 2008 at 00:32 UTC
    It does not, e.g., seem to correctly parse the output of "ps -f" or "ps -l" (update: though I will give it credit for correctly parsing plain ol' "ps"). Probably because the column headers are sometimes right-justified and sometimes left-justified. And there don't seem to be any options to tweak the results to be correct.
      There are now such options... Version 0.05 will have two tests just for ps -lA. Essentially, you provide ->heuristic(@lines), and then you can call $de->parse_hash($_) for @lines. Heuristic will run through BrowserUK's simple algorithm, then apply the resulting unpack string to the first row to get the column names, and key all calls to parse_hash by that data.
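For readers who want the gist of that flow, here is a rough Python sketch. It is neither the module's code nor BrowserUK's; it only assumes what this thread describes: positions that are blank in every line are column gaps, the first row supplies the names, and each later row is keyed by those names. All function names and the sample input are hypothetical.

```python
def find_spans(lines):
    """Return (start, end) spans that are not blank in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    spans = []
    start = None
    for i in range(width):
        gap = all(line[i] == " " for line in padded)
        if not gap and start is None:
            start = i                  # a column begins
        elif gap and start is not None:
            spans.append((start, i))   # a column ends at a shared blank
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def parse_rows(lines):
    """Use the first line as header names and return one dict per data row."""
    spans = find_spans(lines)

    def cut(line):
        return [line[a:b].strip() for a, b in spans]

    header = cut(lines[0])
    return [dict(zip(header, cut(line))) for line in lines[1:]]


# Made-up ps-like output: PID is right-justified, the rest left-justified.
sample = [
    "  PID TTY      CMD",
    "    1 ?        init",
    " 4242 pts/0    bash",
]
rows = parse_rows(sample)
```

Because the spans come from the table as a whole, header justification stops mattering. The catch noted earlier in the thread still applies: a column whose values contain spaces at a shared position would be split in two.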

      Enjoy. Advice and suggestions and use cases are all appreciated.


      Evan Carroll
      I hack for the ladies.
      www.EvanCarroll.com
Re: New module -- Fixed width data solution
by apl (Monsignor) on May 07, 2008 at 18:50 UTC
    You seem to be incredibly sensitive to perceived criticism, so please don't take this the wrong way: What's the net advantage of using this module as opposed to doing an unpack?
      Not having to learn pack :)