The nature of unpack means it supports variable-width final columns; they will show up as an empty string by default, or undef if you enable the option.
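A minimal sketch of the raw unpack side of that behaviour (the template and sample records here are made up, and the empty-string-to-undef switch is the module option, not something unpack itself does):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # An 'A*' at the end of the template soaks up whatever is left, so the
    # final column can be any width -- or missing entirely, in which case
    # you get an empty string back.
    my $template = 'A10 A8 A*';

    for my $line ("Smith     19800101New York", "Jones     19751231") {
        my ($name, $dob, $city) = unpack $template, $line;
        $city = '' unless defined $city;   # guard against a missing last field
        printf "name=%s dob=%s city=[%s]\n", $name, $dob, $city;
    }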
The lack of headers is an issue; the biggest time saver in using the module is not having to count the characters for the unpack template by hand. I might be able to make this work with heuristics if you send in the whole table, though -- I'll have to look at it.
The stripping piece is all done by the user; we don't touch the file handle. You send me the line to process.
> The nature of unpack means it supports variable-width final columns; they will show up as an empty string by default, or undef if you enable the option.
Sorry -- I couldn't test the code (I don't have 5.10 or Moose installed), and didn't notice that you had already handled that logic correctly in your code.
> The lack of headers is an issue; the biggest time saver in using the module is not having to count the characters for the unpack template by hand. I might be able to make this work with heuristics if you send in the whole table, though -- I'll have to look at it.
See BrowserUK's code, which does just that -- it determines where in the data there are consistently spaces, and then uses those positions as the column boundaries. Obviously, that doesn't hold in all cases, as I often have to process lines such as:
... where there are actually 7 columns, 3 of which are 1 character wide and hold boolean information (each is actually a quality flag for the preceding column). The only good solution I've found for processing this in an automated way (without directly counting columns) is to create some sort of input mask and parse that ... so I might generate something such as:
... where the only significant thing is that the character changes ... you could use specific characters to signal different data types (i.e., is it handled as a string, numeric, or boolean?). Of course, you'd need two characters for each type, for those times when two fields of the same type abut without whitespace between them.
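To make the mask idea concrete, here is a hypothetical sketch -- the mask string and the type letters (s = string, n = numeric, b = boolean) are invented for illustration. Each run of identical characters is one field, so a change of character is what lets the one-character boolean columns abut their neighbours (and, as noted above, two adjacent fields of the same type would need two different letters, say 's' and 'S'):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Made-up mask: 9-char string, 8-char numeric, 1-char boolean,
    # 7-char string, 1-char boolean.
    my $mask = 'sssssssssnnnnnnnnbsssssssb';

    my (@widths, @types);
    while ($mask =~ /\G((.)\2*)/g) {   # each run of repeated characters
        push @widths, length $1;
        push @types,  $2;
    }

    my $template = join ' ', map { "A$_" } @widths;
    print "$template\n";               # A9 A8 A1 A7 A1
    print "@types\n";                  # s n b s b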
> The stripping piece is all done by the user; we don't touch the file handle. You send me the line to process.
That's one approach -- but trust me on this -- with the type of data I process, this happens so often that I just want to pass in the number of header/footer lines, or a regex to denote where to start/stop. If I have to go to the trouble of wrapping your code to handle this rather elementary task, I'm just not going to use it -- I'm going to use my own, as I don't see any real advantage otherwise -- it's not worth forcing people to update Perl and install Moose just to do this sort of work. If you're going to bill your module as 'The most significant module to ever grace cpan', I'd have expected a little bit more.
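The sort of wrapping in question is along the lines of the sketch below; the header count, the footer pattern, and where parse_line() would go are all hypothetical stand-ins, not anything the module provides:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $header_lines = 2;                # skip this many leading lines
    my $footer_re    = qr/^-{5,}\s*$/;   # stop when the footer starts

    while (my $line = <STDIN>) {
        next if $. <= $header_lines;     # drop the header by line count
        last if $line =~ $footer_re;     # stop once the footer is reached
        chomp $line;
        print "$line\n";                 # parse_line($line) would go here
    }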
(The parsers I'm writing are for people to parse scientific catalogs, to keep SQL databases synced up with the authoritative records, and I try to keep the necessary installation to a minimum ... I don't even require DBD -- I generate CSV files and the necessary load routines for the database, but that's also a performance-tuning issue. For anyone who's actually going to stay through to the end of the SPD/AGU joint session, I'm giving a talk late on Friday about this work; although I'll touch on the issues with parsing, my bigger concern is assigning semantic meanings to the columns, so that catalogs can be cross-correlated in a meaningful way.)
Thanks for giving me something to think about! I will update my module if I can think of a more elegant way to help you out here.
Update: v0.03 supports heuristics => \@lines, which is essentially a rip-off of BrowserUK's code; I just uploaded it for the first time last night. I'm still working on a solution for your */bool stuff.
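For anyone curious, a rough sketch of that style of heuristic (this is not BrowserUK's actual code or the module's implementation; the function name and sample data are invented): any column position that is blank in every sample line is treated as a gap, and field boundaries go where the non-blank runs start.

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub guess_unpack_template {
        my @lines = @_;

        my $width = 0;
        for my $line (@lines) {
            $width = length $line if length($line) > $width;
        }

        # A column is a candidate gap only if it is blank in every line
        # (positions past the end of a short line count as blank).
        my @blank = (1) x $width;
        for my $line (@lines) {
            for my $i (0 .. $width - 1) {
                my $ch = $i < length $line ? substr $line, $i, 1 : ' ';
                $blank[$i] = 0 if $ch ne ' ';
            }
        }

        # A field starts at column 0 and wherever a blank column is
        # followed by a non-blank one; widths are the gaps between starts.
        my @starts = (0);
        push @starts, grep { !$blank[$_] && $blank[$_ - 1] } 1 .. $width - 1;

        my @widths = map {
            ($_ < $#starts ? $starts[$_ + 1] : $width) - $starts[$_]
        } 0 .. $#starts;

        return join ' ', map { "A$_" } @widths;
    }

    my @sample = (
        'Smith      19800101 New York',
        'Jones      19751231 Boston',
    );
    print guess_unpack_template(@sample), "\n";   # prints "A11 A9 A8"

As noted above, data where columns abut with no whitespace (like the quality-flag example) will defeat this, which is why the mask idea is still on the table.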