
The nature of unpack means it supports variable-width final columns; they will show up as an empty string by default, or as undef if you enable the option.

Sorry -- I couldn't test the code (don't have 5.10, nor Moose installed), and didn't notice that you correctly handled that logic in your code.
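For anyone else who can't install the module, the underlying behaviour can be checked with core unpack alone. This is only a sketch: the template and sample strings are made up for illustration, and the map to undef is a stand-in for the module's option, not its actual implementation.

```perl
use strict;
use warnings;

# Three fields; 'A*' lets the last one take whatever is left on the line.
my $template = 'A5 A3 A*';

my @full  = unpack $template, 'abcdeXYZtrailing text';
my @short = unpack $template, 'abcdeXYZ';    # line ends before the last column

# Core unpack still returns three fields for the short line; the last
# one is an empty string. Turning it into undef is a one-line map
# (an assumption about what the module's option amounts to).
my @with_undef = map { length($_) ? $_ : undef } @short;

printf "%d fields, last is '%s'\n", scalar @short, $short[2];
```

Nothing here needs 5.10 or Moose.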

The lack of headers is an issue: the biggest time saver of using the module is not having to count the chars for the unpack template by hand -- I might be able to make this work with heuristics if you send in the whole table though -- I'll have to look at it.

See BrowserUK's code that does just that -- it finds the positions that hold a space on every line of the data, and then uses those as the column boundaries. Obviously, that doesn't work in all cases, as I often have to process lines such as:

   4.8    5.2e+15    2.6e+30    306 
  -3.5    2.4e+15    3.2e+30    245 
 -18.2*   1.2e+15    2.2e+29     80 
   3.4*   1.1e+15    4.4e+29     61 
   2.6    -------    -------     47 
   1.8*   1.6e+15    1.3e+30    333 
   2.8*   2.4e+14    1.0e+29     73 
   0.8    1.4e+15    2.8e+30    240 
 -16.6*   1.8e+14    5.1e+28     81 
  -9.3    8.3e+14    1.0e+30     73 
  91.4*   1.6e+14*   7.6e+29*     8 
  16.4*   7.4e+14    1.1e+30    173 
   8.2*   8.1e+14*   9.2e+29*    99 
   0.1*   8.5e+14    2.4e+29     67 
 -10.2*   1.6e+14    1.9e+28    103 
  12.3*   1.0e+15    1.2e+29    264 
  -5.3*   5.4e+14    1.2e+29     96 
   9.6*   2.0e+14    1.3e+29    179 
  35.2*   1.1e+15    8.1e+29     94 
  54.7*   -------    -------    339 
  34.7    1.7e+15    2.4e+30    100 
  -6.6*   3.0e+14    9.3e+29    279 
  -2.0*   1.7e+14    4.5e+29    121 
  -5.7*   5.3e+14    1.1e+29     90 
   2.9*   1.6e+14    1.9e+28     92 
   8.4*   -------    -------     92

... where there are actually 7 columns, 3 of which are 1 character wide and hold boolean information (each is a quality flag for the preceding column). The only good solution I've found for processing this in an automated way (without directly counting columns) is to create some sort of input mask and parse that, so I might generate something such as:

 #####|   #######|   #######|   ###
  16.4*   7.4e+14    1.1e+30    173 
   8.2*   8.1e+14*   9.2e+29*    99

... where the only thing significant is that the character changes. You could use specific characters to signal different data types (i.e., is it handled as a string, numeric, or boolean?). Of course, you'd need two characters for each type, for those times when two fields of the same type abut without whitespace between them.
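A rough sketch of the mask idea, assuming every run of one repeated non-blank character is a field and blanks are padding to skip. The names mask_to_template and parse_line are made up for this example and are not from any module; data types per mask character are left out.

```perl
use strict;
use warnings;

# Turn an input mask into an unpack template: each run of a repeated
# non-blank character becomes 'A<len>', each run of blanks 'x<len>'.
sub mask_to_template {
    my ($mask) = @_;
    my @parts;
    while ( $mask =~ /((.)\2*)/g ) {
        my ( $run, $char ) = ( $1, $2 );
        push @parts, ( $char eq ' ' ? 'x' : 'A' ) . length $run;
    }
    return join ' ', @parts;
}

# Unpack one line. The line is padded to the mask width first, because
# 'x' dies if it has to skip past the end of a short line.
sub parse_line {
    my ( $template, $width, $line ) = @_;
    my @fields = unpack $template, sprintf( '%-*s', $width, $line );
    s/^\s+// for @fields;    # 'A' only strips trailing blanks
    return @fields;
}

my $mask     = ' #####|   #######|   #######|   ###';
my $template = mask_to_template($mask);

my @fields = parse_line( $template, length $mask,
    '  16.4*   7.4e+14    1.1e+30    173 ' );
print join( '|', @fields ), "\n";    # 16.4|*|7.4e+14||1.1e+30||173
```

An unset flag comes back as an empty string, so the seven columns survive even on rows where no blank-column heuristic could see them. Since only a change of character matters, two abutting fields of the same type just need different mask characters.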

The stripping piece is all done by the user; we don't touch the file handle. You send me the line to process.

That's one approach -- but trust me on this -- with the type of data I process, this happens so often that I just want to pass in the number of header/footer lines, or a regex to denote where to start/stop. If I have to go to the trouble of wrapping your code to handle this rather elementary task, I'm just not going to use it -- I'm going to use my own, as I don't see any real advantage otherwise -- it's not worth forcing people to update Perl and install Moose just to do this sort of work. If you're going to bill your module as 'The most significant module to ever grace cpan', I'd have expected a little bit more.
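To show how little is being asked for, here is a sketch of that header/footer handling. The helper name data_lines and its option names (skip_head, skip_tail, start, stop) are invented for this example, and the header and footer text in the sample is made up; only the two data rows come from the table above.

```perl
use strict;
use warnings;

# Return only the data lines of a table.
#   skip_head / skip_tail : number of lines to drop at either end
#   start / stop          : regexes; data begins after the first line
#                           matching start and ends before the next
#                           line matching stop
sub data_lines {
    my ( $lines, %opt ) = @_;

    my $head = $opt{skip_head} || 0;
    my $tail = $opt{skip_tail} || 0;
    my $last = $#{$lines} - $tail;
    my @rows = $head <= $last ? @{$lines}[ $head .. $last ] : ();

    my @out;
    my $in_data = $opt{start} ? 0 : 1;
    for my $row (@rows) {
        if ( !$in_data ) {
            $in_data = 1 if $row =~ $opt{start};
            next;    # the start marker itself is not data
        }
        last if $opt{stop} && $row =~ $opt{stop};
        push @out, $row;
    }
    return @out;
}

my @file = (
    'Hypothetical catalog header',
    '-----',
    '   4.8    5.2e+15    2.6e+30    306 ',
    '  -3.5    2.4e+15    3.2e+30    245 ',
    '-----',
    'Hypothetical footer',
);

my @by_count = data_lines( \@file, skip_head => 2, skip_tail => 2 );
my @by_regex = data_lines( \@file, start => qr/^-+$/, stop => qr/^-+$/ );
```

Both calls leave the same two data rows, ready to hand to whatever does the unpacking.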

(the parsers I'm writing are for people to parse scientific catalogs, to keep SQL databases synced up with the authoritative records, and I try to keep the necessary install to a minimum ... I don't even require DBD -- I generate CSV files and the necessary load routines for the database ... but that's also a performance tuning issue.

For anyone who's going to actually stay through the end of the SPD / AGU joint session, I'm giving a talk late on Friday on the work, and although I'll touch on the issues with parsing, my bigger issue is assigning semantic meanings to the columns, so that catalogs can be cross-correlated in a meaningful way.)

In reply to Re^3: New module -- Fixed width data solution by jhourcle
in thread New module -- Fixed width data solution by EvanCarroll
