I've done similar stuff many times over, though from your description it seems that you've done it much more often than me :-). I can certainly relate to the feeling that the repetition is bothersome, but often not quite bothersome enough to attack the problem properly.

It seems to me that the only truly common code is "parse this datasource into a stream of records", where "record" is a list of consistently sequenced fields corresponding to a table definition.

To you that's not much, but for others that's enough to start a new hype around "map/reduce". The parsing step is basically a "map", and the filtering and aggregation are a "reduce".
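To make the map/reduce analogy concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the pipe delimiter, the field names in `FIELDS`, and the helper names `parse_records` and `total_by` are all hypothetical and not taken from anyone's actual filter.pl.

```python
import csv
import io
from collections import defaultdict
from functools import reduce

# Hypothetical table definition: the consistently sequenced field names.
FIELDS = ("region", "product", "amount")

def parse_records(lines, fields=FIELDS, delimiter="|"):
    """The "map" step: turn each raw line into a record keyed by field name."""
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) != len(fields):
            continue  # skip malformed lines rather than guess at their layout
        yield dict(zip(fields, row))

def total_by(records, key, value, keep=lambda rec: True):
    """The "reduce" step: filter records, then sum `value` per `key`."""
    def step(totals, rec):
        if keep(rec):
            totals[rec[key]] += float(rec[value])
        return totals
    return dict(reduce(step, records, defaultdict(float)))

# Made-up sample data standing in for a real data source.
sample = io.StringIO(
    "east|widget|10.5\nwest|widget|4\neast|gadget|2.5\nbroken line\n"
)
totals = total_by(
    parse_records(sample),
    "region",
    "amount",
    keep=lambda rec: rec["product"] != "gadget",
)
print(totals)
```

Because `parse_records` is a generator, records are streamed one at a time, so the same shape should work on files far larger than memory. Only the field list and the `keep` predicate change from one throwaway filter to the next.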

As for your actual problem:

"Or if you have a database to put it in. ... what if all you have is a pair of files about 3gig each"

Can't you get a developer machine with a few hundred gig of free disc space, and set up your own private database into which you can import such files? I mean, come on, 2x 3gig ain't that much. The import will take some time, but you said yourself that time isn't the problem.
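A rough sketch of what such a private import could look like, using SQLite through Python's standard `sqlite3` module. SQLite is my substitution here, not something the thread prescribes. The table names, column names, pipe delimiter, and the comparison query are all hypothetical, and the tiny in-memory strings stand in for the real files.

```python
import csv
import io
import sqlite3

def import_file(conn, table, columns, lines, delimiter="|", batch_size=10_000):
    """Bulk-load delimited lines into `table`, committing once at the end."""
    col_list = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_list})")
    insert = f"INSERT INTO {table} ({col_list}) VALUES ({marks})"
    batch = []
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) != len(columns):
            continue  # skip malformed lines
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)
            batch.clear()
    if batch:
        conn.executemany(insert, batch)
    conn.commit()

# ":memory:" keeps the demo self-contained; pass a file path for real data.
conn = sqlite3.connect(":memory:")
old = io.StringIO("1|alice|10\n2|bob|20\n")
new = io.StringIO("1|alice|10\n2|bob|25\n3|carol|30\n")
import_file(conn, "old_dump", ("id", "name", "amount"), old)
import_file(conn, "new_dump", ("id", "name", "amount"), new)

# Illustrative query: rows that are new or whose amount changed.
changed = conn.execute(
    """SELECT n.id, o.amount, n.amount
       FROM new_dump AS n LEFT JOIN old_dump AS o ON o.id = n.id
       WHERE o.id IS NULL OR o.amount <> n.amount
       ORDER BY n.id"""
).fetchall()
print(changed)
```

The columns are declared without types, so every value is stored as the text read from the file. For real use you would probably declare types and add an index on the join column before running comparisons.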

Or maybe you want something like an SQL engine that works on in-memory objects? If yes, DBI::DBD::SqlEngine looks promising, though I've never used it before.


In reply to Re: The Eternal "filter.pl" by moritz
in thread The Eternal "filter.pl" by Voronich
