muyprofesional has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks!

I need to write a script to run SQL queries against a flat file whose structure has more than one field separator (and not only fixed-width fields), like this:

Jul 12 02:09:22 - TEST: user1, 15 - load | fail | 131

The file could reach 10 GB of data or more. The script must let me define the regex that extracts the fields (read from a config file, for example), something like this:

(.{15}) - (.*?): (.*?), (.*?) - (.*?) \| (.*?) \| (.*)
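A minimal sketch of applying such a configured pattern to a line, using the sample record above (in the real script the pattern string would come from the config file rather than being hard-coded):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In practice this string would be read from a config file.
my $pattern = qr/(.{15}) - (.*?): (.*?), (.*?) - (.*?) \| (.*?) \| (.*)/;

my $line = 'Jul 12 02:09:22 - TEST: user1, 15 - load | fail | 131';

# Capture groups become the record's fields.
if (my @fields = $line =~ $pattern) {
    print "$_\n" for @fields;
}
```

This prints the seven captured fields (timestamp, tag, user, number, operation, status, code), one per line.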

I've been testing some approaches, but I'm not sure which way is best (performance first):

- Parse the file and write it to a DBI-driven database (like this: http://perlmonks.com/?node_id=340569). It consumes too much disk space (it re-creates a database with the parsed data, and only then runs the query).

- Use DBD::RAM, load the database into memory while parsing the file, and then query it. I think this fails because of the in-memory data size. Could some disk-buffering resource be set up as it grows?

- Use a Text::CSV_XS with DBD::CSV approach, or DBD::AnyData. I think this is the right direction: don't replicate the data, and let the driver run the query and deal with the memory resources (I need to group and order result sets). But I haven't found any way to give DBD::AnyData a regex or field-delimiter option to handle a file format other than CSV, INI, etc. I've done tests with CSV files and the performance is far too low.

I really need a ray of light on this; any help would be appreciated. Thanks a lot in advance.

Replies are listed 'Best First'.
Re: SQL Query a Non-CSV Flat File - Any Suggestions? :S
by Corion (Patriarch) on Jul 15, 2010 at 15:27 UTC

    As another alternative, consider DBD::SQLite and a :memory: database. I'm not sure whether querying DBD::AnyData will actually be fast - certainly, it won't create or use any index for your columns.
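A minimal sketch of that alternative, assuming DBD::SQLite is installed; the table name, column names, and the GROUP BY query are illustrative, and the `<DATA>` handle stands in for the real multi-gigabyte file handle:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database; AutoCommit off so inserts go in one transaction.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE log (ts TEXT, tag TEXT, user TEXT, n INTEGER,
                            op TEXT, status TEXT, code INTEGER)');

# Same configurable pattern as in the question.
my $pattern = qr/(.{15}) - (.*?): (.*?), (.*?) - (.*?) \| (.*?) \| (.*)/;
my $ins = $dbh->prepare('INSERT INTO log VALUES (?,?,?,?,?,?,?)');

while (my $line = <DATA>) {          # would be the big file's handle
    my @fields = $line =~ $pattern or next;
    $ins->execute(@fields);
}
$dbh->commit;

# Grouping and ordering happen inside SQLite, not in Perl.
my $rows = $dbh->selectall_arrayref(
    'SELECT status, COUNT(*) FROM log GROUP BY status ORDER BY status');
print "@$_\n" for @$rows;

__DATA__
Jul 12 02:09:22 - TEST: user1, 15 - load | fail | 131
Jul 12 02:09:23 - TEST: user2, 16 - load | ok | 0
```

For a 10 GB input, `dbname=:memory:` can be swapped for an on-disk file to trade speed for memory, and a `CREATE INDEX` on the grouped column may help repeated queries.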

      I'll use a field as an index, but that would duplicate data, and I won't always use it. I think DBD::SQLite with an in-memory database could help my memory balancing. Thanks a lot!
Re: SQL Query a Non-CSV Flat File - Any Suggestions? :S
by TedPride (Priest) on Jul 15, 2010 at 19:18 UTC
    Depends on what you're trying to do with those SQL queries, doesn't it? If all you're doing is selecting certain records based on values or value ranges, then it's a simple matter of going through the file line by line, pattern matching the data into fields, and then matching against those fields. If on the other hand you need to add or remove records anywhere except at the very end of the file, then you have a problem. One easy solution would be to rent some cheap MySQL hosting, export your data over there, and query against it, rather than trying to do all the heavy lifting on your machine.
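The line-by-line filtering case described above can be sketched without any database at all; the filter condition (status eq 'fail') is illustrative, and `<DATA>` again stands in for the real file handle:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stream the file, parse each line with the configured regex, and
# print only the records that pass the filter -- constant memory use.
my $pattern = qr/(.{15}) - (.*?): (.*?), (.*?) - (.*?) \| (.*?) \| (.*)/;

while (my $line = <DATA>) {
    my ($ts, $tag, $user, $n, $op, $status, $code) = $line =~ $pattern
        or next;                     # skip lines that don't match
    print $line if $status eq 'fail';
}

__DATA__
Jul 12 02:09:22 - TEST: user1, 15 - load | fail | 131
Jul 12 02:09:23 - TEST: user2, 16 - load | ok | 0
```

This handles plain SELECT-with-WHERE workloads in a single pass; it does not cover GROUP BY or ORDER BY, which is where a real engine like SQLite earns its keep.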
      I'll only do SELECTs. For simple ones (no grouping or ordering) you're right: just parse, apply filters to the fields, and write the result set. But I need grouping and ordering, and I think exporting GBs of data to a host is impossible. Thanks!