in reply to Best way to store/access large dataset?

Re^2: Best way to store/access large dataset?
by stonecolddevin (Parson) on Jun 22, 2018 at 17:41 UTC

    SQLite is the last thing that should come to mind when dealing with large datasets. I can't imagine why you would even think about using SQLite over Postgres or, depending on your needs, loading the data into S3 and querying it with Athena, or using DynamoDB. There's a wealth of technology out there made specifically for processing vast amounts of data and running calculations on it relatively cheaply. SQLite is not one of them.

    Three thousand years of beautiful tradition, from Moses to Sandy Koufax, you're god damn right I'm living in the fucking past

      I wish I knew more on this subject. If it makes a difference, there won't be any writing to the database during all of this. It is purely read-only. And from what little I've read, Postgres isn't recommended when you're looking for speed in purely read operations?
      The database doesn't exist outside of its current test iteration, so there is still time to change it. But I wouldn't know what would be a better option.

        I'd be curious to see what you read that said Postgres wasn't recommended for lots of read operations. I don't think I've ever heard that before.

        If you're deeply invested in MariaDB, it's probably fine. MySQL has a lot of pitfalls, but people use it in large-scale cases all the time.

        Regardless, my personal preference is Postgres. I don't think there would be any issues using it for high read volume or processing a large number of calculations, but it depends on what kind of traffic it's going to be taking. If it's a really specialized case, it's probably worth looking into some ETL (extract/transform/load) on AWS using EMR (Elastic MapReduce) and/or Athena.

        The key things here are how much data you're dealing with, how many calculations you need to perform, and how resource-intensive those calculations are. I think Postgres will be just fine up to several million rows, but if you're doing a ton of joining it might get hairy, and it might be better to spread the work out a bit.

Re^2: Best way to store/access large dataset?
by Speed_Freak (Sexton) on Jun 22, 2018 at 15:20 UTC

    EDIT: I realized that your response was to the title of the thread, so I should clarify that I was really asking about the best way to read that database data into Perl to manipulate it, and whether it would need to be stored in a file instead of memory.

    The database that is (or soon will be) housing the data is MariaDB, and I think getting that data will be fairly easy. The interface will allow the user to select the items and categories of interest, which will then trigger the script to create the attribute list by using a series of qualifiers in the SELECT statement. (I'm way oversimplifying, but the database isn't ready for me to even start figuring out how that's going to look.)
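
    For what it's worth, here is a rough sketch of how user-selected qualifiers might be turned into a SELECT statement. Everything in it is an assumption made for illustration: the `measurements` table, its columns, and the `build_query` helper are hypothetical, and an in-memory SQLite table in Python stands in for the real MariaDB schema. The point is only the pattern: values go in as bound placeholders, while column names and operators are checked against a whitelist because they cannot be bound. Perl's DBI supports the same `?` placeholder style.

```python
import sqlite3

# Hypothetical whitelist: column names and operators cannot be bound as
# parameters, so only known-safe ones are allowed into the SQL text.
ALLOWED_COLUMNS = {"category", "item", "value"}
ALLOWED_OPERATORS = {"<", "<=", ">", ">=", "="}


def build_query(qualifiers):
    """Turn [(column, operator, value), ...] into (sql, params)."""
    clauses, params = [], []
    for column, operator, value in qualifiers:
        if column not in ALLOWED_COLUMNS or operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported qualifier: {column} {operator}")
        clauses.append(f"{column} {operator} ?")
        params.append(value)
    sql = "SELECT item, category, value FROM measurements"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Tiny in-memory table standing in for the real database (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (item TEXT, category TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("a1", "red", 0.9), ("a2", "red", 0.2), ("a3", "blue", 0.7)],
)

# Two user-chosen qualifiers, combined with AND.
sql, params = build_query([("category", "=", "red"), ("value", ">", 0.5)])
rows = conn.execute(sql, params).fetchall()
```

    More elaborate qualifiers (percentages of other values, for example) would need more than this simple column/operator/value triple, but the whitelist-plus-placeholders idea should carry over.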

    That initial pull of data will be around 1.8 billion calculations if the qualifiers are relatively simple. The qualifiers are user-definable, so they could range from simple greater-than/less-than comparisons to various combinations of percentages of different values from the database.
    Following that comes this script, which will ultimately perform an additional ~49 million calculations on the summary table to find the unique attributes. (A chain of greater-than/less-than qualifiers based on attribute count and category count for each attribute in each category.)
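
    To make the second step a little more concrete, here is a minimal sketch under a loudly stated assumption: the actual uniqueness rule isn't spelled out here, so the code assumes an attribute counts as unique to a category when it has at least `min_hits` hits in that category and at most `max_other_hits` hits in every other category. The function name and both thresholds are hypothetical. It is written in Python as a stand-in, and it consumes rows one at a time, so only the per-(attribute, category) counts are held in memory rather than the whole result set.

```python
from collections import Counter


def unique_attributes(rows, min_hits, max_other_hits):
    """rows: iterable of (attribute, category) pairs, one per qualifying hit.

    Returns {category: sorted attributes} under an ASSUMED rule: an attribute
    is unique to a category if it has at least min_hits hits there and at
    most max_other_hits hits in every other category.
    """
    counts = Counter()   # (attribute, category) -> hit count
    categories = set()
    for attribute, category in rows:   # rows may be a database cursor
        counts[(attribute, category)] += 1
        categories.add(category)

    attributes = {attribute for attribute, _ in counts}
    result = {category: [] for category in categories}
    for attribute in attributes:
        for category in categories:
            if counts[(attribute, category)] < min_hits:
                continue
            others = (
                counts[(attribute, other)]
                for other in categories
                if other != category
            )
            if all(n <= max_other_hits for n in others):
                result[category].append(attribute)
    return {category: sorted(found) for category, found in result.items()}


# Made-up sample hits: x is mostly red, y is only blue, z is in both.
sample = [
    ("x", "red"), ("x", "red"), ("x", "blue"),
    ("y", "blue"), ("y", "blue"),
    ("z", "red"), ("z", "red"), ("z", "blue"), ("z", "blue"),
]
found = unique_attributes(sample, min_hits=2, max_other_hits=1)
```

    With these sample rows, `x` comes out unique to `red` and `y` to `blue`, while `z` is rejected because it is equally common in both. If the counts can be produced by a GROUP BY in the database itself, even less data has to cross the wire.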

    While a spreadsheet can indeed handle this second lift, it takes quite a while and isn't automated. (All of my proof-of-concept work has been done in 64-bit Excel, which takes about 45 minutes to apply all of the calculations.)

    I've had a colleague trying to tackle this in R as well, but he's having limited success due to the data size and R's memory usage. I know he is making headway, but it's not his primary task. And my limited knowledge of Perl is still a hundredfold more than my nonexistent knowledge of R.

    I may be wrong, but I see the whole chain of scripts taking quite a bit of time, so I'm wanting to streamline as much as possible in anticipation of having users stack up query requests, with each request being unique.