in reply to Best way to store/access large dataset?

This is essentially about a data structure to hold the data you need.

Here is one way to do it .. I was not clear on what analysis you wanted so it is a simple sum:

use strict;
use warnings;

my %info;

open my $id_file, "<", "data01.txt" or die "NO ID FILE: $!";
while (<$id_file>) {
    chomp;
    my ($fileext, $name) = split;
    my ($column) = $fileext =~ m/^(\d+)/ or next;
    $info{$fileext}{NAME}      = $name;
    $info{$fileext}{COLUMN}    = $column;
    $info{$fileext}{ATT_COUNT} = 0;
}
close $id_file;

my @column_att;
for (keys %info) {
    $column_att[ $info{$_}{COLUMN} ] = \$info{$_}{ATT_COUNT};    # For efficiency
}

open my $attr_file, "<", "data02.txt" or die "NO ATTR FILE: $!";
while (<$attr_file>) {
    chomp;
    next unless m/^\d+\s+[01\s]+$/;
    my @atts = split;
    for (my $i = 1; $i <= $#atts; $i++) {
        ${ $column_att[$i] } += $atts[$i];
    }
    #$. < 4 and print "$. --- \n", Print_info();
}
close $attr_file;

Print_info();
exit 0;

#--------------------------
sub Print_info {
    for (sort { $info{$a}{COLUMN} <=> $info{$b}{COLUMN} } keys %info) {
        print "$_ \t", $info{$_}{ATT_COUNT}, " \t $info{$_}{NAME}\n";
    }
}

OUTPUT:

1.file.ext     10    Square
2.file.ext     12    Triangle
3.file.ext     15    Circle
4.file.ext     12    Square
5.file.ext     12    Triangle
6.file.ext     15    Circle
7.file.ext     15    Circle
8.file.ext     17    Rectangle
9.file.ext     17    Rectangle
10.file.ext    15    Circle
11.file.ext    12    Triangle
12.file.ext    12    Triangle
13.file.ext    12    Square
14.file.ext    17    Rectangle
15.file.ext    17    Rectangle
16.file.et     12    Square
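
The "For efficiency" step in the code stores a reference to each file's ATT_COUNT in an array indexed by column number, so the summing loop can update the count without a hash lookup by file name. A minimal sketch of that idea on its own, using a made-up single entry:

```perl
use strict;
use warnings;

# One illustrative entry; the real %info is filled from data01.txt.
my %info = ( '1.file.ext' => { COLUMN => 1, ATT_COUNT => 0 } );

# Slot 1 of @column_att holds a reference to that entry's ATT_COUNT.
my @column_att;
$column_att[1] = \$info{'1.file.ext'}{ATT_COUNT};

# Adding through the reference changes the value stored inside %info.
${ $column_att[1] } += 3;

print $info{'1.file.ext'}{ATT_COUNT}, "\n";    # prints 3
```

Because the array slot and the hash element refer to the same scalar, Print_info can read the totals straight from %info afterwards.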

                Memory fault   --   brain fried

Re^2: Best way to store/access large dataset?
by Speed_Freak (Sexton) on Jun 22, 2018 at 14:18 UTC

    Thanks for the response! I'm currently playing with your code, trying to get it to work on my dataset. Right now it just returns to the prompt after about 2.5 minutes without displaying anything. (I vaguely remember something about there being an issue with creating a text file in Windows and then reading it in while working in Linux?)
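
If Windows line endings are a concern, note that chomp on Linux removes only the trailing "\n" and leaves the "\r" in place. A minimal sketch of a helper that strips either ending (the name strip_eol is hypothetical, not part of the code above):

```perl
use strict;
use warnings;

# Hypothetical helper: remove a trailing LF or CRLF from one line.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

print "[", strip_eol("1.file.ext Square\r\n"), "]\n";
```

In the posted code the fields are produced by a bare split, which already discards trailing whitespace including "\r", so line endings may well not be the cause here.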

    Once I figure out what I'm doing wrong, I'm going to attempt modifying it to create individual totals for each attribute by category. So the end output would be a list of each attribute in the first column, then each category would be listed across the top, then the totals for each attribute in each category would fill in the table.

    Like so:

    #Table  Square  Circle  Triangle  Rectangle
    1       4       4       0         4
    2       4       4       0         0
    3       0       0       4         4
    4       0       4       4         4
    5       0       0       4         0
    6       0       0       0         4
    7       0       4       0         4
    8       4       4       0         4
    9       0       0       0         4
    10      0       0       4         0
    11      0       0       4         4
    12      4       4       4         0
    13      0       4       0         0
    14      0       4       0         4
    15      0       4       0         0
    16      4       0       0         0
    17      4       0       0         0
    18      0       4       0         0
    19      4       4       4         4
    20      0       4       4         4
    21      0       0       0         4
    22      4       4       4         4
    23      4       4       4         4
    24      0       0       0         0
    25      0       0       0         0
    26      4       4       4         0
    27      0       0       4         0
    28      3       0       0         4
    29      0       0       0         4
    30      3       0       0         4
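
A table like this could be built with a hash of hashes keyed by attribute and then by category. The following is only a sketch under assumptions: each attribute row starts with the attribute number followed by one 0/1 flag per file column, and the array passed as the first argument (hypothetical here, it would be derived from the ID file) gives the category name for each column number. The name tally_by_category is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count the 1-flags per attribute and per category.
# $category_of: array ref, index = column number (1-based), value = category.
# $rows: array ref of lines such as "1 1 0 1 1" (attribute id, then flags).
sub tally_by_category {
    my ($category_of, $rows) = @_;
    my %total;    # $total{attribute}{category} = number of 1-flags
    for my $line (@$rows) {
        next unless $line =~ m/^\d+\s+[01\s]+$/;
        my ($attr, @flags) = split ' ', $line;
        for my $i (0 .. $#flags) {
            my $cat = $category_of->[ $i + 1 ];    # columns are 1-based
            next unless defined $cat;
            $total{$attr}{$cat} += $flags[$i];
        }
    }
    return \%total;
}

# Made-up sample data: four columns, two attribute rows.
my @category_of = (undef, 'Square', 'Triangle', 'Circle', 'Square');
my $total = tally_by_category(\@category_of, [ '1 1 0 1 1', '2 0 1 0 1' ]);

my @cats = qw(Square Circle Triangle);
print join("\t", '#Table', @cats), "\n";
for my $attr (sort { $a <=> $b } keys %$total) {
    print join("\t", $attr, map { $total->{$attr}{$_} // 0 } @cats), "\n";
}
```

Each row of the attribute file is still read only once, so the cost stays close to that of the simple sum.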

    The ultimate goal is to pull data from a database, create the binaries on the fly through a series of calculations, and then use this script to determine the next series of data points to pull from the database. (This serves as a filter.) With the database connections in mind, it seems like using threads to speed this up would not be recommended. So do you see a way to fork this? Or would forking not help in this case? I think I read that forking will chew up some more memory, but I think I can handle that overhead. (I have 20 cores/40 threads and 192GB RAM to work with.)

      The ultimate goal of this will to be pulling data from a database and creating the binaries on the fly through a series of calculations, and then using this script to determine the next series of data points to pull from the database. (This serves as a filter.)

      I have to ask:

      "ultimate goal is pulling data from a database"? Then why were you talking about these .txt files in the OP?

      "creating the binaries"? What are "binaries"?

      Why pull data from a database to do "calculations" (apparently external from the db) when a database can do efficient calculations for you?

        I missed this response, but I think I've answered the questions throughout the post. But if not I'll give it a shot now.

        The database doesn't exist yet, and I need to do the work as a proof of concept. So once the database exists, the script will be changed to point there instead of the files.
        Binaries are just a presence/absence representation of an attribute. They are calculated from a series of raw values by evaluating the relationships of those values a few different ways.
        I'm all for the database doing the calculations if it can. I'm in completely unfamiliar territory here, so recommendations are appreciated.

      The problem lies in my actual file names and the way the column variable is assigned. I have a couple types of file formats unfortunately...

      Type 1 = combinationoftextnumbersandcharacters.fileextension
      Type 2 = combinationoftextnumbersandcharacters.combinationoftextnumbersandcharacters.fileextension

      In either case, only the first block is needed. The second block in Type 2 can be ignored as well as the file extension for both.
      I'm going to look at regular expressions and try to make that work.
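
A minimal sketch of one regular expression that keeps only the first block for both file name types, assuming the blocks are separated by dots and the first block itself contains no dot. The helper name first_block and the sample names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return everything before the first dot in a file name.
sub first_block {
    my ($filename) = @_;
    my ($block) = $filename =~ m/^([^.]+)/;
    return $block;    # undef if the name is empty or starts with a dot
}

print first_block('abc123.ext'), "\n";           # Type 1
print first_block('abc123.def456.ext'), "\n";    # Type 2
```

Both calls return the same leading block, so Type 1 and Type 2 names need no separate handling.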

        I was able to read in the ID file by doing the following:

        my @split_names = split(/\./,$fileext);
        my ($column) = $split_names[0];

        But that only creates problems in the follow-on summation block.
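
The summation block uses COLUMN as an array index, so a text value there no longer points at a distinct slot: Perl treats a non-numeric string as index 0 and warns about it. One possible way around this, sketched under the assumption (not confirmed anywhere in this thread) that line N of the ID file describes column N of the attribute file, is to keep the first block only as the hash key and number the columns by line order. The helper name build_info and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: build the info hash from ID-file lines.
# ASSUMPTION: the Nth usable line describes the Nth attribute column.
sub build_info {
    my (@lines) = @_;
    my (%info, $column);
    for my $line (@lines) {
        my ($fileext, $name) = split ' ', $line;
        next unless defined $name;                      # skip short/blank lines
        my ($key) = $fileext =~ m/^([^.]+)/ or next;    # first block only
        $info{$key} = { NAME => $name, COLUMN => ++$column, ATT_COUNT => 0 };
    }
    return \%info;
}

my $info = build_info('abc1.ext Square', 'xyz2.run7.ext Circle');
print "$_ -> column $info->{$_}{COLUMN}\n"
    for sort { $info->{$a}{COLUMN} <=> $info->{$b}{COLUMN} } keys %$info;
```

With COLUMN numeric again, the array of references and the summing loop can stay as they are. If two files share the same first block, the later one overwrites the earlier, so that case would need a different key.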