Re: Best way to store/access large dataset?

This is essentially about a data structure to hold the data you need.

Here is one way to do it. I was not clear on what analysis you wanted, so this just computes a simple sum for each file (column):

use strict;
use warnings;

my %info;
open my $id_file,"<","data01.txt" or die "NO ID FILE: $!";

while(<$id_file>){
    chomp;
    my ($fileext,$name) = split;
    my ($column) = $fileext=~m/^(\d+)/ or next;
    $info{$fileext}{NAME}      = $name;
    $info{$fileext}{COLUMN}    = $column;
    $info{$fileext}{ATT_COUNT} = 0;
}
close $id_file;

my @column_att;
for (keys %info) {
    $column_att[ $info{$_}{COLUMN} ] = \$info{$_}{ATT_COUNT}; # For efficiency
}
open my $attr_file,"<","data02.txt" or die "NO ATTR FILE: $!";

while (<$attr_file>){
    chomp;
    next unless m/^\d+\s+[01\s]+$/;
    my @atts = split;
    for (my $i=1; $i<=$#atts;$i++){
        ${$column_att[$i]} += $atts[$i];
    }
    #$. < 4 and print "$. --- \n", Print_info();
}
close $attr_file;

Print_info();
exit 0;
#--------------------------
sub Print_info{
    for (sort {$info{$a}{COLUMN} <=> $info{$b}{COLUMN}} keys %info){
        print "$_  \t",$info{$_}{ATT_COUNT},
              "   \t $info{$_}{NAME}\n";
    }
}

OUTPUT:

1.file.ext      10        Square
2.file.ext      12        Triangle
3.file.ext      15        Circle
4.file.ext      12        Square
5.file.ext      12        Triangle
6.file.ext      15        Circle
7.file.ext      15        Circle
8.file.ext      17        Rectangle
9.file.ext      17        Rectangle
10.file.ext      15        Circle
11.file.ext      12        Triangle
12.file.ext      12        Triangle
13.file.ext      12        Square
14.file.ext      17        Rectangle
15.file.ext      17        Rectangle
16.file.et      12        Square

Memory fault -- brain fried


Re^2: Best way to store/access large dataset? by Speed_Freak (Sexton) on Jun 22, 2018 at 14:18 UTC
Thanks for the response! I'm currently playing with your code, trying to get it to work on my dataset. Currently it just returns to the prompt after about 2.5 minutes without displaying anything. (I vaguely remember something about there being an issue with creating a text file in Windows and then trying to read it in while working in Linux?)

Once I figure out what I'm doing wrong, I'm going to attempt modifying it to create individual totals for each attribute by category. The end output would list each attribute in the first column and each category across the top, with the totals for each attribute in each category filling in the table. Like so:

    #Table  Square  Circle  Triangle  Rectangle
    1       4       4       0         4
    2       4       4       0         0
    3       0       0       4         4
    4       0       4       4         4
    5       0       0       4         0
    6       0       0       0         4
    7       0       4       0         4
    8       4       4       0         4
    9       0       0       0         4
    10      0       0       4         0
    11      0       0       4         4
    12      4       4       4         0
    13      0       4       0         0
    14      0       4       0         4
    15      0       4       0         0
    16      4       0       0         0
    17      4       0       0         0
    18      0       4       0         0
    19      4       4       4         4
    20      0       4       4         4
    21      0       0       0         4
    22      4       4       4         4
    23      4       4       4         4
    24      0       0       0         0
    25      0       0       0         0
    26      4       4       4         0
    27      0       0       4         0
    28      3       0       0         4
    29      0       0       0         4
    30      3       0       0         4

The ultimate goal is to pull data from a database, create the binaries on the fly through a series of calculations, and then use this script to determine the next series of data points to pull from the database. (This serves as a filter.) But with the database connections in mind, it seems like using threads to speed this up would not be recommended. So do you see a way to fork this? Or would forking not help in this case? I think I read that forking will chew up some more memory, but I think I can handle that overhead. (I have 20 cores/40 threads and 192GB RAM to work with.)
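
For what it's worth, the per-attribute, per-category table described above could be tallied along these lines. This is only a minimal sketch on in-memory sample data, not the code from the parent post: the helper name tally_by_category, the column-to-category list, and the two sample rows are all made up for illustration, and it assumes each data row is an attribute id followed by one 0/1 value per file column.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post).
# $col_category->[$i] is the category of data column $i; index 0 is the
# attribute id column, so it stays undef.
sub tally_by_category {
    my ($col_category, $lines) = @_;
    my %total;    # $total{$attribute}{$category} = number of 1s seen
    for my $line (@$lines) {
        # split ' ' skips leading whitespace and splits on any whitespace
        my ($attr, @bits) = split ' ', $line;
        next unless defined $attr && $attr =~ /^\d+$/;
        for my $i (0 .. $#bits) {
            my $cat = $col_category->[$i + 1];
            next unless defined $cat;    # column with no known category
            $total{$attr}{$cat} += $bits[$i];
        }
    }
    return \%total;
}

# Made-up sample: four file columns in three categories, two attributes.
my @col_category = (undef, qw(Square Circle Square Triangle));
my @lines        = ('1 1 0 1 0', '2 0 1 1 1');
my $total        = tally_by_category(\@col_category, \@lines);

# Print attributes down the side and categories across the top.
my @cats = qw(Square Circle Triangle);
print join("\t", '#Table', @cats), "\n";
for my $attr (sort { $a <=> $b } keys %$total) {
    print join("\t", $attr, map { $total->{$attr}{$_} // 0 } @cats), "\n";
}
```

Because split ' ' treats a carriage return as whitespace, a trailing "\r" left over from a file written on Windows would be dropped from the last field here, which may or may not be related to the problem mentioned above.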
Re^3: Best way to store/access large dataset? by erix (Prior) on Jun 22, 2018 at 19:53 UTC
> The ultimate goal of this will to be pulling data from a database and creating the binaries on the fly through a series of calculations, and then using this script to determine the next series of data points to pull from the database. (This serves as a filter.)

I have to ask: "ultimate goal is pulling data from a database"? Then why were you talking about these .txt files in the OP? "creating the binaries"? What are "binaries"? Why pull data from a database to do "calculations" (apparently external to the db) when a database can do efficient calculations for you?
Re^4: Best way to store/access large dataset? by Speed_Freak (Sexton) on Jun 28, 2018 at 19:09 UTC
I missed this response, but I think I've answered the questions throughout the thread. If not, I'll give it a shot now. The database doesn't exist yet, and I need to do the work as a proof of concept. Once the database exists, the script will be changed to point there instead of at the files. Binaries are just a presence/absence representation of an attribute. They are calculated from a series of raw values by evaluating the relationships between those values in a few different ways. I'm all for the database doing the calculations if it can. I'm in completely unfamiliar territory here, so recommendations are appreciated.
Re^3: Best way to store/access large dataset? by Speed_Freak (Sexton) on Jun 22, 2018 at 15:29 UTC
The problem lies in my actual file names and the way the column variable is assigned. Unfortunately, I have a couple of file name formats:

    Type 1 = combinationoftextnumbersandcharacter.fileextension
    Type 2 = combinationoftextnumbersandcharacter.combinationoftextnumbersandcharacter.fileextension

In either case, only the first block is needed. The second block in Type 2 can be ignored, as can the file extension for both. I'm going to look at regular expressions and try to make that work.
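
One possible regular expression for pulling out just the first block is sketched below. The helper name first_block and the sample file names are invented for illustration, and it assumes the first block never contains a dot itself.

```perl
use strict;
use warnings;

# Hypothetical helper: returns everything before the first dot in a name.
sub first_block {
    my ($filename) = @_;
    # [^.]+ matches one or more non-dot characters from the start of the name
    my ($block) = $filename =~ m/^([^.]+)\./;
    return $block;    # undef if the name contains no dot
}

# Made-up names in the two formats described above.
for my $name ('abc123.ext', 'x9_7.run2.ext') {
    printf "%-16s -> %s\n", $name, first_block($name) // '(no match)';
}
```

Note that this only produces a hash key. It is text, not a number, so it cannot stand in for the numeric column index that the original code took from the leading digits of the file name.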
Re^4: Best way to store/access large dataset? by Speed_Freak (Sexton) on Jun 22, 2018 at 16:11 UTC
I was able to read in the ID file by doing the following:

    my @split_names = split(/\./,$fileext);
    my ($column) = $split_names[0];

But that only creates problems in the follow-on summation block.
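
The likely reason the summation block breaks is that the original code uses COLUMN as a numeric array index in $column_att[ $info{$_}{COLUMN} ]; once $column holds text, Perl numifies it (typically to 0, with a warning), so the counters no longer line up with the data columns. One way around that is to take the column number from the line's position in the ID file instead. The sketch below assumes, and this is only an assumption, that the ID file lists the files in the same order as the columns of the attribute file; the sample lines are made up.

```perl
use strict;
use warnings;

# Hypothetical ID lines: "<file name> <category>", assumed to be in column order.
my @id_lines = (
    'abc123.ext Square',
    'x9_7.run2.ext Circle',
    'zz-4.ext Square',
);

my (%info, @column_att);
my $column = 0;
for my $line (@id_lines) {
    my ($fileext, $name) = split ' ', $line;
    next unless defined $name;
    my ($key) = split /\./, $fileext;    # first dot-separated block
    $column++;                           # position in the file, not parsed from the name
    $info{$key} = { NAME => $name, COLUMN => $column, ATT_COUNT => 0 };
    $column_att[$column] = \$info{$key}{ATT_COUNT};
}

# Made-up attribute rows: attribute id, then one 0/1 value per column.
for my $row ('1 1 0 1', '2 1 1 0') {
    my @atts = split ' ', $row;
    for my $i (1 .. $#atts) {
        next unless $column_att[$i];     # skip columns with no ID entry
        ${ $column_att[$i] } += $atts[$i];
    }
}

for my $key (sort { $info{$a}{COLUMN} <=> $info{$b}{COLUMN} } keys %info) {
    print "$key\t$info{$key}{ATT_COUNT}\t$info{$key}{NAME}\n";
}
```

If the attribute file has a header row naming the files, mapping names to column positions from that header would be a safer source than line order.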