
If the result you want is along these lines

Attribute : 1
Category :     Circle Rectangle    Square  Triangle         
Sum      :          4         4         4         0
Count    :          4         4         4         4
Percent  :    100.00%   100.00%   100.00%     0.00%

Attribute : 2
Category :     Circle Rectangle    Square  Triangle         
Sum      :          4         0         4         0
Count    :          4         4         4         4
Percent  :    100.00%     0.00%   100.00%     0.00%

Attribute : 3
Category :     Circle Rectangle    Square  Triangle         
Sum      :          0         4         0         4
Count    :          4         4         4         4
Percent  :      0.00%   100.00%     0.00%   100.00%

then try this

#!/usr/bin/perl
use strict;
use warnings;
#use Data::Dump 'pp';
my $t0 = time; # start

# load categ look up
my $fileID = 'Attributes_ID.txt';
open IN,'<',$fileID or die "$!";
my %id2categ = ();
my $count = 0;
while (<IN>){
  chomp;
  next unless /^\d/; # skip junk
  #1.file.ext Square
  my ($id,$cat) = split /\s+/,$_;
  $id2categ{$id} = $cat;
  ++$count;
}
close IN;
print "$fileID : $count records loaded\n";
#pp \%id2categ;

# read header to get fileid for each column
my $fileA = 'Attributes.txt';
open IN,'<',$fileA or die "$!";
chomp (my $line1 = <IN>);
my @fileid = split /\s+/,$line1;

# convert fileid to category
my (undef,@col2categ) = map{ $id2categ{$_} } @fileid;
#pp \@col2categ;

# count no of cols for each categ once
my %count = ();
$count{$_} += 1 for @col2categ;
#pp \%count;

# process each attribute in turn
my $PAGESIZE = 100_000; # show progress
open OUT,'>','report.txt' or die "$!";
my $total = 0;
$count = 0;
while (<IN>){
  chomp;
  next unless /^\d/; # skip junk
  my ($attr,@score) = split /\s+/,$_;

  # aggregate by category
  my %sum = ();
  for my $col (0..$#score){
    my $categ = $col2categ[$col];
    $sum{$categ} += $score[$col];
  }
  #pp \%sum;

  # calc pcent
  my %pcent;
  my @category = sort keys %count;
  for (@category){
    $pcent{$_} = sprintf "%9.2f%%",100*$sum{$_}/$count{$_} unless $count{$_} == 0;
    $sum{$_}   = sprintf "%10d",$sum{$_};
    $count{$_} = sprintf "%10d",$count{$_};
  }

  # output
  print OUT "\nAttribute : $attr\n";
  print OUT join "","Category : ",map{ sprintf "%10s",$_ } @category,"\n";
  print OUT join "","Sum      : ",@sum{@category},"\n";
  print OUT join "","Count    : ",@count{@category},"\n";
  print OUT join "","Percent  : ",@pcent{@category},"\n";

  # progress monitor
  if (++$count >= $PAGESIZE){
    $total += $count;
    $count = 0;
    print "Processed $total records\n";
  }
}
close IN;

$total += $count;
my $dur = time - $t0;
printf "%s records time = %s seconds\n",$total,$dur;
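
For testing, the two input files are assumed to look roughly like the ones this little helper writes; adjust the generator to match your real layout:

#!/usr/bin/perl
# maketestdata.pl - toy versions of the two input files in the
# format the script above expects
use strict;
use warnings;

# Attributes_ID.txt : one "<fileid> <category>" pair per line
my @categ = ('Circle','Rectangle','Square','Triangle');
open my $fh,'>','Attributes_ID.txt' or die "$!";
for my $n (1..12){
  printf $fh "%d.file.ext %s\n",$n,$categ[$n % 4];
}
close $fh;

# Attributes.txt : header row of file ids, then one row per
# attribute - the attribute id followed by a 0/1 score per file
open $fh,'>','Attributes.txt' or die "$!";
print $fh join("\t",'attr',map { "$_.file.ext" } 1..12),"\n";
for my $attr (1..3){
  print $fh join("\t",$attr,map { int rand 2 } 1..12),"\n";
}
close $fh;
print "test files written\n";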
poj

Re^10: Best way to store/access large dataset?
by Speed_Freak (Sexton) on Jun 28, 2018 at 19:53 UTC

    Oh wow, thanks for that! I had to alter my real-world input data format a bit to make this work, but it chews through 183 datasets in 103 seconds.

    My file names don't all start with numbers, and they follow a few different naming formats, including a few with pesky spaces in the name. So I just gave them arbitrary numbers from 1 to 183, and that solved my issues; a rough sketch of that renumbering step is below.
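
    This is just a sketch; filenames.txt and name_map.txt are made-up names:

    #!/usr/bin/perl
    # assign sequential ids to arbitrary file names
    use strict;
    use warnings;

    open my $in, '<','filenames.txt' or die "$!";
    open my $out,'>','name_map.txt'  or die "$!";
    my $n = 0;
    while (my $name = <$in>){
      chomp $name;
      next unless length $name;
      # spaces and odd prefixes don't matter; the numeric id is
      # all the aggregation script ever sees
      printf $out "%d\t%s\n",++$n,$name;
    }
    close $in;
    close $out;
    print "$n names numbered\n";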

    Now the fun begins of doing all the follow-on work to find the unique attributes!

Re^10: Best way to store/access large dataset?
by Speed_Freak (Sexton) on Aug 02, 2018 at 14:57 UTC

    Hey poj, I am finally getting back to this and had some questions.
    When it comes to the Count portion that tallies the number of items in each category: could that be computed and printed just once in the output, instead of being recomputed and reprinted for every attribute?
    Since the item counts are static, redoing them for every attribute isn't necessary, and printing them repeatedly adds a lot of bulk to the output. Something like the sketch below is what I have in mind.
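
    An untested sketch of just the changed region (the variables are the ones from poj's script above):

    my @category = sort keys %count;
    # print the static Category/Count block once, before the loop
    print OUT join "","Category : ",(map { sprintf "%10s",$_ } @category),"\n";
    print OUT join "","Count    : ",(map { sprintf "%10d",$count{$_} } @category),"\n";

    while (<IN>){
      chomp;
      next unless /^\d/; # skip junk
      my ($attr,@score) = split /\s+/,$_;
      my %sum = ();
      for my $col (0..$#score){
        $sum{ $col2categ[$col] } += $score[$col];
      }
      # %count is never overwritten, so no per-loop re-formatting needed
      print OUT "\nAttribute : $attr\n";
      print OUT join "","Sum      : ",(map { sprintf "%10d",$sum{$_} } @category),"\n";
      print OUT join "","Percent  : ",(map {
        $count{$_} ? sprintf "%9.2f%%",100*$sum{$_}/$count{$_} : '      n/a '
      } @category),"\n";
    }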

    Also, I can't visualize the data structure in my head well enough to continue on to the next part of the script.
    I would like to fork on each attribute and evaluate the percentages that were calculated.

    #FIND UNIQUE ATTRIBUTES
    use Parallel::ForkManager;

    my $keycat     = 'square';
    my $specialcat = 'triangle';

    my $pfm = Parallel::ForkManager->new(40);
    $pfm->run_on_finish( sub {
      my ($pid, $exit_code, $ident, $exit_signal, $core_dump,
          $data_structure_reference) = @_;
      my $data = ${ $data_structure_reference->{data} };
      # collect each child's results here
    });

    my @attributes; # attribute ids to evaluate, filled in from the report
    foreach my $attr (@attributes){
      my $pid = $pfm->start and next;
      my $data; # this child's results

      # Next steps:
      # Need to find unique attributes for each category.
      # Foreach attribute, foreach category: if Pcent Category > .5, AND < .1
      # for each Pcent Category that is not the first category, then
      # Attribute = TRUE or 1. (Every category needs to be evaluated for
      # each attribute.)
      # ** For an attribute to be unique it should exist in a category
      # greater than or equal to 50% of the time, but not exist in other
      # categories more than 10% of the time; or in the key category
      # ($keycat) greater than 1/n (n = number of key items in the key
      # category), and not in the specialized category ($specialcat) more
      # than 10% of the time (based on input for the specialized category).
      # **** The key category is static; it will have a label that doesn't
      # change. The specialized category will be defined before the script
      # runs (i.e. square, or circle, or triangle, etc.).
      # (See the is_unique sketch below for the threshold test.)
      # Create counts for how many "hits" exist per sample, per category,
      # in attributes identified as unique. (E.g. 200 attributes may be
      # found to be unique for the square category, but an item may only
      # have a portion of those attributes present.)
      # Calculate the average number of hits per category from the unique
      # attribute group. (Take the average hits from all of the members of
      # a category from the calculation above.)
      # Store values for future use in a follow-on script.

      $pfm->finish( 0, { data => \$data } );
    }
    $pfm->wait_all_children;
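
    To make the threshold rules concrete, here is my rough attempt at the test itself. is_unique is a made-up helper name, and %pcent is assumed to hold category => fraction (0 to 1) for the attribute being evaluated:

    # $pcent : hashref of category => fraction (0..1) for one attribute
    sub is_unique {
      my ($pcent,$cat,$keycat,$specialcat,$n_key) = @_;
      my $p      = $pcent->{$cat} // 0;
      my @others = grep { $_ ne $cat } keys %$pcent;

      # rule A : >= 50% in this category, <= 10% in every other one
      my $rule_a = $p >= 0.50
                && !grep { ($pcent->{$_} // 0) > 0.10 } @others;

      # rule B (key category only) : >= 1/n in the key category and
      # <= 10% in the specialized category
      my $rule_b = $cat eq $keycat && $n_key
                && $p >= 1/$n_key
                && ($pcent->{$specialcat} // 0) <= 0.10;

      return ($rule_a || $rule_b) ? 1 : 0;
    }

    # inside the fork, per attribute:
    # push @{ $unique{$cat} },$attr
    #   if is_unique(\%pcent,$cat,$keycat,$specialcat,$n_key);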

    The ultimate output would be, for each category, a list of the attributes found to be unique to that category, along with the average number of times those attributes are present in the items belonging to that category. That last step is roughly what the sketch below computes.
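
    All three structures here are made-up names for things the earlier steps would build (%unique: category => arrayref of unique attribute ids; %items_in: category => arrayref of items in that category; %hits: item => attribute => 0/1):

    for my $cat (sort keys %unique){
      my @items = @{ $items_in{$cat} || [] };
      next unless @items;
      my $total = 0;
      for my $item (@items){
        # how many of this category's unique attributes the item has
        $total += grep { $hits{$item}{$_} } @{ $unique{$cat} };
      }
      printf "%-10s : %d unique attributes, avg %.1f hits per item\n",
        $cat, scalar @{ $unique{$cat} }, $total/@items;
    }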