in reply to Re^6: Best way to store/access large dataset?
in thread Best way to store/access large dataset?

Hi Speed_Freak,

This line was not correct, sorry for that: $subres{ID} = $attrs->[0] ; — it should have been $subres{ID} = $attrs->[$j][0] ;. It doesn't matter much, though, because that line changes again in the code below.

I will give you some suggestions, commented in the code, on how to achieve the results you described, but you'll have to do some of the programming yourself:

use strict ;
use warnings ;
use Data::Dumper ;

open my $dataIn1, "<", "Attributes_ID.txt" or die "NO ID FILE: $!" ;
open my $dataIn2, "<", "Attributes.txt"    or die "NO ATTR FILE: $!" ;

my $data  = [] ;
my $attrs = [] ;

sub getdata {
    my ( $fileName, $type ) = split /\t/, $_[0] ;
    push @{$data}, $type if defined $fileName ;
}

sub getattrs {
    my @attrs = split /\t/, $_[0] ;
    push @{$attrs}, \@attrs if defined $attrs[0] ;
}

sub calcPercentages {
    # INPUT: Hash reference
    # Determine the total amount of attributes
    # Walk through each category: Circle, Triangle, ...
    # Take the hit count divided by the total amount of attributes (multiplied by 100?)
    # For each category add something to the hash to store the percentage,
    #   e.g. CircleChance, TriangleChance, ...
    # askQuestions could potentially be called here
}

sub askQuestions {
    # INPUT: Hash reference
    # my $h = ...
    # Question 1: Does this attribute occur in Circle more than 50% of the time,
    #             and less than 10% of the time in Triangle?
    # if ( $h->{ CircleChance } > 50 && $h->{ TriangleChance } < 10 ) {
    #     Do something here.
    #     E.g. store another result $h
    # }
}

while( <$dataIn1> ) { chomp ; getdata( $_ ) ; }
while( <$dataIn2> ) { chomp ; getattrs( $_ ) ; }

my @result ;
for( my $j = 0 ; $j < @{$attrs} ; ++$j ) {
    my %subres ;
    my $id = $attrs->[$j][0] ;
    @subres{@{$data}} = ( 0 ) x @{$attrs->[0]} ;
    for( my $i = 1 ; $i < @{$attrs->[$j]} ; ++$i ) {
        if ( $attrs->[$j][$i] == 1 ) {
            ++$subres{ $data->[$i-1] } ;
        }
    }
    # You could potentially start calculating hit count percentages per category here:
    calcPercentages( \%subres ) ;
    push @result, { $id => \%subres } ;
}
print Dumper(\@result) ;

The results now look like this, but more work is needed to calculate the hit count percentages (as indicated in the code above):

$VAR1 = [
          {
            '1' => {
                     'Circle' => 4,
                     'Rectangle' => 4,
                     'Square' => 4,
                     'Triangle' => 0
                   }
          },
          {
            '2' => {
                     'Circle' => 4,
                     'Square' => 4,
                     'Rectangle' => 0,
                     'Triangle' => 0
                   }
          }
        ];
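One possible way to fill in calcPercentages along the lines of the comments is sketched below. The divisor is passed in as a parameter because it is not yet settled whether to divide by the total attribute count or by a per-category count; the "<Category>Chance" key names are the ones suggested in the comments, and the sample data is hypothetical:

```perl
use strict;
use warnings;

# Sketch of calcPercentages: takes the %subres hash reference and a divisor,
# and adds a "<Category>Chance" entry per category holding the hit percentage.
sub calcPercentages {
    my ( $h, $total ) = @_;
    for my $cat ( keys %$h ) {
        next if $cat =~ /Chance$/;    # skip entries added by an earlier call
        $h->{ $cat . 'Chance' } = 100 * $h->{$cat} / $total;
    }
    return $h;
}

# Hypothetical example: the first result row from the dump above.
my %subres = ( Circle => 4, Rectangle => 4, Square => 4, Triangle => 0 );
calcPercentages( \%subres, 4 );
print "$subres{CircleChance}\n";      # 100
print "$subres{TriangleChance}\n";    # 0
```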

Re^8: Best way to store/access large dataset?
by Speed_Freak (Sexton) on Jun 27, 2018 at 14:10 UTC

    Thank you! I'll work on this and see if I can make some progress.

sub calcPercentages {
    # INPUT: Hash reference
    # Determine the total amount of attributes
    # Walk through each category: Circle, Triangle, ...
    # Take the hit count divided by the total amount of attributes (multiplied by 100?)

    For clarification, "# Take the hit count divided by the total amount of attributes (multiplied by 100?)" would actually be dividing by the total number of files in each category. So if there are four files in the square category (four instances of square in this case), then the percentage would be the hit count for the current attribute divided by the number of possible hits for that attribute in that category. So if attribute 1 has 3 total hits across the 4 occurrences of square, the percentage would be 3/4, or .75.
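    That calculation can be sketched as follows. The names %hits and %category_count are assumptions: %hits holds one attribute's hit counts per category, and %category_count holds the number of files per category (as counted from Attributes_ID.txt):

```perl
use strict;
use warnings;

# Hypothetical example data: hit counts for one attribute per category,
# and the number of files in each category (from Attributes_ID.txt).
my %hits           = ( Square => 3, Circle => 1 );
my %category_count = ( Square => 4, Circle => 4 );

# Percentage = hits for this attribute / number of files in the category.
my %pcent;
for my $cat ( keys %category_count ) {
    next if $category_count{$cat} == 0;    # guard against empty categories
    $pcent{$cat} = ( $hits{$cat} // 0 ) / $category_count{$cat};
}

printf "%s: %.2f\n", $_, $pcent{$_} for sort keys %pcent;   # Circle: 0.25, Square: 0.75
```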

      If the result you want is along these lines

      Attribute : 1
      Category :     Circle Rectangle    Square  Triangle         
      Sum      :          4         4         4         0
      Count    :          4         4         4         4
      Percent  :    100.00%   100.00%   100.00%     0.00%
      
      Attribute : 2
      Category :     Circle Rectangle    Square  Triangle         
      Sum      :          4         0         4         0
      Count    :          4         4         4         4
      Percent  :    100.00%     0.00%   100.00%     0.00%
      
      Attribute : 3
      Category :     Circle Rectangle    Square  Triangle         
      Sum      :          0         4         0         4
      Count    :          4         4         4         4
      Percent  :      0.00%   100.00%     0.00%   100.00%
      

      then try this

#!/usr/bin/perl
use strict;
use warnings;
#use Data::Dump 'pp';
my $t0 = time; # start

# load categ look up
my $fileID = 'Attributes_ID.txt';
open IN,'<',$fileID or die "$!";
my %id2categ = ();
my $count = 0;
while (<IN>){
  chomp;
  next unless /^\d/; # skip junk
  #1.file.ext Square
  my ($id,$cat) = split /\s+/,$_;
  $id2categ{$id} = $cat;
  ++$count;
}
close IN;
print "$fileID : $count records loaded\n";
#pp \%id2categ;

# read header to get fileid for each column
my $fileA = 'Attributes.txt';
open IN,'<',$fileA or die "$!";
chomp (my $line1 = <IN>);
my @fileid = split /\s+/,$line1;

# convert fileid to category
my (undef,@col2categ) = map{ $id2categ{$_} }@fileid;
#pp \@col2categ;

# count no of cols for each categ once
my %count=();
$count{$_} += 1 for @col2categ;
#pp \%count;

# process each attribute in turn
my $PAGESIZE = 100_000 ; # show progress
open OUT,'>','report.txt' or die "$!";
my $total = 0;
$count = 0;
while (<IN>){
  chomp;
  next unless /^\d/; # skip junk
  my ($attr,@score) = split /\s+/,$_;

  # aggregate by category
  my %sum=();
  for my $col (0..$#score){
    my $categ = $col2categ[$col];
    $sum{$categ} += $score[$col];
  }
  #pp \%sum;

  # calc pcent
  my %pcent;
  my @category = sort keys %count;
  for (@category){
    $pcent{$_} = sprintf "%9.2f%%",100*$sum{$_}/$count{$_} unless $count{$_}==0;
    $sum{$_}   = sprintf "%10d",$sum{$_};
    $count{$_} = sprintf "%10d",$count{$_};
  }

  # output
  print OUT "\nAttribute : $attr\n";
  print OUT join "","Category : ",map{ sprintf "%10s",$_} @category,"\n";
  print OUT join "","Sum      : ",@sum{@category},"\n";
  print OUT join "","Count    : ",@count{@category},"\n";
  print OUT join "","Percent  : ",@pcent{@category},"\n";

  # progress monitor
  if (++$count >= $PAGESIZE ){
    $total += $count;
    $count = 0;
    print "Processed $total records\n";
  }
}
close IN;
$total += $count;
my $dur = time-$t0;
printf "%s records time = %s seconds\n",$total,$dur;
      poj

        Oh wow, thanks for that! I had to alter my real world input data format a bit to make this work, but it chews through 183 datasets in 103 seconds.

        My file names don't all start with numbers, and are composed of a few different naming formats, including a few with pesky spaces in the name. So I just gave them arbitrary numbers from 1-183, and that solved my issues.

        Now the fun begins of doing all the follow on work to find the unique attributes!

        Hey poj, I am finally getting back to this and had some questions.
        When it comes to the Count portion that sums up the number of items in each category, could this be completed once and printed once in the output instead of done and printed for each attribute?
        Since the item counts are static, doing those counts for every attribute isn't necessary, and printing them adds a lot to the output.
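        Since the per-category counts never change, they can indeed be computed once before the attribute loop and printed a single time, e.g. as a report header. A minimal sketch (assumes %count has already been built once from the header row, as in poj's script; the sample values are hypothetical):

```perl
use strict;
use warnings;

# Static per-category column counts, built once from the header row.
my %count    = ( Circle => 4, Rectangle => 4, Square => 4, Triangle => 4 );
my @category = sort keys %count;

# Print the category names and counts once, before the per-attribute loop.
print join( "", "Category : ", map { sprintf "%10s", $_ } @category ), "\n";
print join( "", "Count    : ", map { sprintf "%10d", $count{$_} } @category ), "\n";

# ... the per-attribute loop then prints only the Sum and Percent lines ...
```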

        Also, I can't visualize the data structure in my head to continue on to the next part of the script.
        I would like to fork on each attribute, and evaluate the percentages that were calculated.

# FIND UNIQUE ATTRIBUTES
use Parallel::ForkManager;

my $keycat     = 'square';
my $specialcat = 'triangle';

my $pfm = Parallel::ForkManager->new(40);
$pfm->run_on_finish( sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump,
        $data_structure_reference) = @_;
    my $data = ${$data_structure_reference};
});

# foreach loop here (@attributes is a placeholder for the attribute list)
foreach my $attr (@attributes) {
    my $pid = $pfm->start and next;
    my $data = {};    # placeholder for this attribute's results

    # Next steps:
    # Need to find unique attributes for each category.
    # Foreach attribute, Foreach Category, If Pcent Category > .5 AND < .1
    #   Foreach Pcent Category that is not first category, then Attribute=TRUE
    #   or 1. (Every category needs to be evaluated for each attribute.)
    # **For an attribute to be unique it should exist in a category greater
    #   than or equal to 50% of the time, but not exist in other categories
    #   more than 10% of the time, or
    #   in the key category ($keycat) greater than 1/n (number of key items
    #   in key category), and not in the specialized category ($specialcat)
    #   greater than 10% of the time (based on input for specialized category)
    # ****Key category is static. It will have a label that doesn't change.
    #   The specialized category will be defined before the script runs.
    #   (i.e. square, or circle, or triangle...etc.)
    # Create counts for how many "hits" exist per sample, per category, in
    #   attributes identified as unique. (e.g. if 200 attributes are found to
    #   be unique for the square category, an item may only have a portion of
    #   those attributes present.)
    # Calculate the average amount of hits per category from the unique
    #   attribute group. (Take the average hits from all of the members of a
    #   category from the calculation above.)
    # Store values for future use in the follow-on script.

    $pfm->finish( 0, { data => \$data } );
}
$pfm->wait_all_children;

        The ultimate output would be a list, by category, of the attributes found to be unique to that category, and the average number of times those attributes are present across the items in that category.
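        The core uniqueness test described above (occurs at least 50% of the time in one category, no more than 10% of the time in every other) could be sketched like this; the function name unique_to and the sample percentages are assumptions, and the 1/n key-category rule and the $keycat/$specialcat special cases are left out:

```perl
use strict;
use warnings;

# Given one attribute's per-category percentages (as fractions), return the
# category the attribute is unique to, or undef if there is none. "Unique"
# here means: >= 50% in that category and <= 10% in every other category.
sub unique_to {
    my ($pcent) = @_;
    for my $cat ( keys %$pcent ) {
        next unless $pcent->{$cat} >= 0.5;
        my @others = grep { $_ ne $cat } keys %$pcent;
        return $cat unless grep { $pcent->{$_} > 0.1 } @others;
    }
    return undef;
}

# Hypothetical percentages for one attribute.
my %pcent = ( Square => 0.75, Circle => 0.05, Triangle => 0.00, Rectangle => 0.08 );
my $cat = unique_to( \%pcent );
print defined $cat ? "unique to $cat\n" : "not unique\n";   # prints "unique to Square"
```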

      So I added the following block to count up the instances of each category

my %categories = ();
while ( my $name = <$dataIn1> ) {
    chomp($name);
    my ($file, $category) = split(/\t/, $name);
    if (exists($categories{$category})) {
        $categories{$category} += 1;
    } else {
        $categories{$category} = 1;
    }
}
foreach my $category (sort { $categories{$b} <=> $categories{$a} } keys %categories) {
    printf "%s: %s\n", $category, $categories{$category};
}

      My variable naming is terrible because I picked that up from Stack Overflow and adjusted it to fit. But my question then becomes: how do I pluck data from this data structure, and from the %subres structure? Most of my limited playing has been with arrays, so I'm struggling when it comes to hashes.

      I know this isn't anywhere near correct, but could this be considered heading in the right direction?

      something like:

my $score = $subres->[2]
Foreach $category in $categories {where %subres [0] = $category}
($score = ($subres [1] / ($categories{$category}))

      If you can see through the non-existent formatting, my intent is to match each category from the categories list to its corresponding category in the attribute results list, then divide the result value by the category count value, and push that answer into the list.
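      That intent can be written with plain hash lookups, since both structures are keyed by category name. A sketch with hypothetical data (%subres as one attribute's hit sums, %categories as the counts from the block above):

```perl
use strict;
use warnings;

# Hypothetical inputs: per-category hit sums for one attribute (%subres-style),
# and per-category file counts (%categories-style).
my %subres     = ( Circle => 4, Rectangle => 4, Square => 3, Triangle => 0 );
my %categories = ( Circle => 4, Rectangle => 4, Square => 4, Triangle => 4 );

# For each category, divide the hit sum by that category's file count
# and push the [ category, score ] pair onto a results list.
my @scores;
for my $category ( sort keys %categories ) {
    next if $categories{$category} == 0;    # avoid division by zero
    my $score = ( $subres{$category} // 0 ) / $categories{$category};
    push @scores, [ $category, $score ];
}

printf "%-10s %.2f\n", @$_ for @scores;
```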