in reply to Best way to store/access large dataset?
I would suggest you try a two-step approach:
1. Get your data structures the way you want them.
2. Process your data.
For example, build something like this first to get the data. (I show a rather simple script here, but if your data is already coming from a database, why not pour it into the correct format right there?)
use strict ;
use warnings ;
use Data::Dumper ;

my $data = [] ;    # one type name per file, in file order
my $attrs = [] ;   # one array ref of 0/1 flags per attribute row

# Fields in the DATA section are tab-separated.
sub getData {
    my ( $fileName, $type ) = split /\t/, $_[1] ;
    push @{$data}, $type unless !defined $fileName ;
}

sub getAttrs {
    my @attrs = split /\t/, $_[1] ;
    shift @attrs ;    # drop the attribute number in the first column
    push @{$attrs}, \@attrs unless !defined $attrs[0] ;
}

# Gather data
my $context = 0 ;     # 1 = inside the ID section, 2 = inside the attribute section
my $counter = -1 ;    # -1 = the section's header row has not been skipped yet
while(<DATA>) {
    chomp ;
    if ( $_ =~ /ID\'s/ ) { $context = 1 ; $counter = -1 ; next ; }
    if ( $_ =~ /Attributes/ ) { $context = 2 ; $counter = -1 ; next ; }
    if ( $context == 1 && $counter == -1 ) {
        ++$counter ;
        next ;
    } elsif ( $context == 1 && $counter > -1 ) {
        getData($counter, $_) ;
        ++$counter ;
    }
    if ( $context == 2 && $counter == -1 ) {
        ++$counter ;
        next ;
    } elsif ( $context == 2 && $counter > -1 ) {
        getAttrs($counter, $_) ;
        ++$counter ;
    }
} ;

foreach ( @{$data } ) { print $_ . " " ; }
print "\n" ;
foreach ( @{$attrs->[0] } ) { print $_ . " " ; }
print "\n" ;

__DATA__
#ID's
File	ID
1.file.ext	Square
2.file.ext	Triangle
3.file.ext	Circle
4.file.ext	Square
5.file.ext	Triangle
6.file.ext	Circle
7.file.ext	Circle
8.file.ext	Rectangle
9.file.ext	Rectangle
10.file.ext	Circle
11.file.ext	Triangle
12.file.ext	Triangle
13.file.ext	Square
14.file.ext	Rectangle
15.file.ext	Rectangle
16.file.et	Square
#Attributes
attribute	1.file.ext	2.file.ext	3.file.ext	4.file.ext	5.file.ext	6.file.ext	7.file.ext	8.file.ext	9.file.ext	10.file.ext	11.file.ext	12.file.ext	13.file.ext	14.file.ext	15.file.ext	16.file.et
1	1	0	1	1	0	1	1	1	1	1	0	0	1	1	1	1
2	1	0	1	1	0	1	1	0	0	1	0	0	1	0	0	1
3	0	1	0	0	1	0	0	1	1	0	1	1	0	1	1	0
4	0	1	1	0	1	1	1	1	1	1	1	1	0	1	1	0
5	0	1	0	0	1	0	0	0	0	0	1	1	0	0	0	0
6	0	0	0	0	0	0	0	1	1	0	0	0	0	1	1	0
7	0	0	1	0	0	1	1	1	1	1	0	0	0	1	1	0
8	1	0	1	1	0	1	1	1	1	1	0	0	1	1	1	1
9	0	0	0	0	0	0	0	1	1	0	0	0	0	1	1	0
10	0	1	0	0	1	0	0	0	0	0	1	1	0	0	0	0
11	0	1	0	0	1	0	0	1	1	0	1	1	0	1	1	0
12	1	1	1	1	1	1	1	0	0	1	1	1	1	0	0	1
13	0	0	1	0	0	1	1	0	0	1	0	0	0	0	0	0
14	0	0	1	0	0	1	1	1	1	1	0	0	0	1	1	0
15	0	0	1	0	0	1	1	0	0	1	0	0	0	0	0	0
16	1	0	0	1	0	0	0	0	0	0	0	0	1	0	0	1
17	1	0	0	1	0	0	0	0	0	0	0	0	1	0	0	1
18	0	0	1	0	0	1	1	0	0	1	0	0	0	0	0	0
19	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
20	0	1	1	0	1	1	1	1	1	1	1	1	0	1	1	0
21	0	0	0	0	0	0	0	1	1	0	0	0	0	1	1	0
22	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
23	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
24	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
25	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
26	1	1	1	1	1	1	1	0	0	1	1	1	1	0	0	1
27	0	1	0	0	1	0	0	0	0	0	1	1	0	0	0	0
28	0	0	0	1	0	0	0	1	1	0	0	0	1	1	1	1
29	0	0	0	0	0	0	0	1	1	0	0	0	0	1	1	0
30	0	0	0	1	0	0	0	1	1	0	0	0	1	1	1	1
Once you have collected your data, move on to your algorithm. In the following example I have cut down the input data to keep the output short, and I use hashes because they let me keep one counter per type name. I don't know exactly what you want with the 25/75% thingy, but you can easily add another counter to this algorithm that counts the times a 0 is encountered. I would work from there if you want some statistical calculation.
my @data = qw(Square Triangle Circle Square Triangle Circle Circle Rectangle
              Rectangle Circle Triangle Triangle Square Rectangle Rectangle Square) ;
$data = \@data ;
$attrs = [
    [1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
] ;

my @result;
for( my $j = 0 ; $j < @{$attrs} ; ++$j ) {
    my %subres ;
    @subres{@{$data}} = ( 0 ) x @{$attrs->[0]} ;
    for( my $i = 0 ; $i < @{$attrs->[$j]} ; ++$i ) {
        if ( $attrs->[$j][$i] == 1 ) {
            ++$subres{ $data->[$i]} ;
        }
    } ;
    push @result, \%subres ;
}
print Dumper(\@result) ;
The output is:
$VAR1 = [
          {
            'Square' => 4,
            'Circle' => 4,
            'Rectangle' => 4,
            'Triangle' => 0
          },
          {
            'Rectangle' => 0,
            'Triangle' => 0,
            'Circle' => 4,
            'Square' => 4
          }
        ];
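The extra counter for the 0s could be added along these lines. This is only a minimal sketch, assuming the 25/75% question is about the share of 1s per type within each attribute row. The helper name count_by_type and the hash keys ones and zeros are made up for illustration; they are not part of the script above.

```perl
use strict ;
use warnings ;

# Same reduced input as in the counting example.
my @data = qw(Square Triangle Circle Square Triangle Circle Circle Rectangle
              Rectangle Circle Triangle Triangle Square Rectangle Rectangle Square) ;
my @attrs = (
    [1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
) ;

# Hypothetical helper: for every attribute row, count per type how often
# a 1 and how often a 0 was seen. Returns a ref to an array of hash refs.
sub count_by_type {
    my ( $types, $rows ) = @_ ;
    my @result ;
    foreach my $row ( @{$rows} ) {
        my %subres ;
        $subres{$_} = { ones => 0, zeros => 0 } foreach @{$types} ;
        for my $i ( 0 .. $#{$row} ) {
            my $key = $row->[$i] == 1 ? 'ones' : 'zeros' ;
            ++$subres{ $types->[$i] }{$key} ;
        }
        push @result, \%subres ;
    }
    return \@result ;
}

my $result = count_by_type( \@data, \@attrs ) ;

# Print the counts and the share of 1s for each type in each attribute row.
for my $j ( 0 .. $#{$result} ) {
    foreach my $type ( sort keys %{ $result->[$j] } ) {
        my $c     = $result->[$j]{$type} ;
        my $total = $c->{ones} + $c->{zeros} ;
        printf "attribute %d %-9s ones=%d zeros=%d (%.0f%% ones)\n",
            $j + 1, $type, $c->{ones}, $c->{zeros}, 100 * $c->{ones} / $total ;
    }
}
```

With this input, the first attribute row gives 4 ones and 0 zeros for Square, and 0 ones and 4 zeros for Triangle. Whatever threshold you have in mind (25%, 75%) can then be tested against the computed share.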