Re: Best way to store/access large dataset?

I would suggest a two-step approach.

1. Get your data structures as you want them

2. Process your data

For example, build something like this first to get the data. (I show a rather simple script here, but if the data is already coming from a database, I wonder why you don't pour it into the correct format right away...)

use strict ;
use warnings ;
use Data::Dumper ;

my $data = [] ;    # array ref: one category name per file
my $attrs = [] ;   # array ref: one array ref of 0/1 values per attribute

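# Called as getData( $counter, $line ): stores the category of one file.
sub getData {
    my ( $fileName, $type ) = split /\t/, $_[1] ;
    push @{$data}, $type if defined $fileName ;
}

# Called as getAttrs( $counter, $line ): stores one row of 0/1 values.
sub getAttrs {
    my @attrs = split /\t/, $_[1] ;
    shift @attrs ;    # drop the attribute number in the first column
    push @{$attrs}, \@attrs if defined $attrs[0] ;
}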

# Gather data
my $context = 0 ;
my $counter = -1 ;
while(<DATA>) {
    chomp ;
    if ( $_ =~ /ID\'s/ ) {
        $context = 1 ;
        $counter = -1 ;
        next ;
    }
    if ( $_ =~ /Attributes/ ) {
        $context = 2 ;
        $counter = -1 ;
        next ;
    }
    if ( $context == 1 && $counter == -1 ) {
        ++$counter ;
        next ;
    } elsif ( $context == 1 && $counter > -1 ) {
        getData($counter, $_) ;
        ++$counter ;
    }
    if ( $context == 2 && $counter == -1 ) {
        ++$counter ;
        next ;
    } elsif ( $context == 2 && $counter > -1 ) {
        getAttrs($counter, $_) ;
        ++$counter ;
    }
} ;
foreach ( @{$data } ) {
    print $_ . " " ;
}
print "\n" ;
foreach ( @{$attrs->[0] } ) {
    print $_ . " " ;
} 
print "\n" ;
__DATA__

#ID's
File    ID
1.file.ext    Square
2.file.ext    Triangle
3.file.ext    Circle
4.file.ext    Square
5.file.ext    Triangle
6.file.ext    Circle
7.file.ext    Circle
8.file.ext    Rectangle
9.file.ext    Rectangle
10.file.ext    Circle
11.file.ext    Triangle
12.file.ext    Triangle
13.file.ext    Square
14.file.ext    Rectangle
15.file.ext    Rectangle
16.file.et    Square

#Attributes
attribute    1.file.ext    2.file.ext    3.file.ext    4.file.ext    5.file.ext    6.file.ext    7.file.ext    8.file.ext    9.file.ext    10.file.ext    11.file.ext    12.file.ext    13.file.ext    14.file.ext    15.file.ext    16.file.et
1    1    0    1    1    0    1    1    1    1    1    0    0    1    1    1    1
2    1    0    1    1    0    1    1    0    0    1    0    0    1    0    0    1
3    0    1    0    0    1    0    0    1    1    0    1    1    0    1    1    0
4    0    1    1    0    1    1    1    1    1    1    1    1    0    1    1    0
5    0    1    0    0    1    0    0    0    0    0    1    1    0    0    0    0
6    0    0    0    0    0    0    0    1    1    0    0    0    0    1    1    0
7    0    0    1    0    0    1    1    1    1    1    0    0    0    1    1    0
8    1    0    1    1    0    1    1    1    1    1    0    0    1    1    1    1
9    0    0    0    0    0    0    0    1    1    0    0    0    0    1    1    0
10    0    1    0    0    1    0    0    0    0    0    1    1    0    0    0    0
11    0    1    0    0    1    0    0    1    1    0    1    1    0    1    1    0
12    1    1    1    1    1    1    1    0    0    1    1    1    1    0    0    1
13    0    0    1    0    0    1    1    0    0    1    0    0    0    0    0    0
14    0    0    1    0    0    1    1    1    1    1    0    0    0    1    1    0
15    0    0    1    0    0    1    1    0    0    1    0    0    0    0    0    0
16    1    0    0    1    0    0    0    0    0    0    0    0    1    0    0    1
17    1    0    0    1    0    0    0    0    0    0    0    0    1    0    0    1
18    0    0    1    0    0    1    1    0    0    1    0    0    0    0    0    0
19    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
20    0    1    1    0    1    1    1    1    1    1    1    1    0    1    1    0
21    0    0    0    0    0    0    0    1    1    0    0    0    0    1    1    0
22    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
23    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
24    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
25    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
26    1    1    1    1    1    1    1    0    0    1    1    1    1    0    0    1
27    0    1    0    0    1    0    0    0    0    0    1    1    0    0    0    0
28    0    0    0    1    0    0    0    1    1    0    0    0    1    1    1    1
29    0    0    0    0    0    0    0    1    1    0    0    0    0    1    1    0
30    0    0    0    1    0    0    0    1    1    0    0    0    1    1    1    1

Once you have collected your data, move on to your algorithm. In the following example I have reduced the amount of input data to keep the output short, and I use one hash per attribute row so that the counts are keyed by category name. I don't know exactly what you want to do with the 25/75% part, but you can easily add another counter to this algorithm that counts the number of times a 0 is encountered. I would work from there if you want to do some statistical calculation.

my @data = qw(
    Square Triangle Circle Square Triangle Circle Circle Rectangle
    Rectangle Circle Triangle Triangle Square Rectangle Rectangle Square
) ;
$data = \@data ;
$attrs = [
    [1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
] ;
my @result;
for( my $j = 0 ; $j < @{$attrs} ; ++$j ) {
    my %subres ;
    @subres{@{$data}} = ( 0 ) x @{$attrs->[0]} ;
    for( my $i = 0 ; $i < @{$attrs->[$j]} ; ++$i ) {
        if ( $attrs->[$j][$i] == 1 ) {
            ++$subres{ $data->[$i]}  ; 
        }
    } ;
    push @result, \%subres ;
}
print Dumper(\@result) ;

The output is:

$VAR1 = [
          {
            'Square' => 4,
            'Circle' => 4,
            'Rectangle' => 4,
            'Triangle' => 0
          },
          {
            'Rectangle' => 0,
            'Triangle' => 0,
            'Circle' => 4,
            'Square' => 4
          }
        ];
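
The extra counter for 0s mentioned above could be sketched roughly as follows. This is only an illustration on the same reduced input: the helper name countHits and the hash keys ones and zeros are made up for this example, and what you do with the resulting fraction depends on what the 25/75% rule is supposed to mean.

```perl
use strict ;
use warnings ;

# Sample input copied from the reduced example above.
my $data = [ qw(
    Square Triangle Circle Square Triangle Circle Circle Rectangle
    Rectangle Circle Triangle Triangle Square Rectangle Rectangle Square
) ] ;
my $attrs = [
    [1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
] ;

# countHits is a hypothetical helper: for each attribute row it returns
# a hash of category => { ones => n, zeros => m }.
sub countHits {
    my ( $data, $attrs ) = @_ ;
    my @result ;
    foreach my $row ( @{$attrs} ) {
        my %subres ;
        # Start every category at zero so that empty categories still show up.
        $subres{$_} = { ones => 0, zeros => 0 } foreach @{$data} ;
        for my $i ( 0 .. $#{$row} ) {
            my $key = $row->[$i] == 1 ? 'ones' : 'zeros' ;
            ++$subres{ $data->[$i] }{$key} ;
        }
        push @result, \%subres ;
    }
    return \@result ;
}

my $result = countHits( $data, $attrs ) ;

# Print the counts and the fraction of 1s per category for each row.
for my $j ( 0 .. $#{$result} ) {
    foreach my $cat ( sort keys %{ $result->[$j] } ) {
        my $c = $result->[$j]{$cat} ;
        my $total = $c->{ones} + $c->{zeros} ;
        printf "row %d %-9s ones=%d zeros=%d fraction=%.2f\n",
            $j, $cat, $c->{ones}, $c->{zeros},
            $total ? $c->{ones} / $total : 0 ;
    }
}
```

With both counters available, a threshold test becomes a comparison on the fraction per category and per row.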

Re^2: Best way to store/access large dataset? by Speed_Freak (Sexton) on Jun 22, 2018 at 19:21 UTC
I think I'm butchering this trying to get it to work. Is this the right approach to get my data in?

use strict ;
use warnings ;
use Data::Dumper ;

open my $data,"<","ID_file.txt" or die "NO ID FILE: $!";
open my $attrs,"<","Attribute_file.txt" or die "NO ATTR FILE: $!";

sub getdata {
    while(<$data>){
        my ( $fileName, $type ) = split /\t/, $_[1] ;
        push @{$data}, $type unless !defined $fileName ;
    }
}

sub getattrs {
    while(<$attrs>){
        my @attrs = split /\t/, $_[1] ;
        shift @attrs ;
        push @{$attrs}, \@attrs unless !defined $attrs[0] ;
    }
}

I know this isn't right, just not sure why.
Re^3: Best way to store/access large dataset? by Veltro (Hermit) on Jun 22, 2018 at 20:30 UTC
Try it with this. Make sure that you remove the first two lines from both your data files or it won't work:

use strict ;
use warnings ;
use Data::Dumper ;

open my $dataIn1, "<", "ID_file.txt" or die "NO ID FILE: $!";
open my $dataIn2, "<", "Attribute_file.txt" or die "NO ATTR FILE: $!";

my $data = () ;
my $attrs = () ;

sub getdata {
    my ( $fileName, $type ) = split /\t/, $_[1] ;
    push @{$data}, $type unless !defined $fileName ;
}

sub getattrs {
    my @attrs = split /\t/, $_[1] ;
    shift @attrs ;
    push @{$attrs}, \@attrs unless !defined $attrs[0] ;
}

while( <$dataIn1> ) {
    chomp ;
    # In my previous example I used
    # a counter which is not available
    # here, so that is why the first
    # value is 0
    getdata( 0, $_ ) ;
}

while( <$dataIn2> ) {
    chomp ;
    getattrs( 0, $_ ) ;
}

print Dumper( $data ) ;
print Dumper ( $attrs ) ;
Re^4: Best way to store/access large dataset? by Speed_Freak (Sexton) on Jun 25, 2018 at 14:33 UTC
I can't quite visualize it, but what you're doing is assigning each file its category name and carrying that forward, right? And then it just counts up the "hits" in each category for each attribute.

Just for general knowledge, on a 3/4 size data set, it takes approximately 16 minutes before the dumper starts printing to screen. That's where I was wondering if this was the type of thing that could be forked?

Also, is $j an arbitrary variable, or is it special? And $i is a special variable, right? I was hoping to shoehorn the attribute ID into the data structure in order to use it in an output at the end of this.

This works:

use strict ;
use warnings ;
use Data::Dumper ;

open my $dataIn1, "<", "Attribute_ID.txt" or die "NO ID FILE: $!";
open my $dataIn2, "<", "Attributes.txt" or die "NO ATTR FILE: $!";

my $data = () ;
my $attrs = () ;

sub getdata {
    my ( $fileName, $type ) = split /\t/, $_[1] ;
    push @{$data}, $type unless !defined $fileName ;
}

sub getattrs {
    my @attrs = split /\t/, $_[1] ;
    shift @attrs ;
    push @{$attrs}, \@attrs unless !defined $attrs[0] ;
}

while( <$dataIn1> ) {
    chomp ;
    getdata( 0, $_ ) ;
}

while( <$dataIn2> ) {
    chomp ;
    getattrs( 0, $_ ) ;
}

my @result;
for( my $j = 0 ; $j < @{$attrs} ; ++$j ) {
    my %subres ;
    @subres{@{$data}} = ( 0 ) x @{$attrs->[0]} ;
    for( my $i = 0 ; $i < @{$attrs->[$j]} ; ++$i ) {
        if ( $attrs->[$j][$i] == 1 ) {
            ++$subres{ $data->[$i]} ;
        }
    } ;
    push @result, \%subres ;
}
print Dumper(\@result) ;
Re^5: Best way to store/access large dataset? by Veltro (Hermit) on Jun 26, 2018 at 08:45 UTC
Re^6: Best way to store/access large dataset? by Speed_Freak (Sexton) on Jun 26, 2018 at 23:03 UTC
Some notes below your chosen depth have not been shown here
Re^4: Best way to store/access large dataset? by Speed_Freak (Sexton) on Jun 25, 2018 at 13:10 UTC
If I get rid of the first line in the second file, I'll lose the file name associated with each binary value. Wouldn't that make it impossible to group them by category? It's probably also important to point out that the attribute numbers aren't arbitrary; they are defined. I can always sort my input file so they are listed in order, which would be a workaround if I can't carry the numbers forward.
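
One possible way to keep both the file names and the attribute numbers is to leave the header row in the attribute file and key the result by the attribute ID instead of by row position. The following is a minimal sketch only. It uses small in-memory lists in place of the real files, and the names %typeOf, @files, @idLines and @attrLines are assumptions made for this illustration, not part of the code posted in this thread.

```perl
use strict ;
use warnings ;

# Hypothetical in-memory stand-ins for the two files; real input would be
# read line by line from file handles instead.
my @idLines = (
    "1.file.ext\tSquare",
    "2.file.ext\tTriangle",
    "3.file.ext\tCircle",
) ;
my @attrLines = (
    "attribute\t1.file.ext\t2.file.ext\t3.file.ext",
    "1\t1\t0\t1",
    "2\t1\t0\t0",
) ;

# Map each file name to its category.
my %typeOf ;
foreach ( @idLines ) {
    my ( $fileName, $type ) = split /\t/ ;
    $typeOf{$fileName} = $type if defined $type ;
}

# The header row gives the file name for every column; the first
# field ("attribute") is discarded.
my ( undef, @files ) = split /\t/, shift @attrLines ;

# Count hits per category, keyed by the attribute ID in the first column.
my %result ;
foreach ( @attrLines ) {
    my ( $attrId, @bits ) = split /\t/ ;
    for my $i ( 0 .. $#bits ) {
        my $type = $typeOf{ $files[$i] } ;
        $result{$attrId}{$type} += $bits[$i] ;
    }
}

# Report the counts in numeric attribute order.
foreach my $attrId ( sort { $a <=> $b } keys %result ) {
    my $counts = $result{$attrId} ;
    print "$attrId: ",
        join( ", ", map { "$_=$counts->{$_}" } sort keys %{$counts} ), "\n" ;
}
```

Because the columns are matched to categories by file name, the order of the lines in the ID file no longer has to match the column order, and the attribute numbers survive into the output without sorting the input first.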