in reply to Re: Best way to store/access large dataset?
in thread Best way to store/access large dataset?

I think I'm butchering this trying to get it to work. Is this the right approach to get my data in?

use strict ;
use warnings ;
use Data::Dumper ;

open my $data, "<", "ID_file.txt" or die "NO ID FILE: $!" ;
open my $attrs, "<", "Attribute_file.txt" or die "NO ATTR FILE: $!" ;

sub getdata {
    while ( <$data> ) {
        my ( $fileName, $type ) = split /\t/, $_[1] ;
        push @{$data}, $type unless !defined $fileName ;
    }
}

sub getattrs {
    while ( <$attrs> ) {
        my @attrs = split /\t/, $_[1] ;
        shift @attrs ;
        push @{$attrs}, \@attrs unless !defined $attrs[0] ;
    }
}

I know this isn't right, just not sure why.

Replies are listed 'Best First'.
Re^3: Best way to store/access large dataset?
by Veltro (Hermit) on Jun 22, 2018 at 20:30 UTC

    Try it with this; make sure that you remove the first two lines from both your data files or it won't work:

     use strict ;
     use warnings ;
     use Data::Dumper ;

     open my $dataIn1, "<", "ID_file.txt" or die "NO ID FILE: $!" ;
     open my $dataIn2, "<", "Attribute_file.txt" or die "NO ATTR FILE: $!" ;

     my $data = () ;
     my $attrs = () ;

     sub getdata {
         my ( $fileName, $type ) = split /\t/, $_[1] ;
         push @{$data}, $type unless !defined $fileName ;
     }

     sub getattrs {
         my @attrs = split /\t/, $_[1] ;
         shift @attrs ;
         push @{$attrs}, \@attrs unless !defined $attrs[0] ;
     }

     while ( <$dataIn1> ) {
         chomp ;
         # In my previous example I used a counter which is not
         # available here, so that is why the first value is 0
         getdata( 0, $_ ) ;
     }

     while ( <$dataIn2> ) {
         chomp ;
         getattrs( 0, $_ ) ;
     }

     print Dumper( $data ) ;
     print Dumper( $attrs ) ;

      I can't quite visualize it, but what you're doing is assigning each file its category name and carrying that forward, right? And then it just counts up the "hits" in each category for each attribute.

      Just for general knowledge, on a 3/4-size data set, it takes approximately 16 minutes before the dumper starts printing to screen. That's where I was wondering if this was the type of thing that could be forked? Also, is $j an arbitrary variable, or is it special? And $i is a special variable, right? I was hoping to shoehorn the attribute ID into the data structure in order to use it in an output at the end of this.

      This works:

       use strict ;
       use warnings ;
       use Data::Dumper ;

       open my $dataIn1, "<", "Attribute_ID.txt" or die "NO ID FILE: $!" ;
       open my $dataIn2, "<", "Attributes.txt" or die "NO ATTR FILE: $!" ;

       my $data = () ;
       my $attrs = () ;

       sub getdata {
           my ( $fileName, $type ) = split /\t/, $_[1] ;
           push @{$data}, $type unless !defined $fileName ;
       }

       sub getattrs {
           my @attrs = split /\t/, $_[1] ;
           shift @attrs ;
           push @{$attrs}, \@attrs unless !defined $attrs[0] ;
       }

       while ( <$dataIn1> ) {
           chomp ;
           getdata( 0, $_ ) ;
       }

       while ( <$dataIn2> ) {
           chomp ;
           getattrs( 0, $_ ) ;
       }

       my @result ;
       for ( my $j = 0 ; $j < @{$attrs} ; ++$j ) {
           my %subres ;
           @subres{ @{$data} } = ( 0 ) x @{ $attrs->[0] } ;
           for ( my $i = 0 ; $i < @{ $attrs->[$j] } ; ++$i ) {
               if ( $attrs->[$j][$i] == 1 ) {
                   ++$subres{ $data->[$i] } ;
               }
           }
           push @result, \%subres ;
       }
       print Dumper( \@result ) ;
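       To see what that counting loop actually produces, here is a minimal self-contained sketch of the same logic, with the two input files replaced by hypothetical inline data (the categories and flag values are made up for illustration):

       ```perl
       use strict ;
       use warnings ;
       use Data::Dumper ;

       # One category per file, in column order (stand-in for ID_file.txt).
       my $data = [ 'Square', 'Triangle', 'Square' ] ;

       # One row per attribute, ID already shifted off (stand-in for Attributes).
       my $attrs = [
           [ 1, 0, 1 ],   # attribute row 0
           [ 0, 1, 1 ],   # attribute row 1
       ] ;

       my @result ;
       for ( my $j = 0 ; $j < @{$attrs} ; ++$j ) {
           my %subres ;
           # Hash slice: one zeroed counter per category name.
           @subres{ @{$data} } = ( 0 ) x @{ $attrs->[0] } ;
           for ( my $i = 0 ; $i < @{ $attrs->[$j] } ; ++$i ) {
               # A 1 in column $i is a "hit" for that column's category.
               ++$subres{ $data->[$i] } if $attrs->[$j][$i] == 1 ;
           }
           push @result, \%subres ;
       }
       print Dumper( \@result ) ;
       # Row 0: Square => 2, Triangle => 0
       # Row 1: Square => 1, Triangle => 1
       ```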
        ...what you're doing is assigning each file it's category name and carrying that forward right?...

        I'm not really assigning anything. In your example each row from IDs corresponds to exactly one column in Attributes. So I used this to keep the code simple:

        1.file.ext Square --> corresponds to column 2 in Attributes
        2.file.ext Triangle --> corresponds to column 3 in Attributes
        ...
        16.file.ext Square --> corresponds to column 17 in Attributes

        ...Also, is $j an arbitrary variable, or is it special? And $i is a special variable right?...

        There is nothing 'special' about $i and $j. They are just used to traverse the data array and the multi-dimensional attrs array. In this case I used $j to address each attribute set in attrs, and $i to address each element in data and each individual attribute of the data subsets inside attrs.
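        As a small illustration, the two counters just walk the rows and columns of the nested structure (the labels and flags here are hypothetical):

        ```perl
        use strict ;
        use warnings ;

        my $data  = [ 'A', 'B' ] ;               # flat array: one label per column
        my $attrs = [ [ 1, 0 ], [ 0, 1 ] ] ;     # 2-D array: rows of attribute flags

        my @out ;
        for ( my $j = 0 ; $j < @{$attrs} ; ++$j ) {               # $j picks a row
            for ( my $i = 0 ; $i < @{ $attrs->[$j] } ; ++$i ) {   # $i picks a column
                push @out, "row $j, col $i: label=$data->[$i] flag=$attrs->[$j][$i]" ;
            }
        }
        print "$_\n" for @out ;
        ```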

        ...I was hoping to shoehorn the attribute ID into the data structure in order to use it in an output at the end of this...

        To get the ID in the data set, you can make these changes. I'm just adding it to the final result set with the key 'ID' in this case. (Line number followed by: < = remove and > = add):

         18 < shift @attrs ;
         35 > $subres{ID} = $attrs->[$j][0] ;
         36 < for( my $i = 0 ; $i < @{$attrs->[$j]} ; ++$i ) {
         36 > for( my $i = 1 ; $i < @{$attrs->[$j]} ; ++$i ) {
         38 < ++$subres{ $data->[$i]} ;
         38 > ++$subres{ $data->[$i-1]} ;

        On line 18 the row ID was removed from the attribute set, so we no longer do that. That means that in the for loop we need to start at index 1 instead of 0 (Line 36). However, the indexing in data has not changed, so we have to subtract 1, i.e. $i-1 (Line 38).
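        Put together, the modified inner loop looks like this on a single hypothetical attribute row (the ID and categories are made up):

        ```perl
        use strict ;
        use warnings ;

        my $data = [ 'Square', 'Triangle' ] ;
        my $row  = [ 'attr07', 1, 1 ] ;   # hypothetical row: ID first, then flags

        my %subres ;
        @subres{ @{$data} } = ( 0 ) x @{$data} ;
        $subres{ID} = $row->[0] ;                   # keep the row ID in the result
        for ( my $i = 1 ; $i < @{$row} ; ++$i ) {   # start at 1 to skip the ID
            # $i-1 realigns the flag column with the data array
            ++$subres{ $data->[ $i - 1 ] } if $row->[$i] == 1 ;
        }
        ```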

        ...If I get rid of the first line in the second file, I'll lose the file name associated with the binary...

        I'm not sure what you mean by this association. Is it the order of appearance inside data that changes? Then I suggest a small piece of code that alters that order based on the column order inside Attributes.
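        For example (purely a sketch, assuming the first line of Attributes is a header row of file names giving the real column order, which the original files may or may not have), reordering could look like:

        ```perl
        use strict ;
        use warnings ;

        # From ID_file.txt: file name => category (order of appearance unknown).
        my %type_of = (
            '1.file.ext' => 'Square',
            '2.file.ext' => 'Triangle',
        ) ;

        # Hypothetical header row of Attributes: the actual column order.
        my @col_order = ( '2.file.ext', '1.file.ext' ) ;

        # Rebuild the flat category list so it matches the column order.
        my @data = map { $type_of{$_} } @col_order ;
        ```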

        ...Which would make them not be able to be grouped by category?...

        What needs to be grouped? Do you have examples?

        ...And it's probably also important to point out that the attribute numbers aren't arbitrary, they are defined....

        What do you mean by defined? In your example you show attributes that are binary; they are either 0 or 1. If there is something specific that needs to be done, can you try to visualize that?

      If I get rid of the first line in the second file, I'll lose the file name associated with the binary. Wouldn't that mean they can't be grouped by category? And it's probably also important to point out that the attribute numbers aren't arbitrary; they are defined. I can always sort my input file so they are listed in order, which would be a workaround if I can't carry the numbers forward.