Re: storing a file in 2d array

G'day shabird,

I notice most, if not all, of your posts relate to biological data which, as I'm sure you're aware, can be huge (often measured in gigabytes). I have also noticed that, in many cases, you've read entire file contents into a variable and then subsequently processed that variable's data; e.g.

my @content = (<FH>);

I would recommend you look for ways to process the data as you read it from your input file. This will generally be more efficient and use substantially less memory, because only the current record needs to be held at any one time. It's not always possible to do this but in many cases it is. Where you can't, consider storing only a subset of the input data: you often won't need every piece of information for the task at hand.
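
As a rough sketch of that idea (the sample data, the in-memory filehandle and the sub name count_fields_per_line are all made up for illustration, and a plain split is used only to keep the sketch free of dependencies), this reads one record at a time and keeps just a small summary rather than the whole file:

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical sample data, read via an in-memory filehandle so the
# sketch is self-contained; a real script would open a file on disk.
my $sample = "ProteinName\tMF1\tMF2\n"
           . "GH1\tGrowth factor activity\tHormone activity\n"
           . "POMC\tHormone activity\n";

# Reads one line at a time: only the current line and the small
# summary hash are held in memory, never the whole file.
sub count_fields_per_line {
    my ($fh) = @_;

    my %field_count_for;
    my $header = <$fh>;    # discard the header record

    while (my $line = <$fh>) {
        chomp $line;
        my ($name, @fields) = split /\t/, $line;
        $field_count_for{$name} = scalar @fields;
    }

    return \%field_count_for;
}

open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
my $count_for = count_fields_per_line($fh);
close $fh;

print "$_ : $count_for->{$_}\n" for sort keys %$count_for;
```

Compare that with my @content = (<FH>); which holds every line in memory before any processing starts.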

For your current task, I would recommend Text::CSV for reading the input; if you also have Text::CSV_XS installed, Text::CSV will use it automatically and run faster.

I've included two ways to do this: one with the 2D array you say you want; and one without that intermediary data structure (as I discussed above).

You've described the first part of your task well; however, the second part, with the counts, is a little sketchy. I've made two guesses regarding the counting (the number of MF columns per protein; and the total number of words in those columns): I don't know if either is what you want but you may, at least, get some ideas from them.
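
Here's a small sketch of the two counts applied to a single row of the sample data. The sub names count_mf_columns and count_mf_words are hypothetical, chosen just for this illustration; the row is laid out as Text::CSV's getline() would return it.

```perl
use strict;
use warnings;

# One row as an arrayref: protein name first, MF columns after it.
my $sample_row = [
    'THRAP3',
    'ATP binding Source',
    'Nuclear receptor transcription coactivator activity',
    'Phosphoprotein binding',
];

# GUESS 1: number of MF columns (every element except the first).
sub count_mf_columns {
    my ($row) = @_;
    return $#$row;
}

# GUESS 2: total number of whitespace-separated words in the MF columns.
sub count_mf_words {
    my ($row) = @_;
    my @words = map { split ' ', $_ } @$row[1 .. $#$row];
    return scalar @words;
}

print "$sample_row->[0] : ", count_mf_columns($sample_row), "\n";  # THRAP3 : 3
print "$sample_row->[0] : ", count_mf_words($sample_row), "\n";    # THRAP3 : 10
```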

I copied your sample input from the [download] link (thanks for providing that). As I see some discussion, in a number of responses, regarding whether tabs are correctly represented, I've added &show_verbatim_input so you can see exactly what I'm working with.

Here's the code:

#!/usr/bin/env perl

use strict;
use warnings;
use autodie ':all';    # ':all' also covers system(); needs IPC::System::Simple

use Data::Dump;
use Text::CSV;

{
    my $source_file = 'pm_11116298_gene.txt';

    show_verbatim_input($source_file);
    process_without_2d_array($source_file);
    my $data_2d_ref = process_with_2d_array($source_file);
    # Do more processing with $data_2d_ref
}

sub process_without_2d_array {
    my ($file) = @_;

    print "\n\n+++++ WITHOUT INTERMEDIATE 2D ARRAY +++++\n";

    my @proteins;
    my %count_of = (just_mfs => {}, all_mf_elements => {});

    {
        open my $fh, '<', $file;
        { my $header_record_to_discard = <$fh>; }
        my $csv = Text::CSV::->new({sep => "\t"});

        print "\n*** Wanted Data Output ***\n";

        while (my $row = $csv->getline($fh)) {
            push @proteins, $row->[0];
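            # GUESS 1: number of MF columns in this row
            $count_of{just_mfs}{$row->[0]} = $#$row;
            # GUESS 2: number of whitespace-separated words in those columns
            $count_of{all_mf_elements}{$row->[0]}
                += scalar map split, @$row[1..$#$row];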
            print join('; ', @$row), "\n";
        }
    }

    print "\n*** Wanted Row Counts (GUESS 1) ***\n";
    print "$_ : $count_of{just_mfs}{$_}\n" for @proteins;

    print "\n*** Wanted Row Counts (GUESS 2) ***\n";
    print "$_ : $count_of{all_mf_elements}{$_}\n" for @proteins;

    return;
}

sub process_with_2d_array {
    my ($file) = @_;

    print "\n\n+++++ WITH INTERMEDIATE 2D ARRAY +++++\n";

    my @data_2d;

    {
        open my $fh, '<', $file;
        { my $header_record_to_discard = <$fh>; }
        my $csv = Text::CSV::->new({sep => "\t"});

        while (my $row = $csv->getline($fh)) {
            push @data_2d, $row;
        }
    }

    print "\n*** 2D Array of Data ***\n";
    dd \@data_2d;

    print "\n*** Wanted Data Output ***\n";
    print join('; ', @$_), "\n" for @data_2d;

    print "\n*** Wanted Row Counts (GUESS 1) ***\n";
    print "$_->[0] : $#$_\n" for @data_2d;

    print "\n*** Wanted Row Counts (GUESS 2) ***\n";
    print "$_->[0] : ",
        scalar(map split, @$_[1..$#$_]), "\n" for @data_2d;

    return \@data_2d;
}

sub show_verbatim_input {
    my ($file) = @_;

    print "*** Input File ($file) ***\n",
          "        ('^I' = TAB; '\$' = NEWLINE)\n";
    system qw{cat -vet}, $file;

    return;
}
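
The show_verbatim_input routine relies on an external cat command, which won't exist on every platform. A rough pure-Perl stand-in, assuming you only care about tabs and line endings, might look like this (the sub name visible_line is hypothetical, and control characters other than TAB, which cat -v would also mark, are left untouched):

```perl
use strict;
use warnings;

# Rough stand-in for `cat -vet`: shows TAB as '^I' and marks the end of
# each newline-terminated line with '$'. Other control characters are
# passed through unchanged.
sub visible_line {
    my ($line) = @_;

    my $had_newline = chomp $line;    # chomp returns the number of chars removed
    $line =~ s/\t/^I/g;
    $line .= "\$\n" if $had_newline;

    return $line;
}

print visible_line("GH1\tGrowth factor activity\n");
# prints: GH1^IGrowth factor activity$
```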

Here's the output:

*** Input File (pm_11116298_gene.txt) ***
        ('^I' = TAB; '$' = NEWLINE)
ProteinName^IMF1^IMF2^IMF3$
GH1^IGrowth factor activity^IGrowth hormone receptor binding^IHormone activity$
POMC^IG protein-coupled receptor binding^IHormone activity^ISignaling receptor binding$
THRAP3^IATP binding Source^INuclear receptor transcription coactivator activity^IPhosphoprotein binding$


+++++ WITHOUT INTERMEDIATE 2D ARRAY +++++

*** Wanted Data Output ***
GH1; Growth factor activity; Growth hormone receptor binding; Hormone activity
POMC; G protein-coupled receptor binding; Hormone activity; Signaling receptor binding
THRAP3; ATP binding Source; Nuclear receptor transcription coactivator activity; Phosphoprotein binding

*** Wanted Row Counts (GUESS 1) ***
GH1 : 3
POMC : 3
THRAP3 : 3

*** Wanted Row Counts (GUESS 2) ***
GH1 : 9
POMC : 9
THRAP3 : 10


+++++ WITH INTERMEDIATE 2D ARRAY +++++

*** 2D Array of Data ***
[
  [
    "GH1",
    "Growth factor activity",
    "Growth hormone receptor binding",
    "Hormone activity",
  ],
  [
    "POMC",
    "G protein-coupled receptor binding",
    "Hormone activity",
    "Signaling receptor binding",
  ],
  [
    "THRAP3",
    "ATP binding Source",
    "Nuclear receptor transcription coactivator activity",
    "Phosphoprotein binding",
  ],
]

*** Wanted Data Output ***
GH1; Growth factor activity; Growth hormone receptor binding; Hormone activity
POMC; G protein-coupled receptor binding; Hormone activity; Signaling receptor binding
THRAP3; ATP binding Source; Nuclear receptor transcription coactivator activity; Phosphoprotein binding

*** Wanted Row Counts (GUESS 1) ***
GH1 : 3
POMC : 3
THRAP3 : 3

*** Wanted Row Counts (GUESS 2) ***
GH1 : 9
POMC : 9
THRAP3 : 10

— Ken
