storing a file in 2d array

shabird has asked for the wisdom of the Perl Monks concerning the following question:

Re: storing a file in 2d array by hippo (Archbishop) on May 01, 2020 at 08:36 UTC
If I had a TSV file to read into an AoA I would let a module do the heavy work:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV_XS 'csv';
use feature 'say';

my $aoa    = csv (in => 'gene.txt', sep => "\t");
my $header = shift @$aoa;
say join '; ', @$_ for @$aoa;
```
Re: storing a file in 2d array by bliako (Abbot) on May 03, 2020 at 08:49 UTC
There may be a Perl interface for fetching and processing your data, if it's from a well-known source, e.g. Ensembl, Entrez, KEGG, *Prot, etc. For example:

http://ensemblgenomes.org/info/access/eg_api
Bio::DB::SwissProt
Bio::KEGG
Bio::DB::EMBL
Bio::DB::Taxonomy

The usage can be as simple as this:

```perl
use Bio::DB::Taxonomy;

my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
# use NCBI Entrez over HTTP
my $taxonid = $db->get_taxonid('Homo sapiens');

# get a taxon
my $taxon = $db->get_taxon(-taxonid => $taxonid);
```

bw, bliako
Re: storing a file in 2d array by kcott (Archbishop) on May 02, 2020 at 16:56 UTC
G'day shabird,

I notice most, if not all, of your posts relate to biological data which, as I'm sure you're aware, can be huge (often measured in gigabytes). I have also noticed that, in many cases, you've read entire file contents into a variable and then subsequently processed that variable's data; e.g. `my @content = (<FH>);`

I would recommend you look for ways to process the data as you read it from your input file. This will be more efficient and will use substantially less memory. It's not always possible to do this but in many cases it is. Where you can't do this, consider only storing a subset of the input data: you often won't need every piece of information for the task at hand.

For your current task, I would recommend Text::CSV for reading the input; if you also have Text::CSV_XS installed it will run faster. I've included two ways to do this: one with the 2D array you say you want; and one without that intermediary data structure (as I discussed above).

You've described the first part of your task well; however, the second part, with the counts, is a little sketchy. I've made two guesses regarding the counting: I don't know if either is what you want but you may, at least, get some ideas from them.

I copied your sample input from the [download] link (thanks for providing that). As I see some discussion, in a number of responses, regarding whether tabs are correctly represented, I've added `&show_verbatim_input` so you can see exactly what I'm working with.
Here's the code:

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use autodie ':all';

use Data::Dump;
use Text::CSV;

{
    my $source_file = 'pm_11116298_gene.txt';

    show_verbatim_input($source_file);

    process_without_2d_array($source_file);

    my $data_2d_ref = process_with_2d_array($source_file);
    # Do more processing with $data_2d_ref
}

sub process_without_2d_array {
    my ($file) = @_;

    print "\n\n+++++ WITHOUT INTERMEDIATE 2D ARRAY +++++\n";

    my @proteins;
    my %count_of = (just_mfs => {}, all_mf_elements => {});

    {
        open my $fh, '<', $file;
        { my $header_record_to_discard = <$fh>; }
        my $csv = Text::CSV::->new({sep => "\t"});

        print "\n*** Wanted Data Output ***\n";

        while (my $row = $csv->getline($fh)) {
            push @proteins, $row->[0];
            $count_of{just_mfs}{$row->[0]} = $#$row;
            $count_of{all_mf_elements}{$row->[0]}
                += scalar map split, @$row[1..$#$row];
            print join('; ', @$row), "\n";
        }
    }

    print "\n*** Wanted Row Counts (GUESS 1) ***\n";
    print "$_ : $count_of{just_mfs}{$_}\n" for @proteins;

    print "\n*** Wanted Row Counts (GUESS 2) ***\n";
    print "$_ : $count_of{all_mf_elements}{$_}\n" for @proteins;

    return;
}

sub process_with_2d_array {
    my ($file) = @_;

    print "\n\n+++++ WITH INTERMEDIATE 2D ARRAY +++++\n";

    my @data_2d;

    {
        open my $fh, '<', $file;
        { my $header_record_to_discard = <$fh>; }
        my $csv = Text::CSV::->new({sep => "\t"});

        while (my $row = $csv->getline($fh)) {
            push @data_2d, $row;
        }
    }

    print "\n*** 2D Array of Data ***\n";
    dd \@data_2d;

    print "\n*** Wanted Data Output ***\n";
    print join('; ', @$_), "\n" for @data_2d;

    print "\n*** Wanted Row Counts (GUESS 1) ***\n";
    print "$_->[0] : $#$_\n" for @data_2d;

    print "\n*** Wanted Row Counts (GUESS 2) ***\n";
    print "$_->[0] : ", scalar(map split, @$_[1..$#$_]), "\n" for @data_2d;

    return \@data_2d;
}

sub show_verbatim_input {
    my ($file) = @_;

    print "*** Input File ($file) ***\n",
          "*** ('^I' = TAB; '\$' = NEWLINE)\n";
    system qw{cat -vet}, $file;

    return;
}
```

Here's the output:

```
*** Input File (pm_11116298_gene.txt) ***
*** ('^I' = TAB; '$' = NEWLINE)
ProteinName^IMF1^IMF2^IMF3$
GH1^IGrowth factor activity^IGrowth hormone receptor binding^IHormone activity$
POMC^IG protein-coupled receptor binding^IHormone activity^ISignaling receptor binding$
THRAP3^IATP binding Source^INuclear receptor transcription coactivator activity^IPhosphoprotein binding$


+++++ WITHOUT INTERMEDIATE 2D ARRAY +++++

*** Wanted Data Output ***
GH1; Growth factor activity; Growth hormone receptor binding; Hormone activity
POMC; G protein-coupled receptor binding; Hormone activity; Signaling receptor binding
THRAP3; ATP binding Source; Nuclear receptor transcription coactivator activity; Phosphoprotein binding

*** Wanted Row Counts (GUESS 1) ***
GH1 : 3
POMC : 3
THRAP3 : 3

*** Wanted Row Counts (GUESS 2) ***
GH1 : 9
POMC : 9
THRAP3 : 10


+++++ WITH INTERMEDIATE 2D ARRAY +++++

*** 2D Array of Data ***
[
  [
    "GH1",
    "Growth factor activity",
    "Growth hormone receptor binding",
    "Hormone activity",
  ],
  [
    "POMC",
    "G protein-coupled receptor binding",
    "Hormone activity",
    "Signaling receptor binding",
  ],
  [
    "THRAP3",
    "ATP binding Source",
    "Nuclear receptor transcription coactivator activity",
    "Phosphoprotein binding",
  ],
]

*** Wanted Data Output ***
GH1; Growth factor activity; Growth hormone receptor binding; Hormone activity
POMC; G protein-coupled receptor binding; Hormone activity; Signaling receptor binding
THRAP3; ATP binding Source; Nuclear receptor transcription coactivator activity; Phosphoprotein binding

*** Wanted Row Counts (GUESS 1) ***
GH1 : 3
POMC : 3
THRAP3 : 3

*** Wanted Row Counts (GUESS 2) ***
GH1 : 9
POMC : 9
THRAP3 : 10
```

-- Ken
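To illustrate the recommendation to process records as they are read, rather than slurping the whole file first, here is a minimal module-free sketch. It is not part of kcott's code: the sub name `count_fields_streaming` and the in-memory sample data are assumptions made for illustration, and plain `split` on tabs is only adequate when fields contain no quoting or embedded tabs (Text::CSV handles those cases).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample data standing in for a (possibly huge) TSV file.
my $tsv = join '',
    "ProteinName\tMF1\tMF2\tMF3\n",
    "GH1\tGrowth factor activity\tGrowth hormone receptor binding\tHormone activity\n",
    "POMC\tG protein-coupled receptor binding\tHormone activity\tSignaling receptor binding\n";

# Count the non-name fields per protein, one record at a time.
# Only the counts are kept; the rows themselves are never stored.
sub count_fields_streaming {
    my ($fh) = @_;
    my %fields_for;
    my $header = <$fh>;                     # read and discard the header record
    while (my $line = <$fh>) {
        chomp $line;
        my ($name, @mfs) = split /\t/, $line;
        $fields_for{$name} = scalar @mfs;   # keep only what the task needs
    }
    return \%fields_for;
}

# An in-memory filehandle is used here so the sketch runs without a file.
open my $fh, '<', \$tsv or die "Cannot open in-memory file: $!";
my $counts = count_fields_streaming($fh);
close $fh;

print "$_ : $counts->{$_}\n" for sort keys %$counts;
```

With a real file, only the `open` line changes; memory use then depends on the number of distinct protein names rather than on the file size.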
Re: storing a file in 2d array by jo37 (Curate) on May 01, 2020 at 08:30 UTC
Issues with your program:

- Your data is tab-separated, but you split on blanks.
- The newline is included in your array.

After solving these, the question remains what you want to achieve with `$sum`. You try to add the string contents of your array, resulting in the warnings you see. Without knowing what kind of sum you want, I cannot help at this point.

As far as you described the task, this would do:

```perl
#!/usr/bin/perl

use strict;
use warnings;

my @content = (<DATA>);
my @myArray;

for my $row (@content) {
    chomp $row;
    my @columns = split "\t", $row;
    push @myArray, \@columns;
}

my $title_row = shift @myArray;

for my $row (@myArray) {
    print join('; ', @$row), "\n";
}

__DATA__
ProteinName	MF1	MF2	MF3
GH1	Growth factor activity	Growth hormone receptor binding	Hormone activity
POMC	G protein-coupled receptor binding	Hormone activity	Signaling receptor binding
THRAP3	ATP binding Source	Nuclear receptor transcription coactivator activity	Phosphoprotein binding
```

Greetings,
-jo

`$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
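The two issues jo37 lists can be seen side by side in a small sketch. The sample line and variable names below are made up for illustration and are not taken from the original program.

```perl
use strict;
use warnings;

# One hypothetical tab-separated record, as read from a file (newline attached).
my $line = "GH1\tGrowth factor activity\tHormone activity\n";

# Splitting on blanks breaks multi-word fields apart.
my @by_blank = split ' ', $line;

# Splitting on tabs without chomp leaves the newline on the last field.
my @by_tab_raw = split /\t/, $line;

# chomp first, then split on tabs: three clean fields.
chomp(my $clean = $line);
my @by_tab = split /\t/, $clean;

printf "split on blanks: %d fields\n", scalar @by_blank;
printf "split on tabs:   %d fields\n", scalar @by_tab;
print  "without chomp the last field still ends in a newline\n"
    if $by_tab_raw[-1] =~ /\n\z/;
```

Splitting on blanks yields six fields for this record, while splitting on tabs yields the intended three.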
Re: storing a file in 2d array by AnomalousMonk (Archbishop) on May 01, 2020 at 09:57 UTC
I, also, am confused about what is supposed to be summed while processing the file (I can't see anything numeric in your sample data). (Update: I also don't understand what you want to do with a 2D array, or why.) However, this code will produce exactly the output you specify from the given input. (Caution: The tabs that are supposed to be in the `__DATA__` section may not survive the posting process. Check and restore them as needed.)

```perl
use strict;
use warnings;

<DATA>;  # ignore first input line

while (my $line = <DATA>) {
    $line =~ s{ \t }{; }xmsg;
    # make output ProteinName field min 4 cols wide, right justified.
    $line =~ s{ \A ([^;]*) (?= ;) }{ sprintf '%4s', $1 }xmse;
    print $line;
    }

exit;

__DATA__
ProteinName	MF1	MF2	MF3
GH1	Growth factor activity	Growth hormone receptor binding	Hormone activity
POMC	G protein-coupled receptor binding	Hormone activity	Signaling receptor binding
THRAP3	ATP binding Source	Nuclear receptor transcription coactivator activity	Phosphoprotein binding
```

Update: I just round-tripped the code posted above and it looks like the tabs in the `__DATA__` section survived intact!

Give a man a fish: `<%-{-{-{-<`
Re^2: storing a file in 2d array by jo37 (Curate) on May 01, 2020 at 11:35 UTC
> (Caution: The tabs that are supposed to be in the __DATA__ section may not survive the posting process. Check and restore them as needed.)

Seems to depend on how you copy the text. When using the download link, everything looks fine. As I copied the data from the OP without altering anything, I didn't even think about a possible issue with tabs.

Greetings,
-jo

`$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
Re: storing a file in 2d array by rnewsham (Curate) on May 01, 2020 at 08:07 UTC
If your data is tab separated you should split on a \t.

If you want your sum loop to count the number of elements in a row, a better way would be `for ( @$row ) { $sum++; }`

If you want to print ';' separated you can just use a join. Although if you want this as input to some other program it may be safer to look at something like Text::CSV and set semicolon as the sep_char: `print join( '; ', @$row );`

A couple of best practice notes: `use warnings` is preferred over `-w`, and use 3-argument open.

Putting it all together, not sure it is exactly what you want but it should help get you there.

```perl
#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', "/tmp/Gene.txt" or die $!;
my @content = (<$fh>);
close($fh);
my @myArray;

for my $row (@content) {
    my @columns = split '\t', $row;
    push @myArray, \@columns;
}

my $title_row = shift @myArray;

for my $row (@myArray) {
    my $sum = 0;
    for ( @$row ) {
        $sum++;
    }
    print "$row->[0] is $sum\n";
    print join( '; ', @$row );
}
```
Re^2: storing a file in 2d array by AnomalousMonk (Archbishop) on May 01, 2020 at 10:15 UTC
> ... count the number of elements in a row ...

To accumulate a sum of the number of elements in a referenced array, a better way IMHO would be `$sum += @$row;`

Give a man a fish: `<%-{-{-{-<`
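The `$sum += @$row;` idiom works because an array evaluated in scalar context yields its element count. A minimal sketch, using made-up rows rather than the OP's data:

```perl
use strict;
use warnings;

# Hypothetical 2D array: each row is an array reference.
my @rows = (
    [ 'GH1',  'Growth factor activity', 'Hormone activity' ],
    [ 'POMC', 'Hormone activity' ],
);

my $total = 0;
for my $row (@rows) {
    my $in_row = @$row;   # scalar assignment: array yields its element count
    $total += @$row;      # += imposes scalar context in the same way
    print "$row->[0] has $in_row elements\n";
}

print "total elements: $total\n";
```

Note that the count includes the name in the first column; subtract 1 (or use `$#$row`) to count only the remaining fields.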
Re^2: storing a file in 2d array by shabird (Sexton) on May 01, 2020 at 13:28 UTC
Works! And it does what I want, thank you :)
Re: storing a file in 2d array by johngg (Canon) on May 01, 2020 at 10:12 UTC
It might be that the fields are TAB separated but I see no evidence for that in the page source, just multiple SPACE characters. If that reflects reality then this code might do the trick.

```
johngg@shiraz:~/perl/Monks$ perl -Mstrict -Mwarnings -E '
open my $inFH, q{<}, \ <<__EOD__ or die $!;
ProteinName    MF1    MF2    MF3
GH1    Growth factor activity    Growth hormone receptor binding    Hormone activity
POMC    G protein-coupled receptor binding    Hormone activity    Signaling receptor binding
THRAP3    ATP binding Source    Nuclear receptor transcription coactivator activity    Phosphoprotein binding
__EOD__

chomp( my @lines = grep { ! m{^ProteinName} } <$inFH> );
close $inFH or die $!;

say join q{; }, split m{\s{2,}} for @lines;'
GH1; Growth factor activity; Growth hormone receptor binding; Hormone activity
POMC; G protein-coupled receptor binding; Hormone activity; Signaling receptor binding
THRAP3; ATP binding Source; Nuclear receptor transcription coactivator activity; Phosphoprotein binding
```

I hope this is helpful.

Update: I should have looked at the download link, they are TABs, so changing `m{\s{2,}}` to `m{\t}` would work.

Cheers,
JohnGG
Re: storing a file in 2d array by clueless newbie (Curate) on May 01, 2020 at 19:45 UTC
Let's make use of DBI and DBD::CSV.

```perl
use strict;
use warnings;
use Data::Dumper;
use DBI;
use 5.01800;

# Get a connection to the tables
my $dbh=DBI->connect ("dbi:CSV:", undef, undef, {
    f_dir        => "/Users/Desktop/",
    f_ext        => ".txt/r",
    csv_sep_char => "\t",
    RaiseError   => 1,
    }) or die "Cannot connect: $DBI::errstr";

eval {
    my $sth=$dbh->prepare(
        # We want a concatenation of ... from the file Gene.txt in /Users/Desktop
        qq{select concat(ProteinName,'; ',MF1,'; ',MF2,'; ',MF3) as whatever from Gene}
        );
    $sth->execute();
    # get the names of the fields returned by the select
    my $field_aref=$sth->{NAME};
    # and dump them ...
    #warn Data::Dumper->Dump([\$field_aref],[qw(field_aref)]),' ';

    # Get the selection one row at a time
    while (my $value_aref=$sth->fetchrow_arrayref()) {
        # dump the values from the select
        #warn Data::Dumper->Dump([\$value_aref],[qw(value_aref)]),' ';
        # For simplicity we will make a hash where the keys are the field names
        # and the values are the values of those fields
        my %_h;
        @_h{@$field_aref}=@$value_aref;
        # dump the hash to confirm all is what we expect
        #warn Data::Dumper->Dump([\%_h],[qw(*_h)]),' ';
        # since everything looks reasonable ...
        say $_h{whatever};
        };
    };
$@ and die "SQL database error: $@";
```

which (at least for me on Windows 10) yields

```
d:\scratch>perl 11116298.t
GH1; Growth factor activity; Growth hormone receptor binding; Hormone activity
POMC; G protein-coupled receptor binding; Hormone activity; Signaling receptor binding
THRAP3; ATP binding Source; Nuclear receptor transcription coactivator activity; Phosphoprotein binding
```
Re: storing a file in 2d array by clueless newbie (Curate) on May 03, 2020 at 16:15 UTC
shabird has a number of posts ("ead a file which has three columns and store the content in a hash", "Query of multi dimentional array", "storing a file in 2d array") that are somewhat similar. Hence "script.pl", which makes use of DBI, DBD::CSV, Getopt::Long::Descriptive, and Text::Table:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Carp;
use Data::Dumper;
$Data::Dumper::Deepcopy=1;
$Data::Dumper::Indent=1;
$Data::Dumper::Sortkeys=1;
use DBI;
use Getopt::Long::Descriptive('describe_options');
use Text::Table;
use 5.01800;

(our $opts,my $usage)=describe_options(
    $0.' %o <some-arg>',
    ,['directory|d=s' ,'the working directory </Users/Desktop>' ,{ default => '/Users/Desktop' }]
    ,['extension|e=s' ,'the file extension <.txt/r>'            ,{ default => '.txt/r' }]
    ,['separator|s=s' ,'the separating character<"\t">'         ,{ default => "\t" }]
    ,['sql=s',        ,'the sql'                                ,{ required => 1 }]
    ,[]
    ,['verbose|v'     ,'print extra stuff'                      ]
    ,['help'          ,'print usage message and exit'           ,{ shortcircuit => 1 }]
    );
warn Data::Dumper->Dump([\$opts],[qw(opts)]),' ' if ($opts->{verbose});

if ($opts->help()) { # MAN! MAN!
    say <<"_HELP_";
@{[$usage->text]}
_HELP_
    exit;
    }
else { # No MAN required.
    };

# Get a connection to the database tables
# ... as this is DBD::CSV a table is a file
my $dbh=DBI->connect ("dbi:CSV:", undef, undef, {
    f_dir        => $opts->directory(),
    f_ext        => $opts->extension(),
    csv_sep_char => $opts->separator(),
    RaiseError   => 1,
    }) or die "Cannot connect: $DBI::errstr";

eval {
    # Prepare and execute the sql
    my $sth=$dbh->prepare($opts->sql());
    $sth->execute();
    # get the names of the fields returned by the select
    my $field_aref=$sth->{NAME};
    my $table=Text::Table->new(\'|', map {( { title => $_ }, \'|') } @{$field_aref} ) if ($#{$field_aref});
    # and dump them ...
    warn Data::Dumper->Dump([\$field_aref],[qw(field_aref)]),' ' if ($opts->verbose());

    # Get the selection one row at a time
    while (my $value_aref=$sth->fetchrow_arrayref()) {
        # dump the values from the select
        warn Data::Dumper->Dump([\$value_aref],[qw(value_aref)]),' ' if ($opts->verbose());
        # For simplicity we will make a hash where the keys are the field names
        # and the values are the values of those fields
        my %_h;
        @_h{@$field_aref}=@$value_aref;
        # since everything looks reasonable ...
        if (defined &with_each_row) { # have a &with_each_row so ...
            with_each_row(\%_h);
            }
        elsif (@$field_aref > 1) { # select has multiple fields so ...
            # dump the hash to confirm all is what we expect
            warn Data::Dumper->Dump([\%_h],[qw(_h)]),' ' if ($opts->verbose());
            $table->load($value_aref) if (defined $table);
            }
        else { # only one field
            say $value_aref->[0];
            }
        }
    if (defined &in_summary) {
        in_summary();
        }
    elsif (defined $table) {
        print $table->title(),
              $table->rule('-','|'),
              $table->body(),
              $table->body_rule('-','-');
        };
    };
$@ and Carp::croak "SQL database error: $@";

__END__
```

Yes, I'm guilty of heresy - I confess I'm on Windows.

```
perl script.pl --help
script.pl [-desv] [long options...] <some-arg>
        -d STR --directory STR  the working directory </Users/Desktop>
        -e STR --extension STR  the file extension <.txt/r>
        -s STR --separator STR  the separating character<"\t">
        --sql STR               the sql

        -v --verbose            print extra stuff
        --help                  print usage message and exit
```

Let us assume that we have stored the data from the nodes as x<node number>.txt in the local directory, so we have "x11114659.txt", "x11115466.txt" and "x11116298.txt". So for "ead a file which has three columns and store the content in a hash":

```
.>perl script.pl -d . --sql "select regulation from x11115466"
up
down
NA
up
up
up
down
down
down
up
up
down
up
NA
NA
up
up
```

or as a nice table (when the select returns more than one field ... we get a table)

```
.>perl script.pl -d . --sql "select genename, regulation from x11115466"
|genename  |regulation|
|----------|----------|
|APOL4     |up        |
|CYP2C8    |down      |
|NAALADL2  |NA        |
|NANOS3    |up        |
|C20orf204 |up        |
|MIR429    |up        |
|MIR200A   |down      |
|MIR200B   |down      |
|CFL1P4    |down      |
|AC091607.1|up        |
|RPL19P20  |up        |
|SREK1IP1P1|down      |
|CCT5P2    |up        |
|CHTF8P1   |NA        |
|FAR1P1    |NA        |
|AC067940.1|up        |
|AL662791.1|up        |
-----------------------
```

For "Query of multi dimentional array":

```
..>perl script.pl -d . --sql "select concat(GeneID,' ',(Tp1+tp2+tp3)) from x11114659 order by GeneId"
ALA1 33
THR8 168
HUA4 476
ABA5 17
```

or again as a table

```
..> perl script.pl -d . --sql "select GeneID, (Tp1+tp2+tp3) as sum from x11114659 order by GeneId"
|GeneID|sum|
|------|---|
|ABA5  | 17|
|ALA1  | 33|
|HUA4  |476|
|THR8  |168|
------------
```

And finally for "storing a file in 2d array":

```
..>perl script.pl -d . --sql "select concat(ProteinName,'; ',MF1,'; ',MF2,'; ',MF3) as whatever from x11116298"
GH1; Growth factor activity; Growth hormone receptor binding; Hormone activity
POMC; G protein-coupled receptor binding; Hormone activity; Signaling receptor binding
THRAP3; ATP binding Source; Nuclear receptor transcription coactivator activity; Phosphoprotein binding
```

Now the count function doesn't seem to be behaving itself, so this "select regulation, count() from x11115466 group by regulation" throws an error. But there's a simple work-around. Supply a module that exports two subs, "with_each_row" and "in_summary": "with_each_row" is fed the reference to a hash of field names and their values, and "in_summary" is called once the select is exhausted.
```perl
package Example;
use strict;
use warnings;
use Exporter;
our @ISA=qw(Exporter);
our @EXPORT=qw(with_each_row in_summary);
use Data::Dumper;
use 5.01800;

my %_H;

# Called once per row: tally how often each regulation value occurs
sub with_each_row {
    my ($_HREF)=@_;
    warn Data::Dumper->Dump([\$_HREF],[qw(_HREF)]),' ' if ($main::opts->verbose());
    $_H{$_HREF->{regulation}}++;
    };

# Called once after the last row: print the tallies
sub in_summary {
    for my $key (sort keys %_H) {
        printf "%10s:%-10s\n",$key,$_H{$key};
        };
    };

1;
```

Fortunately for us, there is no need to change any code in script.pl ... we simply make use of the -M option and get

```
..>perl -MExample script.pl -d . --sql "select genename, regulation from x11115466"
        NA:3
      down:5
        up:9
```
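The same per-row tally and end-of-run summary pattern can be tried without DBI or script.pl. The sketch below is standalone and illustrative only: the sub names `tally_row` and `summary_lines` are hypothetical, and the four rows are just the first four from the table shown earlier, hard-coded as hashes.

```perl
use strict;
use warnings;

# First four rows of the genename/regulation table, as field => value hashes.
my @rows = (
    { genename => 'APOL4',    regulation => 'up'   },
    { genename => 'CYP2C8',   regulation => 'down' },
    { genename => 'NAALADL2', regulation => 'NA'   },
    { genename => 'NANOS3',   regulation => 'up'   },
);

my %count_for;

# Per-row callback: the equivalent of "group by regulation" with a count.
sub tally_row {
    my ($row) = @_;
    $count_for{ $row->{regulation} }++;
    return;
}

# Summary callback: one formatted line per distinct regulation value.
sub summary_lines {
    return map { sprintf '%-5s %d', $_, $count_for{$_} } sort keys %count_for;
}

tally_row($_) for @rows;
print "$_\n" for summary_lines();
```

A hash keyed on the grouping column is the usual Perl substitute for a SQL `group by` count, and it only needs one pass over the rows.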
Re: storing a file in 2d array by perlfan (Parson) on May 12, 2020 at 03:31 UTC
Like others, I have noticed the subject area in question; I just wanted to point out Bio::Perl.