Re: Trouble iterating through a hash
by 1nickt (Canon) on Mar 09, 2017 at 03:18 UTC
Hi e9292,
I rewrote your code without changing it too much, so that it (a) works as specified and (b) gets rid of a lot of unnecessary cruft, primarily your use of use vars (deprecated) and global variables (discouraged).
However, please note that you appear to be working with delimited files (tab-separated in this case), so you would be much safer reading them with a CSV-parsing module such as Text::CSV_XS, which can handle empty fields, blank lines, special characters inside fields, etc.
Also, if you have 500,000 entries, I would suggest storing the output data in a simple RDBMS such as SQLite. This will make working with the data much easier after you collate it.
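To make the SQLite suggestion concrete, here is a minimal sketch of collating the output rows into a database via DBI (this assumes the DBI and DBD::SQLite modules are installed; the table name, column names, and sample row are made up for illustration, not taken from your data files):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory database for illustration; use a filename such as
# 'pedigree.db' instead of ':memory:' to persist the data.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

# Hypothetical schema mirroring the collated output columns.
$dbh->do(q{
    CREATE TABLE pedigree (
        prog   INTEGER PRIMARY KEY,
        F      REAL,
        AHC    REAL,
        damID  INTEGER,
        damF   REAL,
        damAHC REAL
    )
});

my $sth = $dbh->prepare(
    'INSERT INTO pedigree (prog, F, AHC, damID, damF, damAHC)
     VALUES (?, ?, ?, ?, ?, ?)'
);
$sth->execute( 3162, 0, 0, 501093, 0, 0 );    # one collated row, as in the output below

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM pedigree');
print "$count row(s) stored\n";
```

With 500,000 entries you would call the prepared `$sth->execute(...)` once per collated row inside the second loop (wrapping the inserts in a transaction for speed), and then query the table afterwards instead of re-parsing text files.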
#!/usr/bin/perl
use warnings; use strict;
use autodie;
my %whole_pedigree;
open( my $in, '<', 'wholepedigree_F.txt');
while ( my $line = <$in> ) {
chomp $line;
my ( $ID, $sire, $dam, $F, $FB, $AHC, $FA ) = split ( /\s+/, $line );
if ( $ID ) {
$whole_pedigree{ $ID } = "$F\t$AHC";
}
}
close $in;
open( my $look_for, '<', 'damF.txt' );
open( my $output, '>', 'output.txt' );
while ( my $line = <$look_for> ) {
chomp $line;
my ( $damID, $damF, $damAHC, $prog ) = split (/\s+/, $line);
if ( $prog && $whole_pedigree{ $prog } ) {
print $output join( "\t", $prog, $whole_pedigree{ $prog }, "$damID\t$damF\t$damAHC" ), "\n";
}
}
close $look_for;
close $output;
print "Done\n";
cat output.txt
3162 0 0 501093 0 0
3163 0 0 2958 0 0
3164 0 0 1895 0 0
3165 0 0 1382 0 0
3166 0 0 2869 0 0
Hope this helps!
The way forward always starts with a minimal test.
open (...) or die "Couldn't open... $!";
Please add that to your post, so it can be truly exemplary for noobs.
...it is unhealthy to remain near things that are in the process of blowing up. man page for WARP, by Larry Wall
###### with autodie ######
open XXX, '<', "XXX";
# Can't open 'XXX' for reading: 'No such file or directory' at C:\Projects_Perl\testing\die_messasges.pl line 7
###### without autodie ###
# trailing \n in the die message suppresses the line number.
open XXX, '<', "XXX" or die "couldn't open Config file, XXX!\n";
# couldn't open Config file, XXX!
open XXX, '<', "XXX" or die "couldn't open XX file!";
# couldn't open XX file! at C:\Projects_Perl\testing\die_messasges.pl line 5.
open XXX, '<', "XXX" or die "couldn't open config file XX!, $!";
# couldn't open config file XX!, No such file or directory at C:\Projects_Perl\testing\die_messasges.pl line 6.
update:
I didn't show every possibility. Some points: 1) autodie is pretty cool, especially for short quick scripts, but there is no context information; in a complex app it may not be apparent to the user what this file is about. 2) Add or omit the trailing "\n" in a die message to control whether the Perl line number is reported. 3) Often $! is just confusing noise, depending upon the app. I use all of the above options in one situation or another. I can't say: "always do it way #X".
Re: Trouble iterating through a hash
by huck (Prior) on Mar 09, 2017 at 05:17 UTC
I agree with 1nickt that Text::CSV_XS could be easier, but you say, "this isn't a csv file, it has TABS." Well, Text::CSV_XS isn't about csv files so much as delimited files, and it will take a tab delimiter just as easily.
To show you how easy it is to use Text::CSV_XS, I rewrote your code using it. I fixed a few things: you get yelled at here if you don't use 3-arg opens or lexically scoped filehandles ($file1, $file2). I fixed the funny next if 1..$N==$.; to just plain next if ($.==1);, and like 1nickt I assumed you meant $prog in the second loop rather than $ID. Another thing I did was follow your code rather than your specs: you say "and print columns 3 and 5 of whole_pedigree", but your code uses columns 4 and 6 instead.
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
my $hash1;
my $hash2;
use Text::CSV_XS;
my $sep="\t";
my $csv = Text::CSV_XS->new ({ sep_char => $sep });
open (my $file1, "<","wholepedigree_F.txt") or die "Couldn't open wholepedigree_F.txt \n";
while (my $row = $csv->getline ($file1)) {
next if ($.==1);
my ($ID, $sire, $dam, $F, $FB, $AHC, $FA) = @$row;
if ($ID){
$hash1 -> {$ID} -> {info1} = "$F\t$AHC";
}
}
close $file1;
my $Output=\*STDOUT;
open (my $file2, "<","damF.txt") or die "Couldn't open damF.txt\n";
while (my $row = $csv->getline ($file2)) {
next if ($.==1);
my ($damID, $damF, $damAHC, $prog) = @$row;
if ($prog){
$hash2 -> {$prog} -> {info2} = "$damID\t$damF\t$damAHC";
}
if ($prog && ($hash1->{$prog})) {
my $info1 = $hash1 -> {$prog} -> {info1};
my $info2 = $hash2 -> {$prog} -> {info2};
print "$prog\t$info1\t$info2\n";
}
}
close $file2;
print "Done";
There are many parts of 1nickt's code that made sense, like skipping $hash2 and the whole {info1} subtree, but I felt that if you saw Text::CSV_XS used in a way closer to your original code you might appreciate it more. It came with ActiveState Perl; note that Text::CSV_XS is not actually in core Perl, but it ships with many distributions and otherwise installs easily from CPAN.
I also didn't use the Text::CSV_XS output routines, so you can compare it to your program more easily.
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
my $hash1;
use Text::CSV_XS;
my $sep="\t";
my $csv = Text::CSV_XS->new ({ sep_char => $sep });
open (my $file1, "<","wholepedigree_F.txt") or die "Couldn't open wholepedigree_F.txt \n";
while (my $row = $csv->getline ($file1)) {
next if ($.==1);
my ($ID, $sire, $dam, $F, $FB, $AHC, $FA) = @$row;
if ($ID){
$hash1 -> {$ID}= [$F,$AHC];
}
}
close $file1;
my $Output=\*STDOUT;
open (my $file2, "<","damF.txt") or die "Couldn't open damF.txt\n";
while (my $row = $csv->getline ($file2)) {
next if ($.==1);
my ($damID, $damF, $damAHC, $prog) = @$row;
if ($prog && ($hash1->{$prog})) {
my $status = $csv->print($Output, [$prog,@{$hash1->{$prog}},$damID,$damF,$damAHC]);
print "\n";
}
}
close $file2;
print "Done";
Re: Trouble iterating through a hash
by Marshall (Canon) on Mar 09, 2017 at 04:20 UTC
I don't really understand what in FILE2 is supposed to match with what in FILE1. All of the non-zero data in FILE1 appears in some column of FILE2. You do not show a line in FILE2 that would not "match", and you do not provide a sample "good" output.
I started re-writing your code, but could not proceed to a final solution due to the above. However this code below may help you.
Some comments to your code:
- "use vars" does not do what you think, and there is absolutely no need for it here.
- I only see the need for a single hash of FILE1; a second hash is not necessary.
- I have no idea what you think this $N stuff does.
- There is no need for a multi-dimensional hash here, and your syntax to address a simple hash is wrong (you don't need the arrows).
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
my %hash;
#$damID, $damF, $damAHC, $prog
my $damF = <<END;
501093 0 0 3162
2958 0 0 3163
1895 0 0 3164
1382 0 0 3165
2869 0 0 3166
END
#$ID, $sire, $dam, $F, $FB, $AHC, $FA
my $wholepedigree_F = <<END;
3162 2159 501093 0 0 0 0
3163 2960 2958 0 0 0 0
3164 2269 1895 0 0 0 0
3165 1393 1382 0 0 0 0
3166 2881 2869 0 0 0 0
END
open (FILE1, '<', \$wholepedigree_F) or die "Couldn't open wholepedigree_F.txt";
while (my $line = <FILE1>)
{
chomp $line;
next unless $line =~ /\S/; #skip blank lines
my ($ID, $F, $AHC) = (split (/\s+/, $line))[0,3,5];
$hash{$ID} = "$F\t$AHC";
}
close FILE1;
open (FILE2, '<', \$damF) or die "Can't open damF.txt";
#open (OUT, ">output.txt") or die "Can't Open output file";
#print OUT "\n";
while (my $line = <FILE2>)
{
chomp $line;
next unless $line =~ /\S/; #skip blank lines
my ($damID, $damF, $damAHC, $prog) = split (/\s+/, $line);
if ($hash{$prog})
{
my $info1 = $hash{$prog};
my $info2 = "$damID\t$damF\t$damAHC";
print "$prog\t$info1\t$info2\n";
}
}
#close OUT;
close FILE2;
print "Done";
I have no idea about what you think this $N stuff does?
I had a clue, er, a guess, so I tested it. next if 1..$N==$.; skips the first line, kinda the hard way.
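For anyone curious, here is a small self-contained sketch (using an in-memory filehandle so that $. is set, with made-up sample lines) confirming that with $N set to 1 the flip-flop next if 1..$N==$.; skips only the first line, just like next if ($.==1);:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $N = 1;
my @kept;

# Read from an in-memory "file" so the line counter $. is populated.
open my $fh, '<', \"header\nrow1\nrow2\n" or die "open: $!";
while ( my $line = <$fh> ) {
    # Scalar '..' with a constant left operand compares it to $., so the
    # flip-flop is true from line 1 until the line where $N == $. holds;
    # with $N == 1 that is only line 1.
    next if 1 .. $N == $.;
    chomp $line;
    push @kept, $line;
}
close $fh;

print join( ',', @kept ), "\n";    # prints: row1,row2
```

The magic is that a constant operand of the scalar range operator is treated as a test against $., which is why the idiom works at all, and also why it is so cryptic compared to a plain line-number test.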
I agree with your other points, except that as written $info2 = $hash2 -> {$prog} -> {info2}; doesn't hurt either: it does the same thing as $info2 = $hash2 -> {$prog}{info2};, where the -> between subscripts is assumed anyway.
A simpler way to skip the first line is to read and discard it:
my $discard_first_line = <FILE1>;
before starting the while loop. The more cryptic, albeit simple, <FILE1>; would work also. I prefer the extraneous "my" variable because it is very "cheap" and it is clear what it does. I would personally write a comment like <FILE1>; # discard first line anyway.
I agree with your other points, except that as written $info2 = $hash2 -> {$prog} -> {info2}; doesn't hurt either: it does the same thing as $info2 = $hash2 -> {$prog}{info2};, where the -> between subscripts is assumed anyway.
My point is that there is no need at all for a 2nd dimension on the first hash, and no need for the 2nd hash at all!
Re: Trouble iterating through a hash -- oneliner explained
by Discipulus (Canon) on Mar 09, 2017 at 09:34 UTC
Hello e9292, and welcome to the monastery!
First, some sparse suggestions about your code style: use vars qw($ID $sire... is not useful here: use vars is deprecated, and it really means our (see vars and use vars). You just need my ($ID, $sire, ... to declare lexically scoped variables.
Second, your open (FILE1, "<wholepedigree_F.txt") or .. is rather old-style: always use the three-arg open, like:
open my $filehandle, '<', $path or ... and use a lexical filehandle instead of the bareword form. Also, the low-precedence or is there to avoid the need for parens.
Now: you got good solutions and many wise suggestions, but since you say I want the output to match column 4 of damF.txt and wholepedigree_F.txt and print columns 3 and 5 of whole_pedigree and columns 1,2,3 of damF.txt,
I'd answer with a oneliner (pay attention to the MSWin32 double quotes) (PS: that is, if I understand your needs correctly as stated..)
perl -F"\s+" -lane "$sec?(push $h{$F[0]},@F[2,4]):($h{$F[3]}=[@F[0..2]]);$sec++ if eof;END{print map{qq($_ = @{$h{$_}}\n)}sort keys %h}" mergedhash01.txt mergedhash02.txt
3162 = 501093 0 0 501093 0
3163 = 2958 0 0 2958 0
3164 = 1895 0 0 1895 0
3165 = 1382 0 0 1382 0
3166 = 2869 0 0 2869 0
See perlrun for the -lane command-line options, and for -F too. The END block is executed after the implicit while loop created by the -n switch.
In brief, -a is autosplit mode and populates the special variable @F ('F' for fields; see perlvar). I have specified, with the -F option, that I want the current line split on \s+ instead of a single space, which is the default.
Then -l takes care of line endings for us (no need to chomp), and -n puts an implicit while loop around all the code, so it is executed for every line of the files passed as arguments.
The $sec++ if eof is tricky: it initializes and sets to 1 the variable $sec (for "second"); i.e., while processing the first file, when perl meets the end of file (see eof) it sets this switch-like scalar to 1 (well, the variable is set to 2 at the end of the second file, but by then we do not need it anymore).
Having this switch lets us know when we are processing the second file: in fact the core of the oneliner is an IF ? THEN : ELSE ternary operator based on the value of the $sec variable: if it is false (we are processing the first file) we populate a hash entry $h{$F[3]} with an anonymous array containing fields 1, 2 and 3 ( @F[0..2] ).
If $sec is true we are processing the second file, so we push fields 3 and 5 ( @F[2,4] ) onto the already-created anonymous array. (Note: push on a plain reference like push $h{$F[0]},... relies on the experimental autoderef feature, which was removed in perl 5.24; on newer perls write push @{$h{$F[0]}},... instead.)
If you add -MO=Deparse before the other options you'll see the oneliner expanded a bit:
perl -MO=Deparse -F"\s+" -lane "$sec?(push $h{$F[0]},@F[2,4]):($h{$F[3]}=[@F[0..2]]);$sec++ if eof;END{print map{qq($_ = @{$h{$_}}\n)}sort keys %h}" mergedhash01.txt mergedhash02.txt
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
our(@F) = split(/\s+/, $_, 0);
$sec ? push($h{$F[0]}, @F[2, 4]) : ($h{$F[3]} = [@F[0..2]]);
++$sec if eof;
sub END {
print map({"$_ = @{$h{$_};}\n";} sort(keys %h));
}
;
}
-e syntax OK
HtH and have fun!
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.