e9292 has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I am trying to merge common values between two files and print out the relevant information. I am using code I have used many times to merge files, and for some reason it isn't working. There seems to be a problem with referring to the first hash in the second part of the script, but it should work with the global variables declared. The format of the input files is as follows: damF.txt
501093  0  0  3162
2958    0  0  3163
1895    0  0  3164
1382    0  0  3165
2869    0  0  3166
wholepedigree_F.txt
3162  2159  501093  0  0  0  0
3163  2960  2958    0  0  0  0
3164  2269  1895    0  0  0  0
3165  1393  1382    0  0  0  0
3166  2881  2869    0  0  0  0
I want the output to match column 4 of damF.txt and wholepedigree_F.txt and print columns 3 and 5 of whole_pedigree and columns 1, 2, 3 of damF.txt. This is the code:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
use vars qw($ID $sire $dam $F $AHC $FB $FA $hash1 %hash1 $info1 $damID $damF $damAHC $prog $hash2 %hash2 $info2);

open (FILE1, "<wholepedigree_F.txt") or die "Couldn't open wholepedigree_F.txt\n";
my $N = 1;
while (<FILE1>){
    chomp (my $line=$_);
    next if 1..$N==$.;
    my ($ID, $sire, $dam, $F, $FB, $AHC, $FA) = split (/\t/, $line);
    if ($ID){
        $hash1 -> {$ID} -> {info1} = "$F\t$AHC";
    }
}
close FILE1;

open (FILE2, "<damF.txt") or die "Can't open damF.txt \n";
open (Output, ">output.txt") or die "Can't Open output file";
print Output "\n";
while (<FILE2>) {
    chomp (my $line=$_);
    next if 1..$N==$.;
    my ($damID, $damF, $damAHC, $prog) = split (/\t/, $line);
    if ($prog){
        $hash2 -> {$prog} -> {info2} = "$damID\t$damF\t$damAHC";
    }
    if ($prog && ($hash1->{$ID})) {
        $info1 = $hash1 -> {$ID} -> {info1};
        $info2 = $hash2 -> {$prog} -> {info2};
        print "$ID\t$info1\t$info2\n";
    }
}
close Output;
close FILE2;
print "Done";
Please note that the input files have about 500,000 entries in each column. Please help! Thanks in advance!

Replies are listed 'Best First'.
Re: Trouble iterating through a hash
by 1nickt (Canon) on Mar 09, 2017 at 03:18 UTC

    Hi e9292,

    I rewrote your code without changing it too much, so that it (a) works as specified and (b) gets rid of lots of unnecessary cruft, primarily your use of use vars (deprecated) and global variables (discouraged).

    However please note that you appear to be working with CSV-format files (tab-delimited), and so you would be much safer to read them in with a CSV-parsing module such as Text::CSV_XS which can handle empty fields, blank lines, special characters inside fields, etc.

    Also, if you have 500,000 entries, I would suggest storing the output data in a simple RDBMS such as SQLite. This will make working with the data much easier after you collate it.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use autodie;

    my %whole_pedigree;

    open( my $in, '<', 'wholepedigree_F.txt');
    while ( my $line = <$in> ) {
        chomp $line;
        my ( $ID, $sire, $dam, $F, $FB, $AHC, $FA ) = split ( /\s+/, $line );
        if ( $ID ) {
            $whole_pedigree{ $ID } = "$F\t$AHC";
        }
    }
    close $in;

    open( my $look_for, '<', 'damF.txt' );
    open( my $output, '>', 'output.txt' );

    while ( my $line = <$look_for> ) {
        chomp $line;
        my ( $damID, $damF, $damAHC, $prog ) = split (/\s+/, $line);
        if ( $prog && $whole_pedigree{ $prog } ) {
            print $output join( "\t", $prog, $whole_pedigree{ $prog }, "$damID\t$damF\t$damAHC" ), "\n";
        }
    }
    close $look_for;
    close $output;
    print "Done\n";
    cat output.txt
    3162  0  0  501093  0  0
    3163  0  0  2958    0  0
    3164  0  0  1895    0  0
    3165  0  0  1382    0  0
    3166  0  0  2869    0  0

    Hope this helps!


    The way forward always starts with a minimal test.
      Nice and clean (++).

      The only hesitation I had was that the error checking in OP's "open" statements was not retained.

      open (...) or die "Couldn't open... $!";
      Please add that to your post, so it can be truly exemplary for noobs.

              ...it is unhealthy to remain near things that are in the process of blowing up.     man page for WARP, by Larry Wall

        Error checking is actually taken care of by "use autodie;". I personally don't like to do that because often a die message can explain to the user the context of what is happening in addition to the actual file that couldn't be opened, e.g. "can't open configuration file: $file_name". This sort of thing becomes more important if there is a GUI involved as opposed to a command line.

        Also, other "tweaking" of the die message can be done. There is a difference between die "xyzzy" and die "xyzzy\n": this controls whether or not the user gets the Perl line number in the message. Sometimes it can confuse users if too much info is given, with terminology that they don't understand.

        ###### with autodie ######
        open XXX, '<', "XXX";
        # Can't open 'XXX' for reading: 'No such file or directory' at C:\Projects_Perl\testing\die_messasges.pl line 7

        ###### without autodie ######
        # trailing \n in the die message suppresses the line number.
        open XXX, '<', "XXX" or die "couldn't open Config file, XXX!\n";
        # couldn't open Config file, XXX!

        open XXX, '<', "XXX" or die "couldn't open XX file!";
        # couldn't open XX file! at C:\Projects_Perl\testing\die_messasges.pl line 5.

        open XXX, '<', "XXX" or die "couldn't open config file XX!, $!";
        # couldn't open config file XX!, No such file or directory at C:\Projects_Perl\testing\die_messasges.pl line 6.
        update:
        I didn't show every possibility.
        Some points:
        1) autodie is pretty cool, especially for short quick scripts. But there is no context information; in a complex app, it may not be apparent to the user what this file is about.
        2) Add or omit the trailing "\n" on a die message to control reporting of the Perl line number.
        3) Often $! is just confusing noise, depending upon the app.
        I use all of the above options in one situation or another. I can't say: "always do it way #X".
Re: Trouble iterating through a hash
by huck (Prior) on Mar 09, 2017 at 05:17 UTC

    I agree with 1nickt that Text::CSV_XS could be easier, but you may say, "this isn't a CSV file, it has TABS". Well, Text::CSV_XS isn't about CSV files so much as delimited files, and will take a tab delimiter just as easily.

    To show you how easy it is to use Text::CSV_XS, I rewrote your code using it. I fixed a few things; you get yelled at here if you don't use 3-arg opens or lexically scoped filehandles ($file1, $file2). I fixed the funny next if 1..$N==$.; to just plain next if ($.==1); and, like 1nickt, I assumed you meant $prog in the second loop rather than $ID. Another thing I did was follow your code rather than your specs: you said "print columns 3 and 5 of whole_pedigree", but your code uses columns 4 and 6 instead.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use diagnostics;

    my $hash1;
    my $hash2;

    use Text::CSV_XS;
    my $sep = "\t";
    my $csv = Text::CSV_XS->new ({ sep_char => $sep });

    open (my $file1, "<", "wholepedigree_F.txt") or die "Couldn't open wholepedigree_F.txt \n";
    while (my $row = $csv->getline ($file1)) {
        next if ($.==1);
        my ($ID, $sire, $dam, $F, $FB, $AHC, $FA) = @$row;
        if ($ID){
            $hash1 -> {$ID} -> {info1} = "$F\t$AHC";
        }
    }
    close $file1;

    my $Output = \*STDOUT;
    open (my $file2, "<", "damF.txt") or die "Couldn't open damF.txt\n";
    while (my $row = $csv->getline ($file2)) {
        next if ($.==1);
        my ($damID, $damF, $damAHC, $prog) = @$row;
        if ($prog){
            $hash2 -> {$prog} -> {info2} = "$damID\t$damF\t$damAHC";
        }
        if ($prog && ($hash1->{$prog})) {
            my $info1 = $hash1 -> {$prog} -> {info1};
            my $info2 = $hash2 -> {$prog} -> {info2};
            print "$prog\t$info1\t$info2\n";
        }
    }
    close $file2;
    print "Done";
    There are many parts of 1nickt's code that made sense, like skipping $hash2 and the whole {info1} subtree, but I felt that if you saw Text::CSV_XS used in a way closer to your original code you might appreciate it more. It came with ActiveState Perl, so I suspect it is in core and doesn't need to be installed either.

    I also didn't use the Text::CSV_XS output routines, so you can compare to your program more easily.

      Here I use the print method of Text::CSV_XS, eliminate $hash2 totally, and eliminate the {info1} level, instead just assigning an array ref to $hash1->{$ID}.

      #!/usr/bin/perl
      use warnings;
      use strict;
      use diagnostics;

      my $hash1;

      use Text::CSV_XS;
      my $sep = "\t";
      my $csv = Text::CSV_XS->new ({ sep_char => $sep });

      open (my $file1, "<", "wholepedigree_F.txt") or die "Couldn't open wholepedigree_F.txt \n";
      while (my $row = $csv->getline ($file1)) {
          next if ($.==1);
          my ($ID, $sire, $dam, $F, $FB, $AHC, $FA) = @$row;
          if ($ID){
              $hash1 -> {$ID} = [$F, $AHC];
          }
      }
      close $file1;

      my $Output = \*STDOUT;
      open (my $file2, "<", "damF.txt") or die "Couldn't open damF.txt\n";
      while (my $row = $csv->getline ($file2)) {
          next if ($.==1);
          my ($damID, $damF, $damAHC, $prog) = @$row;
          if ($prog && ($hash1->{$prog})) {
              my $status = $csv->print($Output, [$prog, @{$hash1->{$prog}}, $damID, $damF, $damAHC]);
              print "\n";
          }
      }
      close $file2;
      print "Done";

Re: Trouble iterating through a hash
by Marshall (Canon) on Mar 09, 2017 at 04:20 UTC
    I don't really understand what in FILE2 is supposed to match with what in FILE1. All non-zero data in FILE1 appears in some column of FILE2. You do not show a line in FILE2 that would not "match", and you do not provide a sample of "good" output.

    I started re-writing your code, but could not proceed to a final solution due to the above. However this code below may help you.

    Some comments to your code:

    • "use vars" does not do what you think and there is absolutely no need for that here.
    • I only see the need for a single hash of FILE1 - a second hash is not necessary.
    • I have no idea what you think this $N stuff does.
    • There is no need for a multi-dimensional hash here and your syntax to address a simple hash is wrong (don't need the arrows).
    #!/usr/bin/perl
    use warnings;
    use strict;
    use diagnostics;

    my %hash;

    #$damID, $damF, $damAHC, $prog
    my $damF = <<END;
    501093  0  0  3162
    2958    0  0  3163
    1895    0  0  3164
    1382    0  0  3165
    2869    0  0  3166
    END

    #$ID, $sire, $dam, $F, $FB, $AHC, $FA
    my $wholepedigree_F = <<END;
    3162  2159  501093  0  0  0  0
    3163  2960  2958    0  0  0  0
    3164  2269  1895    0  0  0  0
    3165  1393  1382    0  0  0  0
    3166  2881  2869    0  0  0  0
    END

    open (FILE1, '<', \$wholepedigree_F) or die "Couldn't open wholepedigree_F.txt";
    while (my $line = <FILE1>) {
        chomp $line;
        next unless $line =~ /\S/;   #skip blank lines
        my ($ID, $F, $AHC) = (split (/\s+/, $line))[0,3,5];
        $hash{$ID} = "$F\t$AHC";
    }
    close FILE1;

    open (FILE2, '<', \$damF) or die "Can't open damF.txt";
    #open (OUT, ">output.txt") or die "Can't Open output file";
    #print OUT "\n";
    while (my $line = <FILE2>) {
        chomp $line;
        next unless $line =~ /\S/;   #skip blank lines
        my ($damID, $damF, $damAHC, $prog) = split (/\s+/, $line);
        if ($hash{$prog}) {
            my $info1 = $hash{$prog};
            my $info2 = "$damID\t$damF\t$damAHC";
            print "$prog\t$info1\t$info2\n";
        }
    }
    #close OUT;
    close FILE2;
    print "Done";

      I have no idea about what you think this $N stuff does?

      I had a clue, er a guess, so I tested it. next if 1..$N==$.; skips the first line, kinda the hard way.

      I agree with your other points, except that as written $info2 = $hash2 -> {$prog} -> {info2}; doesn't hurt either; the -> is assumed between subscripts, so it does the same thing as $info2 = $hash2 -> {$prog}{info2}; anyway.

        I agree with you "skips the first line, kinda the hard way.".

        I would advise the OP to use:

        my $discard_first_line = <FILE1>;
        before starting the while loop.
        The more cryptic albeit simple: <FILE1>; would work also. I prefer the extraneous "my" variable because it is very "cheap" and it is clear what it does. I would personally write a comment like <FILE1>; #discard first line anyway.

        I agree with your other points, except that as written $info2 = $hash2 -> {$prog} -> {info2}; doesnt hurt either, it does that when the -> is assumed in $info2 = $hash2 -> {$prog}{info2}; anyway
        My point is that there is no need at all for a 2nd dimension on the first hash, and no need for the 2nd hash at all!
Re: Trouble iterating through a hash -- oneliner explained
by Discipulus (Canon) on Mar 09, 2017 at 09:34 UTC
    Hello e9292, and welcome to the monastery!

    First, some sparse suggestions about your code style: use vars qw($ID $sire ... here is not useful: use vars is deprecated and it really means our (see vars and use vars). You just need my ($ID, $sire, ... to declare lexically scoped variables.

    Second, your opening open (FILE1, "<wholepedigree_F.txt") or .. is oldish: always use the three-arg open, like open my $filehandle, '<', $path or ..., and use a lexical filehandle instead of the bareword form. Also, the low-precedence or is there to avoid the necessity of the parens.

    Now: you got good solutions and many wise suggestions, but if you say I want the output to match column 4 of damF.txt and wholepedigree_F.txt and print columns 3 and 5 of whole_pedigree and columns 1,2,3 of damF.txt

    I'd answer with a one-liner (pay attention to the MSWin32 double quotes) (PS: if I understand your needs correctly as they are stated..)

    perl -F"\s+" -lane "$sec?(push $h{$F[0]},@F[2,4]):($h{$F[3]}=[@F[0..2]]);$sec++ if eof;END{print map{qq($_ = @{$h{$_}}\n)}sort keys %h}" mergedhash01.txt mergedhash02.txt

    3162 = 501093 0 0 501093 0
    3163 = 2958 0 0 2958 0
    3164 = 1895 0 0 1895 0
    3165 = 1382 0 0 1382 0
    3166 = 2869 0 0 2869 0

    See perlrun for the -lane command-line options and for -F too. The END block is executed after the implicit while loop created by the -n switch.

    In brief, -a is autosplit mode and populates the special variable @F ('F' for fields; see perlvar). I have specified, with the -F option, that I want the current line split on \s+ instead of the default single space.

    Then -l takes care of line endings for us (no need to chomp), and -n puts an implicit while loop around all the code, which will be executed for every line of the files passed as arguments.

    The $sec++ if eof is tricky: it initializes the variable $sec (for 'second') and sets it to 1 when, while processing the first file, perl meets the end of file (see eof). This switch-like scalar tells us which file we are in (the variable is set to 2 at the end of the second file, but by then we do not need it anymore).

    Having this switch lets us know when we are processing the second file: in fact the core of the one-liner is an IF ? THEN : ELSE ternary operator based on the value of the $sec variable. If it is false (we are processing the first file) we populate a hash entry $h{$F[3]} with an anonymous array containing fields 1, 2 and 3 ( @F[0..2] ).

    If $sec is true we are processing the second file, so we push fields 3 and 5 ( @F[2,4] ) onto the anonymous array created earlier.

    If you add -MO=Deparse before the other options you'll see the oneliner expanded a bit:

    perl -MO=Deparse -F"\s+" -lane "$sec?(push $h{$F[0]},@F[2,4]):($h{$F[3]}=[@F[0..2]]);$sec++ if eof;END{print map{qq($_ = @{$h{$_}}\n)}sort keys %h}" mergedhash01.txt mergedhash02.txt

    BEGIN { $/ = "\n"; $\ = "\n"; }
    LINE: while (defined($_ = <ARGV>)) {
        chomp $_;
        our(@F) = split(/\s+/, $_, 0);
        $sec ? push($h{$F[0]}, @F[2, 4]) : ($h{$F[3]} = [@F[0..2]]);
        ++$sec if eof;
        sub END {
            print map({"$_ = @{$h{$_};}\n";} sort(keys %h));
        }
        ;
    }
    -e syntax OK

    HtH and have fun!

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.