in reply to Can't access data stored in Hash - help!

Hello corcra, and welcome to the Monastery!

You begin by reading the whole of file 2, and saving all the data you will need later when processing file 1. This is inefficient, and may be problematic if file 2 is large. A better strategy is to read the two input files together, one line at a time:

    #! perl
    use strict;
    use warnings;

    my $file1 = 'File1.txt';
    my $file2 = 'File2.txt';

    open(my $in1, '<', $file1)
        or die "Cannot open file '$file1' for reading: $!";
    open(my $in2, '<', $file2)
        or die "Cannot open file '$file2' for reading: $!";

    print scalar <$in1>;    # copy the File 1 header line to the output
    <$in2>;                 # skip the File 2 header line

    while (my $line1 = <$in1>)
    {
        my @fields1 = get_fields($line1);

        defined(my $line2 = <$in2>)
            or die "Data missing in file '$file2': $!";

        my @fields2 = get_fields($line2);
        my @out     = @fields1;

        for my $i (5 .. $#fields1)
        {
            if ($fields1[$i] ne 'REF' &&
                $i <= $#fields2      &&
                $fields2[$i] ne 'REF')
            {
                $out[$i] = $fields2[$i];
            }
        }

        @out = map { "'$_'" } @out;
        print '[', join(', ', @out), "]\n";
    }

    close $in2 or die "Cannot close file '$file2': $!";
    close $in1 or die "Cannot close file '$file1': $!";

    # Split a line on commas and strip the quotes and list brackets
    sub get_fields
    {
        my ($line) = @_;
        chomp $line;
        my @fields = split /\s*,\s*/, $line;
        s{ ^ \[? ' }{}x for @fields;
        s{ ' \]? $ }{}x for @fields;
        return @fields;
    }

Output:

    13:39 >perl 959_SoPW.pl
    ['CHROM', 'POS', 'REF', 'ALT', 'LIST', 'SAMPLE_1A', 'SAMPLE_2A', 'SAMPLE_3A']
    ['M', '16', 'T', 'C', 'C', 'REF', 'C', 'REF']
    ['M', '381', 'T', 'A', 'A', 'A', 'REF', 'REF']
    ['M', '529', 'A', 'G', 'G', 'REF', 'G', 'REF']
    13:39 >

Note: In the above code I’ve assumed that the data files are formatted as you’ve shown. But if (as I half suspect) they are actually formatted as proper CSV files, then you will be better served reading them with one of the modules designed for this purpose, such as Text::CSV_XS.

Hope that helps,

Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re^2: Can't access data stored in Hash - help!
by corcra (Initiate) on Aug 05, 2014 at 10:16 UTC

    Athanasius, this was a great help! Thank you. In fact, I expect that file 2 will be quite large so this method should work better.

    I have an additional question related to this code, but I don't know whether it should be posted here or as a new post. I will chance asking here and remove it if there is a problem. For Sample_2A, Sample_2B and Sample_2C in file 1, I want each of these columns to be compared to Sample_2 in file 2, i.e. columns with matching numbers should be compared. I am not sure of the best way to do this, and with the code you suggested it might be difficult, since the fields are broken up line by line.

      Hello again, corcra,

      I’m glad to have been of help.

      If I understand you correctly, you now want to read the data headings, say:

      Field:       0              5            6            7            8
      File 1:  ['CHROM', ... 'SAMPLE_1A', 'SAMPLE_1B', 'SAMPLE_2A', 'SAMPLE_2B']
      File 2:  ['CHROM', ... 'SAMPLE_1',  'SAMPLE_2',  'SAMPLE_3']

      and have the script deduce that File 1 data in fields 5 and 6 should each be compared to File 2 field 5, File 1 data in fields 7 and 8 should each be compared to File 2 field 6, and so on.

      That makes the logic more complex, but I don’t see why you think this will be difficult to do line by line. Most of the added logic comes before the big while loop.

      The main addition is a hash (%index_map) to keep track of the correspondences between the fields in File 1 and the matching fields in File 2.
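
      The code that originally accompanied this reply is not reproduced here. As a minimal sketch of the idea only, and not necessarily the code as posted, the map could be built from the two header rows as below. The helper name build_index_map is hypothetical, and the sketch assumes that File 1 headings have the form SAMPLE_<number><letter>, that File 2 headings have the form SAMPLE_<number>, and that the elided headings match those in the earlier output.

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the original post): map each
# File 1 column index to the File 2 column index with the same sample number.
sub build_index_map
{
    my ($heads1, $heads2) = @_;

    # Sample number => column index in File 2, e.g. 2 => 6 for 'SAMPLE_2'
    my %col2;
    for my $j (0 .. $#$heads2)
    {
        $col2{$1} = $j if $heads2->[$j] =~ /^SAMPLE_(\d+)$/;
    }

    # File 1 column index => File 2 column index
    my %index_map;
    for my $i (0 .. $#$heads1)
    {
        next unless $heads1->[$i] =~ /^SAMPLE_(\d+)[A-Z]$/;
        $index_map{$i} = $col2{$1} if exists $col2{$1};
    }

    return %index_map;
}

# Headings as in the example above; fields 1 to 4 are assumed
my @heads1 = qw(CHROM POS REF ALT LIST SAMPLE_1A SAMPLE_1B SAMPLE_2A SAMPLE_2B);
my @heads2 = qw(CHROM POS REF ALT LIST SAMPLE_1 SAMPLE_2 SAMPLE_3);

my %index_map = build_index_map(\@heads1, \@heads2);

print "File 1 field $_ -> File 2 field $index_map{$_}\n"
    for sort { $a <=> $b } keys %index_map;
```

      Inside the while loop, the File 2 field to compare against File 1 field $i would then be $fields2[ $index_map{$i} ] rather than $fields2[$i]. Looking the columns up by name, instead of doing arithmetic on the sample number, keeps the map correct even if File 2 has gaps in its sample numbering.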

      Hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        Hi Athanasius, Hope you are well and thank you for all of your help so far. Unfortunately I still have not been able to get the code working as I want it to and I was hoping you could have one further look at the code I am running. The format of my input files has changed slightly but I don't think this will affect the code too much.

        File 1
        CHROM POS REF ALT VARIANT_LIST 209T-D 459T-D 644T-D 94T-D 99T1-D 99T2-D 99T3-D 99T4-D 99T5-D
        ['MT', '1010', 'G', 'A', 'A', 'REF', 'A', 'A', 'REF', 'A', 'A', 'A', 'A', 'A']
        ['MT', '2962', 'C', 'T', 'T', 'REF', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T']
        ...

        File 2
        CHROM POS REF ALT VARIANT_LIST 209H-D 459H-D 644H-D 94H-D 99H-D
        ['MT', '1010', 'G', 'A', 'A', 'REF', 'REF', 'REF', 'REF', 'REF']
        ['MT', '2962', 'C', 'T', 'T', 'REF', 'REF', 'T', 'REF', 'T']
        ....

        Again, I want to compare 99T1, 99T2, 99T3, 99T4 and 99T5 of File 1 with 99H of File 2. The output I want looks like:

        CHROM POS REF ALT VARIANT_LIST 209T-D 459T-D 644T-D 94T-D 99T1-D 99T2-D 99T3-D 99T4-D 99T5-D
        ['MT', '1010', 'G', 'A', 'A', 'REF', 'A', 'A', 'REF', 'A', 'A', 'A', 'A', 'A']
        ['MT', '2962', 'C', 'T', 'T', 'REF', 'T', 'REF', 'T', 'REF', 'REF', 'REF', 'REF', 'REF']

        (because the same letter at the same position in File 1 and File 2 should output REF, and if REF is found at that position in File 2, the letter from File 1 is output unchanged). The code I am currently running is:

        #!/usr/local/bin/perl
        use strict;
        use warnings;

        my $file1 = shift;
        my $file2 = shift;

        open(my $in1, '<', $file1)
            or die "Cannot open file '$file1' for reading: $!";
        open(my $in2, '<', $file2)
            or die "Cannot open file '$file2' for reading: $!";

        my $header1 = <$in1>;
        <$in2>;

        my @heads1 = split "\t", $header1;
        my $index  = 5;
        my %index_map;

        for (@heads1)
        {
            $index_map{$index++} = $1 + 4 if m/^(\d+)/;
        }

        print $header1;

        while (my $line1 = <$in1>)
        {
            my @fields1 = get_fields($line1);

            defined(my $line2 = <$in2>)
                or die "Data missing in file '$file2': $!";

            my @fields2 = get_fields($line2);
            my @out     = @fields1;

            for my $i (5 .. $#fields1)
            {
                my $j = $index_map{$i};

                if ($fields1[$i] ne 'REF')
                {
                    $out[$i] = $fields2[$j]
                        if exists $fields2[$j] && $fields2[$j] ne 'REF';
                }
            }

            @out = map { "'$_'" } @out;
            print '[', join(', ', @out), "]\n";
        }

        close $in2 or die "Cannot close file '$file2': $!";
        close $in1 or die "Cannot close file '$file1': $!";

        sub get_fields
        {
            my ($line) = @_;
            chomp $line;
            my @fields = split "\t", $line;
            s{ ^ \[? ' }{}x for @fields;
            s{ ' \]? $ }{}x for @fields;
            return @fields;
        }

        which should work but the output is

        CHROM POS REF ALT VARIANT_LIST 209T-D 459T-D 644T-D 94T-D 99T1-D 99T2-D 99T3-D 99T4-D 99T5-D
        ['MT', '1010', 'G', 'A', 'A', 'REF', 'A', 'A', 'REF', 'A', 'A', 'A', 'A', 'A']
        ['MT', '2962', 'C', 'T', 'T', 'REF', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T']
        ...

        I really do not know what is going wrong? If you have any suggestions please let me know. Thanks!
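
        One possible reading of the script above, offered as a sketch rather than a confirmed diagnosis: with the new headings, `$index_map{$index++} = $1 + 4` turns 209T-D into File 2 index 213, which does not exist, so no field is ever replaced. The data rows as posted also look comma-separated, while get_fields now splits on tabs; and the assignment copies the File 2 letter instead of writing REF when the two letters agree. The sketch below assumes samples are paired by the leading digits of their headings (99T3-D with 99H-D). The names build_map and merge_row are hypothetical, and the rows are given as already-split fields to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical helper: pair File 1 columns with File 2 columns by the
# leading digits of the sample heading (an assumption about the naming).
sub build_map
{
    my ($heads1, $heads2) = @_;

    my %col2;    # leading digits => column index in File 2
    for my $j (0 .. $#$heads2)
    {
        $col2{$1} = $j if $heads2->[$j] =~ /^(\d+)/;
    }

    my %map;     # File 1 column index => File 2 column index
    for my $i (0 .. $#$heads1)
    {
        next unless $heads1->[$i] =~ /^(\d+)/;
        $map{$i} = $col2{$1} if exists $col2{$1};
    }

    return \%map;
}

# Rule as stated in the post: the same letter in both files gives 'REF';
# in every other case the File 1 value is kept.
sub merge_row
{
    my ($map, $f1, $f2) = @_;
    my @out = @$f1;

    for my $i (keys %$map)
    {
        my $j = $map->{$i};
        next if $i > $#$f1 || $j > $#$f2;    # short row: nothing to compare
        next if $f1->[$i] eq 'REF';
        $out[$i] = 'REF' if $f1->[$i] eq $f2->[$j];
    }

    return @out;
}

my @heads1 = qw(CHROM POS REF ALT VARIANT_LIST 209T-D 459T-D 644T-D 94T-D
                99T1-D 99T2-D 99T3-D 99T4-D 99T5-D);
my @heads2 = qw(CHROM POS REF ALT VARIANT_LIST 209H-D 459H-D 644H-D 94H-D 99H-D);
my $map    = build_map(\@heads1, \@heads2);

# Second data row of each file from the post
my @row1 = qw(MT 2962 C T T REF T T T T T T T T);
my @row2 = qw(MT 2962 C T T REF REF T REF T);

my @out = map { "'$_'" } merge_row($map, \@row1, \@row2);
print '[', join(', ', @out), "]\n";
# prints ['MT', '2962', 'C', 'T', 'T', 'REF', 'T', 'REF', 'T', 'REF', 'REF', 'REF', 'REF', 'REF']
```

        In a full script, the heading lists would come from splitting the two header lines, and each data row from a get_fields that splits on whichever separator the files really use.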