Re^3: Can't access data stored in Hash

Hello again, corcra,

I’m glad to have been of help.

If I understand you correctly, you now want to read the data headings, say:

Field:    0            5            6            7            8
File 1: ['CHROM', ... 'SAMPLE_1A', 'SAMPLE_1B', 'SAMPLE_2A', 'SAMPLE_2B']
File 2: ['CHROM', ... 'SAMPLE_1',  'SAMPLE_2',  'SAMPLE_3']

and have the script deduce that File 1 data in fields 5 and 6 should each be compared to File 2 field 5, File 1 data in fields 7 and 8 should each be compared to File 2 field 6, and so on.

That makes the logic more complex, but I don’t see why it should be difficult to do line-by-line. Most of the added logic comes before the big while loop:

...
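my $header1 = <$in1>;    # header line of File 1
<$in2>;                  # read and discard the header line of File 2
my @heads1  = split /\s*,\s*/, $header1;
my $index   = 5;         # first sample column in File 1
my %index_map;           # File 1 field index => File 2 field index

for (@heads1)
{
    # SAMPLE_n (with any suffix) in File 1 maps to field n + 4 in File 2
    $index_map{$index++} = $1 + 4 if /SAMPLE_(\d+)/;
}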

print $header1;

while (my $line1 = <$in1>)
{
    my @fields1 = get_fields($line1);

    defined(my $line2 = <$in2>)
        or die "Data missing in file '$file2': $!";

    my @fields2 = get_fields($line2);
    my @out     = @fields1;

    for my $i (5 .. $#fields1)
    {
        if ($fields1[$i] ne 'REF')
        {
            my $j    = $index_map{$i};
            $out[$i] = $fields2[$j] if exists $fields2[$j] &&
                                              $fields2[$j] ne 'REF';
        }
    }
    
    @out = map { "'$_'" } @out;
    print '[', join(', ', @out), "]\n";
}
...

The main addition is a hash (%index_map) to keep track of the correspondences between the fields in File 1 and the matching fields in File 2.
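As a rough, standalone illustration of what ends up in %index_map for the example headers above, here is a minimal sketch. The sub name build_index_map and the placeholder names 'COL1' .. 'COL4' (standing in for the elided columns) are made up for this illustration; it also assumes, as in the snippet, that the sample columns start at field 5 and that File 2 holds SAMPLE_n at field n + 4.

```perl
use strict;
use warnings;

# Build a map from File 1 field index to File 2 field index.
# Assumption: sample columns start at field 5 in File 1, and
# File 2 holds SAMPLE_n at field n + 4.
sub build_index_map
{
    my @heads = @_;
    my $index = 5;
    my %index_map;

    for (@heads)
    {
        # Only headers that look like SAMPLE_<number> advance the index
        $index_map{$index++} = $1 + 4 if /SAMPLE_(\d+)/;
    }

    return %index_map;
}

# 'COL1' .. 'COL4' are hypothetical placeholders for the elided columns
my @heads1 = ('CHROM', 'COL1', 'COL2', 'COL3', 'COL4',
              'SAMPLE_1A', 'SAMPLE_1B', 'SAMPLE_2A', 'SAMPLE_2B');

my %index_map = build_index_map(@heads1);

print "File 1 field $_ -> File 2 field $index_map{$_}\n"
    for sort { $a <=> $b } keys %index_map;
```

With these headers, File 1 fields 5 and 6 both map to File 2 field 5, and fields 7 and 8 both map to File 2 field 6.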

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^4: Can't access data stored in Hash - help!
by corcra (Initiate) on Aug 10, 2014 at 14:46 UTC

Hi Athanasius, Hope you are well and thank you for all of your help so far. Unfortunately I still have not been able to get the code working as I want it to and I was hoping you could have one further look at the code I am running. The format of my input files has changed slightly but I don't think this will affect the code too much.

File 1
CHROM    POS    REF    ALT    VARIANT_LIST    209T-D  459T-D    644T-D    94T-D    99T1-D    99T2-D    99T3-D    99T4-D    99T5-D
['MT', '1010', 'G', 'A', 'A', 'REF', 'A', 'A', 'REF', 'A', 'A', 'A', 'A', 'A']
['MT', '2962', 'C', 'T', 'T', 'REF', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T']
...

File 2 
CHROM     POS     REF     ALT     VARIANT_LIST    209H-D    459H-D    644H-D    94H-D    99H-D
['MT', '1010', 'G', 'A', 'A', 'REF', 'REF', 'REF', 'REF', 'REF']
['MT', '2962', 'C', 'T', 'T', 'REF', 'REF', 'T', 'REF', 'T']
....

 
CHROM    POS    REF    ALT    VARIANT_LIST    209T-D  459T-D    644T-D    94T-D    99T1-D    99T2-D    99T3-D    99T4-D    99T5-D
['MT', '1010', 'G', 'A', 'A', 'REF', 'A', 'A', 'REF', 'A', 'A', 'A', 'A', 'A']
['MT', '2962', 'C', 'T', 'T', 'REF', 'T', 'REF', 'T', 'REF', 'REF', 'REF', 'REF', 'REF']

#!/usr/local/bin/perl 
use strict;
use warnings;

my $file1 = shift;
my $file2 = shift;

open(my $in1, '<', $file1)
    or die "Cannot open file '$file1' for reading: $!";

open(my $in2, '<', $file2)
    or die "Cannot open file '$file2' for reading: $!";

my $header1 = <$in1>;
<$in2>;
my @heads1  = split "\t", $header1;
my $index = 5;
my %index_map;

for (@heads1)
{
    $index_map{$index++} = $1 + 4 if m/^(\d+)/;
}

print $header1;
while (my $line1 = <$in1>)
{
    my @fields1 = get_fields($line1);

    defined(my $line2 = <$in2>)
        or die "Data missing in file '$file2': $!";

    my @fields2 = get_fields($line2);
    my @out     = @fields1;

    for my $i (5 .. $#fields1)
    {
        my $j = $index_map{$i};

        if ($fields1[$i] ne 'REF')
        {
            $out[$i] = $fields2[$j] if exists $fields2[$j] &&
                                              $fields2[$j] ne 'REF';
        }
    }

    @out = map { "'$_'" } @out;
    print '[', join(', ', @out), "]\n";
}

close $in2
    or die "Cannot close file '$file2': $!";

close $in1
    or die "Cannot close file '$file1': $!";

sub get_fields
{
    my ($line) = @_;

    chomp $line;
    my @fields = split "\t", $line;
    s{ ^ \[? ' }{}x for @fields;
    s{ ' \]? $ }{}x for @fields;

    return @fields;
}

CHROM    POS    REF    ALT    VARIANT_LIST    209T-D  459T-D    644T-D    94T-D    99T1-D    99T2-D    99T3-D    99T4-D    99T5-D
['MT', '1010', 'G', 'A', 'A', 'REF', 'A', 'A', 'REF', 'A', 'A', 'A', 'A', 'A']
['MT', '2962', 'C', 'T', 'T', 'REF', 'T', 'T', 'T', 'T', 'T', 'T', 'T', 'T']
...

Re^5: Can't access data stored in Hash - help!

by Athanasius (Archbishop) on Aug 11, 2014 at 13:58 UTC

Hello again corcra,

Looks like a number of things have changed in your specification. First, the input file headers (but not the data) have lost their square brackets and quotation marks. Second, the criteria for generating the output have changed. From the original post:

I am trying to write a code which prints out file 1 again but if the sample value is not 'REF', looks up file 2. If the corresponding file 2 value is 'REF' then print the original value appearing in file 1. If the corresponding value in file 1 is not 'REF' then print the value we find in file 2.

But now you say:

the same letter at the same position in File 1 and File 2 should output REF, and if REF is found at that position in file 2, just output letter from File 1
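To make the difference between the two descriptions concrete, here is a minimal sketch of each as I read them. The sub names old_rule and new_rule are invented for this illustration. Neither description says what should happen when the two files hold different non-REF letters under the new criteria, or when File 1 has 'REF' and File 2 has a letter; the sketch simply assumes the File 1 value is kept in both cases.

```perl
use strict;
use warnings;

# Original criteria: keep the File 1 value if either value is 'REF';
# otherwise print the value found in File 2.
sub old_rule
{
    my ($v1, $v2) = @_;
    return $v1 if $v1 eq 'REF' || $v2 eq 'REF';
    return $v2;
}

# Revised criteria: identical values give 'REF'; a 'REF' in File 2
# leaves the File 1 value unchanged.
# ASSUMPTION: every other combination also keeps the File 1 value.
sub new_rule
{
    my ($v1, $v2) = @_;
    return 'REF' if $v1 eq $v2;
    return $v1;
}

for my $pair (['A', 'REF'], ['T', 'T'], ['REF', 'REF'])
{
    my ($v1, $v2) = @$pair;
    printf "File 1: %-3s  File 2: %-3s  old: %-3s  new: %-3s\n",
        $v1, $v2, old_rule($v1, $v2), new_rule($v1, $v2);
}
```

The pair ('T', 'T') is where the two readings part ways: the original criteria print 'T', the revised ones print 'REF'.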

Third, the numbering of input file column headers no longer begins at 1. Will these numbers always increase in value from left to right? Probably safer to assume not, and to formulate a more general solution. In the following, I have simplified the input file formats by removing the square brackets and commas from the data as well as the headers: