Renyulb28 has asked for the wisdom of the Perl Monks concerning the following question:

First File T G T T T G T C A A ......... C G G C T C T G G C ......... . . . . . . . . . . . . . . . . . Second File T G C T T G T C G A G A G C G A A G G T A G T A G T T C A G T C G C . . . . . . . .

The first file contains all the sequence data. Each row represents one individual, and every two columns represent one SNP marker (two alleles per SNP). There are 2600 rows and 4100 columns in total (2050 SNP markers).

The second file contains all the 'minor' and 'major' alleles for all the markers (minor allele in column 1 and major allele in column 2). The minor allele represents the allele for each SNP marker with the lowest frequency of occurence out of the two possible alleles, and the major allele is the one with the highest frequency. There are 2050 rows in total, which matches the number of total SNP markers in the sample, and each line correlates with a pair of alleles in the first file. Essentially, each pair of alleles in the first file can be any permutation of the combinations between the matching row of minor and major alleles.

The individual is homozygous at a marker if the pair of alleles are the same (ie TT or AA). The individual is heterozygous at a marker if the pair of alleles are different (ie TG or GT). The individual has missing data at a marker if the pair of alleles is '0 0'.

Desired operation: Reading the first file one row at a time and two columns at a time (two alleles at a time), if the pair of alleles is homozygous for the matching minor allele (column 1) in the second file for that marker, then output a '0'. If the pair is heterozygous (a combination of the minor and major alleles), then output a '1'. If the pair of alleles is homozygous for the matching major allele (column 2), then output a '2'. If the pair is missing ('0 0'), then output a -1. The resulting file should have 2600 rows and 2050 columns, representing the total number of individuals and SNP markers, respectively.

  • Comment on Gurus, please point me in the right direction; complicated operations desired for DNA sequence formating
  • Download Code

Replies are listed 'Best First'.
Re: Gurus, please point me in the right direction; complicated operations desired for DNA sequence formating
by wind (Priest) on Mar 18, 2011 at 16:02 UTC

    Don't know if it's relevant to your project, but bioperl might be of interest to you

    Also, your description implies that there might be spaces in file1 between each character. However, you might've just added them to make your question more readable. Either way, you could use the following code to process each line two characters at a time.

    After that it's just doing your comparison and outputting to a new file from what you describe.

    while (<DATA>) { chomp; my @letters = m/(\w)/g; while (@letters) { my $x = shift @letters; my $y = shift @letters; print "($x, $y) "; } print "\n"; } __DATA__ T G T T T G T C A A C G G C T C T G G C
Re: Gurus, please point me in the right direction; complicated operations desired for DNA sequence formating
by educated_foo (Vicar) on Mar 18, 2011 at 16:15 UTC
    How much are you paying the nice people at Perlmonks to do your job for you?
Re: Gurus, please point me in the right direction; complicated operations desired for DNA sequence formating
by umasuresh (Hermit) on Mar 18, 2011 at 16:28 UTC
    What have you tried so far?
    Some ideas:
    1. grow a query array of 2050 SNP markers
    2. split each SNP marker into alleles
    3. grow a array with the second file
    4. if($allele =~/$minor_allele|$major_allel/), print desired number
Re: Gurus, please point me in the right direction; complicated operations desired for DNA sequence formating
by raybies (Chaplain) on Mar 18, 2011 at 16:04 UTC
    Hey Reny, I don't know the first thing about homozygous alleles (snarfle! you biologists have the whackiest names... must be great fun at parties), or whatever... so I'm probably no help, but have you examined BioPerl?
Re: Gurus, please point me in the right direction; complicated operations desired for DNA sequence formating
by choroba (Cardinal) on Mar 18, 2011 at 16:07 UTC
    Did I understand your specification correctly?
    use warnings; use strict; open my $IN,'<','894013-1' or die "Cannot open file1: $!\n"; open my $AL,'<','894013-2' or die "Cannot open file2: $!\n"; my @alleles = map {chomp; split / /} <$AL>; while (<$IN>) { chomp; my @individual = split / /; for my $half (0 .. @individual/2-1) { my %pair; my ($pos1,$pos2) = ($half*2,$half*2+1); undef @pair{@individual[$pos1,$pos2]}; my @pair = keys %pair; my ($minor,$major) = @alleles[$pos1,$pos2]; if (@pair == 1) { if ($pair[0] eq $minor) { print 0; } elsif ($pair[0] eq $major) { print 2; } else { print -1; } } else { if (exists $pair{$minor} and exists $pair{$major}) { print 1; } else { print -1; } } print ' '; } print "\n"; }
Re: Gurus, please point me in the right direction; complicated operations desired for DNA sequence formating
by CountZero (Bishop) on Mar 19, 2011 at 08:50 UTC
    Here is a possible solution:
    use Modern::Perl; # Process the second file only once my @alleles; { open my $secondfile, '<', './secondfile.txt' or die "Could not ope +n second file. $!"; while (<$secondfile>) { chomp; my ($minor, $major) = split; push @alleles, { "$minor$minor" => 0, "$minor$major" => 1, "$major$minor" => 1, "$major$major" => 2, '00' => -1, } } } #Process the first file open my $firstfile, '<', './firstfile.txt' or die "Could not open firs +t file. $!"; while (<$firstfile>) { chomp; say "Working on $_"; my @snps = split; my $index; my @results; while (@snps) { my $marker = shift(@snps) . shift(@snps); push @results, ($alleles[$index])->{$marker}; $index++; } say join ',', @results; }
    It first does some pre-processing on the second file, so you only have to read it once and not once for every line in the first file. It does so by setting up an Array of Hashes. The keys of the inner hashes are the various combinations of the minor and major alleles and the values are -1, 0, 1 or 2 according to the information you gave. It directly maps the combinations of minor/major to its values so you do not have to run a number of if tests later.

    By the way, when running this script on the data you provided it appears that the data for the second individual is quite wrong: it cannot have these alleles at these places, since they do not match what is in the second file.

    Finally, writing the results to a file is left as an exercise for the readers.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James