First File
T G T T T G T C A A .........
C G G C T C T G G C .........
. . . . . . . . . . . . . . . . .

Second File
T G
C T
T G
T C
G A
G A
G C
G A
A G
G T
A G
T A
G T
T C
A G
T C
G C
. .
. .
. .
. .

The first file contains all the sequence data. Each row represents one individual, and every two columns represent one SNP marker (two alleles per SNP). There are 2600 rows and 4100 columns in total (2050 SNP markers).

The second file lists the 'minor' and 'major' allele for every marker (minor allele in column 1, major allele in column 2). For each SNP marker, the minor allele is the less frequent of the two possible alleles, and the major allele is the more frequent one. There are 2050 rows in total, one per SNP marker in the sample, and each row corresponds to one pair of allele columns in the first file, in the same order. Each pair of alleles in the first file can therefore be any combination, in either order, of the minor and major alleles on the matching row.

The individual is homozygous at a marker if the two alleles in the pair are the same (e.g. TT or AA). The individual is heterozygous at a marker if the two alleles are different (e.g. TG or GT). The individual has missing data at a marker if the pair of alleles is '0 0'.
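To make the three cases concrete, here is a minimal sketch in Python (not Perl, purely for illustration). The function name `classify_pair` and the sample row are made up for this example, and it assumes the alleles are separated by whitespace.

```python
def classify_pair(a, b):
    """Label one allele pair as 'missing', 'homozygous' or 'heterozygous'."""
    # Missing data is written as the pair '0 0'.
    if a == "0" and b == "0":
        return "missing"
    # Same allele twice is homozygous, two different alleles is heterozygous.
    return "homozygous" if a == b else "heterozygous"


# Hypothetical row with four markers (eight allele columns).
row = "T T T G 0 0 A A".split()

# Walk the row two columns at a time: (col 1, col 2), (col 3, col 4), ...
pairs = list(zip(row[0::2], row[1::2]))
labels = [classify_pair(a, b) for a, b in pairs]
print(labels)
```

This only distinguishes the three cases; it does not yet look at which allele is the minor or the major one.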

Desired operation: read the first file one row at a time and two columns (one pair of alleles) at a time, comparing each pair against the matching row of the second file. If the pair is homozygous for the minor allele (column 1), output '0'. If the pair is heterozygous (one minor and one major allele, in either order), output '1'. If the pair is homozygous for the major allele (column 2), output '2'. If the pair is missing ('0 0'), output '-1'. The resulting file should have 2600 rows and 2050 columns, one row per individual and one column per SNP marker.
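The operation described above could be sketched roughly as follows. This is written in Python as an illustration only, under several assumptions that are not stated in the post: both files are whitespace-delimited, the output is space-separated, and a pair that is neither '0 0' nor made up of the marker's minor/major alleles (for example a half-missing '0 T') is treated as an error. The names `load_markers`, `encode_row` and `encode_file` are hypothetical, and the demo uses tiny in-memory files rather than the real 2600 x 4100 data.

```python
import io


def load_markers(handle):
    """Read the second file: one (minor, major) pair per non-empty line."""
    markers = []
    for line in handle:
        fields = line.split()
        if fields:
            markers.append((fields[0], fields[1]))
    return markers


def encode_row(alleles, markers):
    """Turn one individual's allele columns into 0 / 1 / 2 / -1 codes."""
    if len(alleles) != 2 * len(markers):
        raise ValueError("expected two allele columns per marker")
    codes = []
    for i, (minor, major) in enumerate(markers):
        a, b = alleles[2 * i], alleles[2 * i + 1]
        if a == "0" and b == "0":
            codes.append(-1)            # missing data
        elif a == b == minor:
            codes.append(0)             # homozygous for the minor allele
        elif a == b == major:
            codes.append(2)             # homozygous for the major allele
        elif {a, b} == {minor, major}:
            codes.append(1)             # heterozygous, either order
        else:
            # Assumption: anything else is bad input rather than a valid code.
            raise ValueError(f"unexpected pair {a} {b} at marker {i + 1}")
    return codes


def encode_file(geno_handle, markers, out_handle):
    """Process the first file one row at a time, writing one coded row each."""
    for line in geno_handle:
        alleles = line.split()
        if alleles:
            codes = encode_row(alleles, markers)
            out_handle.write(" ".join(str(c) for c in codes) + "\n")


# Tiny made-up example: 3 markers, 2 individuals.
markers = load_markers(io.StringIO("T G\nC T\nG A\n"))
out = io.StringIO()
encode_file(io.StringIO("T G T T A A\nT T C C 0 0\n"), markers, out)
print(out.getvalue(), end="")
```

Only the marker list is held in memory; the genotype file is streamed line by line, so the same structure should carry over to the full-size files by swapping the `io.StringIO` objects for real file handles.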


In reply to Gurus, please point me in the right direction; complicated operations desired for DNA sequence formating by Renyulb28
