I need to process very large (>10 GB) tab-delimited pileup files.

I am reading the file and, if a line satisfies some criteria, I am concatenating the letter in column 4 (t3) to a string. I have a few issues to address.

1. Sometimes the line before an indel line (* in the 3rd column, t2) has extra strings in the read column (column 9, t8) that represent an insertion (+1A or +4caaa) or a deletion (-4gtct, -4tctt or -13GGCGCGCGTGCGC). How do I get rid of them? I tried preprocessing the file with awk to remove them by brute force:

awk '/$9/gsub("[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVN]+", "", $9)'

But if the character immediately after such a string (e.g. -4gtct) is not a . or , but a nucleotide letter (atgcATGC), that letter is removed as well. I need to look specifically for a + or - followed by a number (digits) and then delete exactly that many letters after it (for -4gtct: find -4, then remove -4 and gtct).
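The length-aware removal asked for in point 1 could be sketched like this in Python (not awk). This is only an illustration: the function name `strip_indels` is made up here, and it assumes the read column follows the usual pileup convention in which `^` is followed by one mapping-quality character, which may itself be `+` or `-` and so must not be mistaken for an indel.

```python
import re

# Matches only the "+N" / "-N" prefix; the N letters after it are skipped by position.
_INDEL_PREFIX = re.compile(r"[+-](\d+)")


def strip_indels(reads: str) -> str:
    """Return the read column with every +N<seq> / -N<seq> annotation removed.

    Exactly N characters are dropped after the count, so a base letter that
    happens to follow the indel sequence is kept.
    """
    out = []
    pos = 0
    while pos < len(reads):
        if reads[pos] == "^":
            # Assumption: '^' plus the next character (mapping quality) belong together.
            out.append(reads[pos:pos + 2])
            pos += 2
            continue
        match = _INDEL_PREFIX.match(reads, pos)
        if match:
            # Skip the sign, the digits, and the N letters announced by the digits.
            pos = match.end() + int(match.group(1))
        else:
            out.append(reads[pos])
            pos += 1
    return "".join(out)
```

For example, `strip_indels(",-4gtcta,")` keeps the trailing `a` and returns `",a,"`, which is the case the brute-force regular expression gets wrong.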

2. Whenever there is an indel line (* in the 3rd column, t2), I need to check column 4 (t3), which is one of */*, */+, */-, +/+, +/-, -/- or -/+, for an insertion (+) or a deletion (-) on the numerator side (i.e. +/- means I need to treat it as an insertion only). If it is */- or */+, I need to check whether the denominator (- or +) is supported by more than 5% of the reads (i.e. column 12 / column 8). For a deletion (-), column 4 (t3) of that many following lines should then be converted to a dash (-); for an insertion (+), the additional nucleotides should be added to the string.
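The rules in point 2 could be sketched roughly as follows in Python. The names `classify_indel` and `build_sequence` are hypothetical, "column 12 / column 8" is read as 1-based (so `fields[11] / fields[7]`), and the unspecified "some criteria" filter for ordinary lines is left out, so every non-indel line contributes its column 4 letter.

```python
def classify_indel(fields, min_fraction=0.05):
    """Return ("ins", seq) or ("del", seq) for an indel row, or None to ignore it."""
    first, _, second = fields[3].partition("/")
    if first.startswith(("+", "-")):
        # An indel on the numerator side is taken as is (+/- counts as an insertion).
        allele = first
    elif first == "*" and second.startswith(("+", "-")):
        # */+ or */-: require support from more than min_fraction of the reads.
        depth = int(fields[7])      # column 8 (t7)
        support = int(fields[11])   # column 12 (t11)
        if depth == 0 or support / depth <= min_fraction:
            return None
        allele = second
    else:
        return None
    return ("ins" if allele[0] == "+" else "del"), allele[1:]


def build_sequence(lines, min_fraction=0.05):
    """Concatenate column 4 letters, applying accepted indels from the * rows."""
    parts = []
    pending_dashes = 0  # number of following positions still to be written as '-'
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if fields[2] == "*":
            call = classify_indel(fields, min_fraction)
            if call is not None:
                kind, seq = call
                if kind == "ins":
                    parts.append(seq)          # inserted nucleotides join the string
                else:
                    pending_dashes = len(seq)  # deleted positions follow this row
            continue
        if pending_dashes:
            parts.append("-")
            pending_dashes -= 1
        else:
            parts.append(fields[3])
    return "".join(parts)
```

Because `build_sequence` only iterates over its input once, an open file handle can be passed directly (`with open(path) as fh: build_sequence(fh)`), so a file larger than 10 GB is read line by line and never held in memory. The dash handling assumes the positions are continuous, as they are in the real file.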

Please help with suggestions.

Representative sample pileup file. The real file is continuous from 1 to n in column 2; the lines here are discontinuous because I pasted only a few representative lines from here and there.

#t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
2L 1 G A 48 48 26 7 AAaaAAA ACCCD?@
2L 2 A A 48 0 26 7 ..,,... AACC@BC
2L 3 T K 18 18 26 7 .Ggg... ACCCCCC
2L 4 C C 57 0 37 16 .,..,,.,a.,,,,,^F, CBCCBBC@3CCBBD=?
2L 5 C C 54 0 37 17 .,..,,.,a.,,,,,,^F, CDCCDACB9CCA@B?@D
2L 6 c C 78 0 37 17 .,..,,.,,.,,,,,,, CBCCBAC>9CBB@BA?A
2L 7 c C 45 0 37 19 .,..,,.,,.+1A,,,,t,,^F,^Ft CDCCDAC>;CBCBD*?DAA
2L 7 * */* 55 0 37 19 * A 18 1 0 0 0
2L 8 a A 162 0 21 45 ,-4gtct,-4gtct,,....,,,,,,.,.,,,,.,,,,,,,,...,.,.,,,,.,,, CCCCA<;=/CCACDBCBCCCCDCCCCBBCCBDCCCCCDCDDCCAB
2L 8 * */* 173 0 21 45 * -gtct 43 2 0 0 0
2L 9 g G 87 0 37 20 .,..,,.,,.,,,,,,,,,^F, CDCCCAC?BCBBCD?<DBBB
2L 10 a A 87 0 37 20 .,..,,.,,.,,,,,,,,,, CBCDABC>BCACBB:AC?4C
2L 11 g G 90 0 37 21 .,..,,.,,.,,,,,,,,,,^F. CDCCCAC@@CDDCC<:DBABC
2L 12 c C 117 0 14 31 ...,,*.,,-2gt.,,.,,.,..........,,.. ;B6CCBBCBCACDCCCB@DCDCC8CCCABCC
2L 12 * */-gt 21 21 14 31 * -gt 30 1 0 0 0
2L 13 g G 90 0 37 21 .,..,,.,,.,,,,,,,,,,. CDCCCAC?:CDB5?<>BB>BC
2L 14 t T 45 0 37 6 ..,,,.-1T CCCCCC
2L 14 * -t/-t 56 59 37 6 -t * 1 5 0 0 0
2L 15 t T 93 0 37 22 .,..,,.,,.,,,G,,,,,,.^F. CDCDDDC<@CBB5B@ABC<ACC
2L 16 G G 178 0 36 50 .$,$,$,$,,,,,.,,,,..,.,,...,,,,.,,...,....,...,,,$,,,., BCCCCCCCCCCCCBDBCBCCB>8CCCCBCCBDBC8>ACCBB6CC?CCC6C
2L 17 A A 59 0 36 45 ,$,$,,,C,,,c..,.,,C..,,,,.,,...,....,..C,,,,,$., CCCCCBCCCBB8CACC>@;CCCCDCCCDAC-5ABCAA2CCCC?1A
2L 18 w T 54 54 37 9 tTTTTttTT CCDCCDCCC
2L 18 * A/+A 65 279 37 9 A * 8 1 0 0 0
2L 19 g G 54 0 37 9 ....,,... >>@@CC>BB
2L 20 c C 57 0 37 10 ....,,...^F, CACCCC?CBB
2L 21 t T 57 0 37 10 ....,,..., BCDCCCCCB?
2L 650 t T 48 0 34 7 ..,.,,+4caaa,+4caaa C?CCA=?
2L 650 * */* 9 0 34 7 * +CAAA 5 2 0 0 0
2L 654 A A 48 0 34 7 .$.$,.,+1g,, DBC?CCA
2L 654 * */* 19 0 34 7 * +G 6 1 0 0 0
2L 2332 g G 60 0 14 33 .,...,,.-13GGCGCGCGTGCGC.,,A,,A,,A,..........aa.. DCBBBBDBCCCCBCCCDCBDCBCCCACCBABCC
2L 2332 * */* 61 0 14 33 * -ggcgcgcgtgcgc 32 1 0 0 0
2L 3334 a A 163 0 15 49 ..$,..,,,t,.T,,,..,,,,T,-7attattt,,-7attattt,,,,,,....,,......,.,,.. BBCA>BCCCC:CCCC>ACCCCBCCCCBCCCCCCDCCCCCCCCBCBCDCC
2L 3334 * */-attattt 27 27 15 49 * -attattt 47 2 0 0 0
2L 3928 c C 32 0 0 11 ,,-4tctt,,.-4TCTT...-4TCTT.^!.^!, CCC8CCCCBCA
2L 3928 * */-tctt 157 157 0 11 * -tctt 8 3 0 0 0

In reply to how to process very large tab limited pileup file by gudluck
