gudluck has asked for the wisdom of the Perl Monks concerning the following question:
I need to process very large (>10GB) tab delimited Pileup files.
I am reading the file and, if a line satisfies some criteria, I am concatenating the column 4 letters to a string. I have a few issues to address.
1. Sometimes the line before an indel line (* in the 3rd column, t2) has extra strings in the read column (column 9, t8) representing an insertion (+1A or +4caaa) or a deletion (-4gtct, -4tctt or -13GGCGCGCGTGCGC). How do I get rid of them? I tried preprocessing the file with awk to remove them by brute force: awk '/$9/gsub("[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVN]+", "", $9)' but if the character immediately after such a string (e.g. -4gtct) is a nucleotide (atgcATGC) rather than . or , it gets removed as well. I need to look specifically for a + or - followed by a number (digits) and then delete exactly that many letters after it (e.g. find -4, then remove -4 and gtct).
2. Whenever there is an indel line (* in the 3rd column, t2), I need to check column 4 (t3), which can be */*, */+, */-, +/+, +/-, -/- or -/+, and decide from the numerator side whether it is an insertion (+) or a deletion (-) (i.e. +/- should be treated as an insertion only). If it is */- or */+, I need to check whether the denominator (- or +) is represented by more than 5% of the reads (i.e. column12/column8). Then, for a deletion (-), column 4 (t3) of that many following lines should be converted to a dash (-); for an insertion (+), the additional nucleotides should be added to the string.
Please help with suggestions.
Representative sample Pileup file (the real file is continuous from 1 to n in column 2; the lines here are discontinuous because I pasted a few representative lines from here and there):
#t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
2L 1 G A 48 48 26 7 AAaaAAA ACCCD?@
2L 2 A A 48 0 26 7 ..,,... AACC@BC
2L 3 T K 18 18 26 7 .Ggg... ACCCCCC
2L 4 C C 57 0 37 16 .,..,,.,a.,,,,,^F, CBCCBBC@3CCBBD=?
2L 5 C C 54 0 37 17 .,..,,.,a.,,,,,,^F, CDCCDACB9CCA@B?@D
2L 6 c C 78 0 37 17 .,..,,.,,.,,,,,,, CBCCBAC>9CBB@BA?A
2L 7 c C 45 0 37 19 .,..,,.,,.+1A,,,,t,,^F,^Ft CDCCDAC>;CBCBD*?DAA
2L 7 * */* 55 0 37 19 * A 18 1 0 0 0
2L 8 a A 162 0 21 45 ,-4gtct,-4gtct,,....,,,,,,.,.,,,,.,,,,,,,,...,.,.,,,,.,,, CCCCA<;=/CCACDBCBCCCCDCCCCBBCCBDCCCCCDCDDCCAB
2L 8 * */* 173 0 21 45 * -gtct 43 2 0 0 0
2L 9 g G 87 0 37 20 .,..,,.,,.,,,,,,,,,^F, CDCCCAC?BCBBCD?<DBBB
2L 10 a A 87 0 37 20 .,..,,.,,.,,,,,,,,,, CBCDABC>BCACBB:AC?4C
2L 11 g G 90 0 37 21 .,..,,.,,.,,,,,,,,,,^F. CDCCCAC@@CDDCC<:DBABC
2L 12 c C 117 0 14 31 ...,,*.,,-2gt.,,.,,.,..........,,.. ;B6CCBBCBCACDCCCB@DCDCC8CCCABCC
2L 12 * */-gt 21 21 14 31 * -gt 30 1 0 0 0
2L 13 g G 90 0 37 21 .,..,,.,,.,,,,,,,,,,. CDCCCAC?:CDB5?<>BB>BC
2L 14 t T 45 0 37 6 ..,,,.-1T CCCCCC
2L 14 * -t/-t 56 59 37 6 -t * 1 5 0 0 0
2L 15 t T 93 0 37 22 .,..,,.,,.,,,G,,,,,,.^F. CDCDDDC<@CBB5B@ABC<ACC
2L 16 G G 178 0 36 50 .$,$,$,$,,,,,.,,,,..,.,,...,,,,.,,...,....,...,,,$,,,., BCCCCCCCCCCCCBDBCBCCB>8CCCCBCCBDBC8>ACCBB6CC?CCC6C
2L 17 A A 59 0 36 45 ,$,$,,,C,,,c..,.,,C..,,,,.,,...,....,..C,,,,,$., CCCCCBCCCBB8CACC>@;CCCCDCCCDAC-5ABCAA2CCCC?1A
2L 18 w T 54 54 37 9 tTTTTttTT CCDCCDCCC
2L 18 * A/+A 65 279 37 9 A * 8 1 0 0 0
2L 19 g G 54 0 37 9 ....,,... >>@@CC>BB
2L 20 c C 57 0 37 10 ....,,...^F, CACCCC?CBB
2L 21 t T 57 0 37 10 ....,,..., BCDCCCCCB?
2L 650 t T 48 0 34 7 ..,.,,+4caaa,+4caaa C?CCA=?
2L 650 * */* 9 0 34 7 * +CAAA 5 2 0 0 0
2L 654 A A 48 0 34 7 .$.$,.,+1g,, DBC?CCA
2L 654 * */* 19 0 34 7 * +G 6 1 0 0 0
2L 2332 g G 60 0 14 33 .,...,,.-13GGCGCGCGTGCGC.,,A,,A,,A,..........aa.. DCBBBBDBCCCCBCCCDCBDCBCCCACCBABCC
2L 2332 * */* 61 0 14 33 * -ggcgcgcgtgcgc 32 1 0 0 0
2L 3334 a A 163 0 15 49 ..$,..,,,t,.T,,,..,,,,T,-7attattt,,-7attattt,,,,,,....,,......,.,,.. BBCA>BCCCC:CCCC>ACCCCBCCCCBCCCCCCDCCCCCCCCBCBCDCC
2L 3334 * */-attattt 27 27 15 49 * -attattt 47 2 0 0 0
2L 3928 c C 32 0 0 11 ,,-4tctt,,.-4TCTT...-4TCTT.^!.^!, CCC8CCCCBCA
2L 3928 * */-tctt 157 157 0 11 * -tctt 8 3 0 0 0
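The length-aware removal asked for in point 1, and the column12/column8 ratio from point 2, could be sketched roughly as follows. This is a minimal Python sketch, not a tested solution for real data. The function names `strip_indels`, `second_allele_fraction` and `cleaned_rows` are hypothetical. It assumes the fields are tab-separated as described, that the indel length is a plain decimal number, and that `^` in the read column is followed by a single mapping-quality character (the usual samtools pileup convention), which is copied untouched because that character can itself be `+` or `-`.

```python
import re

# A sign followed by the declared indel length, e.g. "-4" or "+13".
_INDEL_START = re.compile(r"[+-](\d+)")


def strip_indels(reads):
    """Drop +N<seq> / -N<seq> annotations, consuming exactly N letters."""
    out = []
    i = 0
    while i < len(reads):
        if reads[i] == "^":
            # Assumption: '^' is followed by one mapping-quality character,
            # which may be '+' or '-', so both characters are copied as-is.
            out.append(reads[i:i + 2])
            i += 2
            continue
        m = _INDEL_START.match(reads, i)
        if m:
            # Skip the sign, the digits, and exactly that many letters.
            i = m.end() + int(m.group(1))
            continue
        out.append(reads[i])
        i += 1
    return "".join(out)


def second_allele_fraction(fields):
    """column12 / column8 on an indel line, as described in point 2."""
    depth = int(fields[7])
    support = int(fields[11])
    return support / depth if depth else 0.0


def cleaned_rows(lines):
    """Stream rows one at a time so a >10GB file never sits in memory."""
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only base lines (no '*' in column 3) carry a read string in column 9.
        if len(fields) > 8 and fields[2] != "*":
            fields[8] = strip_indels(fields[8])
        yield fields
```

Because `cleaned_rows` accepts any iterable of lines, an open file handle can be passed to it directly. Unlike the greedy awk pattern, a read string such as `-4gtctA,` keeps the trailing `A`, since only the four declared letters are consumed.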
Replies are listed 'Best First'.

Re: how to process very large tab limited pileup file
by lima1 (Curate) on Oct 01, 2010 at 21:13 UTC
by gudluck (Novice) on Oct 01, 2010 at 21:31 UTC
by graff (Chancellor) on Oct 02, 2010 at 17:33 UTC

Re: how to process very large tab limited pileup file
by johngg (Canon) on Oct 02, 2010 at 12:32 UTC

Re: how to process very large tab limited pileup file
by umasuresh (Hermit) on Oct 02, 2010 at 13:59 UTC
by gudluck (Novice) on Oct 06, 2010 at 23:20 UTC