I have a series of records (4 lines per record) in FASTQ format from DNA sequencing projects. The data come from 3 different replicates that are "barcoded". The barcode is the first 9 letters of the second line of each record, as in the example below:
@DJB775P1:314:C0K9AACXX:3:1101:1597:2067 1:N:0:
AGTTTGTATAGCACGCGCGCCAGCAGCTGCAGCAGGTGCCCGGGCTGCTGG
+
CCCFFFFFHHHHHJJJJJJJIJJJJJIJIJJJIJJJFGIJJIJHGFFFFEE
@DJB775P1:314:C0K9AACXX:3:1101:1644:2074 1:N:0:
CTTGGTTAGAGCTCTTTCTCGATTCCGTGGGTGGTGGTGCAGATCGGAAGA
+
CCCFFDFFHHHHHJJJJJJJJJJJJJJFHIJCGIDGIGHGIJJJJJJHIBH
@DJB775P1:314:C0K9AACXX:3:1101:1707:2079 1:N:0:
ACGACCTTATAAACGGGTGGGGTCCGCGCAGTCCGCCCGGAGGATTAGATC
+
@CCFFFFFHGGHHJJJJ<CGHJCGIJIIJ@AFCHEHHHBEACD?CCDCCDD
@DJB775P1:314:C0K9AACXX:3:1101:1543:2082 1:N:0:
TCTTTGTTCAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGAGATCG
+
@@CFFFDFHHHHHJJJJEIJIJIJJJJJJJJJJJD9B<CDDDCDDDDDCDD
There are 4 records and the first 9 letters for each record are the following:
AGTTTGTAT
CTTGGTTAG
ACGACCTTA
TCTTTGTTC
Thus, the general form of the barcode is the following:
NNNXXXXNN, where N can be any of the letters A, C, T or G, and XXXX is TTGT, GGTT or ACCT (one for each library). As you can see, records 1 and 4 belong to the same replicate.
I would like to separate the records (currently in a single file) into 3 files, depending on whether the barcode contains TTGT, GGTT or ACCT. The main file is currently about 34 GB in size.
I wonder if anyone can suggest how I can go about doing this using Perl and the BioPerl modules. I am a complete newbie in Perl programming.
Thank you very much for your time and for reading my post.
Regards,
snakebites
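One possible approach, as a minimal sketch in core Perl (5.10 or later) rather than BioPerl: read the file four lines at a time, look at bases 4 to 7 of the sequence line, and print the whole record to the file for that tag. The sub names `barcode_tag` and `split_fastq`, the output file names (`TTGT.fastq`, `GGTT.fastq`, `ACCT.fastq`) and the extra `unmatched.fastq` catch-all are assumptions made for illustration, not part of the original question. The sketch also assumes every record is exactly 4 lines, as in the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Library tags expected at positions 4-7 of each read (taken from the post).
my @TAGS   = qw(TTGT GGTT ACCT);
my %IS_TAG = map { $_ => 1 } @TAGS;

# Return the 4-letter library tag of a read, or undef if the first nine
# bases do not have the form NNN<tag>NN with N in A, C, G, T.
sub barcode_tag {
    my ($seq) = @_;
    return undef unless defined $seq
        && $seq =~ /^[ACGT]{3}([ACGT]{4})[ACGT]{2}/;
    return $IS_TAG{$1} ? $1 : undef;
}

# Read 4-line records from $in and print each one to the handle stored
# under its tag in %$out. Records without a recognised tag go to
# $out->{unmatched}. Returns a hash ref of record counts per key.
sub split_fastq {
    my ($in, $out) = @_;
    my %count;
    while (defined(my $header = <$in>)) {
        my $seq  = <$in>;
        my $plus = <$in>;
        my $qual = <$in>;
        die "Truncated FASTQ record near input line $.\n"
            unless defined $qual;
        my $key = barcode_tag($seq) // 'unmatched';
        print { $out->{$key} } $header, $seq, $plus, $qual;
        $count{$key}++;
    }
    return \%count;
}

# Usage (hypothetical script name): perl split_barcodes.pl reads.fastq
if (@ARGV) {
    my $file = shift @ARGV;
    open my $in, '<', $file or die "Cannot open $file: $!\n";
    my %out;
    for my $key (@TAGS, 'unmatched') {
        open my $fh, '>', "$key.fastq"
            or die "Cannot write $key.fastq: $!\n";
        $out{$key} = $fh;
    }
    my $count = split_fastq($in, \%out);
    close $in;
    for my $key (keys %out) {
        close $out{$key} or die "Error closing $key.fastq: $!\n";
    }
    printf "%-9s %d\n", $_, $count->{$_} // 0 for @TAGS, 'unmatched';
}
```

Because only one record is held in memory at a time, the size of the input file should not matter much beyond running time. On the four example records above, records 1 and 4 would go to `TTGT.fastq`, record 2 to `GGTT.fastq` and record 3 to `ACCT.fastq`. BioPerl's `Bio::SeqIO` can also read FASTQ, but for a plain split like this the line-based loop is probably simpler.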