Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Getting Lines 2,3, and 4 from FASTQ file's Chunk

by neversaint (Deacon)
on Apr 16, 2010 at 07:05 UTC ( [id://835024]=perlquestion: print w/replies, xml ) Need Help??

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
I have a data that looks like this (fastq format). Note that each entry consist of chunk of four lines. Each chunk begins with an entry with "@"sign and ends with octal values (e.g. (IIII... ).
@SRR016086.1 308GJAAXX:2:1:911:957 length=36 CCCCCCCCCCACCCCCCCACACCACAAACACACCAC +SRR016086.1 308GJAAXX:2:1:911:957 length=36 (IIIIIIII:&H68;I-3%/)*-/4$$')(0&$&,$ @SRR016086.2 308GJAAXX:2:1:916:735 length=36 AGATCAGTTTTGCACTCCACTTAAGTTTGTTCATAT +SRR016086.2 308GJAAXX:2:1:916:735 length=36 I*IIIIIIIIIII%%CI()I2E6.I5;>+"=C+#%I @SRR016086.3 308GJAAXX:2:1:908:956 length=36 CCCCCCCCCAACACAAAAACAAAATCCAAAACAATA +SRR016086.3 308GJAAXX:2:1:908:956 length=36 II@III50I*9I+<00%.<9/,%-"$*2('&"&)""
What I want to do is to extract line 2, 3 and 4 of each chunk from the above file. I've tried Bioperl with this script but fail:
use Bio::SeqIO; my $file = $ARGV[0] || "test.fastq"; my $in = Bio::SeqIO->new(-file => $file , '-format' => 'fastq'); while ( my $seq = $in->next_seq() ) { print "$seq->id\n"; }
Nor with my attempted homebrew method, in which I'm stuck with the logic.
my $INFILE_file_name = $file; # input file name open ( INFILE, '<', $INFILE_file_name ) or croak "$0 : failed to open input $INFILE_file_name : $!\n"; my $count = 0; my $seqlinec = 2; while ( <INFILE> ) { chomp; # Trying to catch line 2 of each chunk.. if ( $seqlinc eq $count ) { print "$_\n"; } $seqlinec += 4; } close ( INFILE ); # close input file
What's the right way to do it?

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Getting Lines 2,3, and 4 from FASTQ file's Chunk
by BrowserUk (Patriarch) on Apr 16, 2010 at 08:01 UTC

    To skip the first of every 4 lines:

    perl -wlne"($.%4)-1 and print" yourFile

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Getting Lines 2,3, and 4 from FASTQ file's Chunk
by Ratazong (Monsignor) on Apr 16, 2010 at 07:17 UTC
    my $count = 0; my $seqlinec = 2; while ( <INFILE> ) { if ( $seqlinc eq $count ) { print "$_\n"; } $seqlinec += 4; }

    I assume you want to somehow count the line you are reading. In that case, you should increment your counter somewhere. Maybe do something like the following (untested!) code

    my $count = 0; while ( <INFILE> ) { next if ( !$count++ ); # ignore line zero, increase linec +ount print $_; $count = 0 if ($count >= 3); # reset the counter once we printe +d line 3 }

    HTH, Rata

    note: of course the code could be easily optimized, e.g. by using ($count++ % 4); see Boolean counter? for some ways (you need to adapt them to count modulo 4)
Re: Getting Lines 2,3, and 4 from FASTQ file's Chunk
by Sinistral (Monsignor) on Apr 16, 2010 at 12:20 UTC

    Although the other monks have given excellent answers, I wanted to make sure you've made note of the many resources at the BioPerl site. Specifically, they have information about your FASTQ sequence format that may have links to tools to help you efficiently process the data (perhaps their information on Bio:SeqIO::fastq also available at Bio::SeqIO::fastq).

Re: Getting Lines 2,3, and 4 from FASTQ file's Chunk
by umasuresh (Hermit) on Apr 16, 2010 at 14:22 UTC
    Another simple solution:
    use strict; use warnings; while(<DATA>) { chomp; my $line = $_; # skip the first line of every chunk if ($line =~/^@/) { } # print the remaining lines else { print "$line\n"; } } __DATA__ @SRR016086.1 308GJAAXX:2:1:911:957 length=36 CCCCCCCCCCACCCCCCCACACCACAAACACACCAC +SRR016086.1 308GJAAXX:2:1:911:957 length=36 (IIIIIIII:&H68;I-3%/)*-/4$$')(0&$&,$ @SRR016086.2 308GJAAXX:2:1:916:735 length=36 AGATCAGTTTTGCACTCCACTTAAGTTTGTTCATAT +SRR016086.2 308GJAAXX:2:1:916:735 length=36 I*IIIIIIIIIII%%CI()I2E6.I5;>+"=C+#%I @SRR016086.3 308GJAAXX:2:1:908:956 length=36 CCCCCCCCCAACACAAAAACAAAATCCAAAACAATA +SRR016086.3 308GJAAXX:2:1:908:956 length=36 II@III50I*9I+<00%.<9/,%-"$*2('&"&)""

      It is a simple solution but it would be even simpler if you avoided the un-necessary chomp and assignment to $line, and used an unless statement modifier. I don't know if it is safe to assume that every record begins with an "@" symbol.

      while ( <DATA> ) { print unless m{^\@}; }

      I hope this is of interest.

      Cheers,

      JohnGG

      Skipping lines based on '@' at the start is dangerous as quality values (line 4) can be '@'!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://835024]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (4)
As of 2024-04-16 15:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found