Update: Changed chunk_size to '2M'.

Update: Added full example.

Update: Added missing tr line to trim white space.

In the spirit of Perl and bioinformaticians at large, the following does the same thing using the record separator (RS) option in MCE. The value "\n>" is a special case that anchors ">" at the start of a line: workers receive records beginning with ">" and ending in "\n".

The following demonstration is fast for both small and large sequences. A chunk_size greater than 8192 is taken as the minimum number of bytes to read; from there Perl keeps reading until the next record separator. A worker may therefore receive one or several records per chunk, depending on the size of the record(s).

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

mce_open my $out_fh, '>', \*STDOUT or die "open error: $!\n";

mce_flow {
   max_workers => 4,
   chunk_size  => '2m',
   input_data  => "input_file.fasta",
   RS          => "\n>",
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
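   my ( $name, $output ) = ( '', '' );

   for ( @{ $chunk_ref } ) {
      # skip anything lacking a ">name" header, e.g. a comment at the top
      next unless /^>(\w+)/;
      $name = $1;
      tr/\t\r\n //d; # trim white space (header line included)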

      while ( $_ =~ /(?<=(.....))CCCC(.{10})AGA(?=(.....))/g ) {
         $output .= "$name: $1, $2, $3\n";
      }
   }

   print $out_fh $output if length($output);
};
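To make the record layout concrete, here is a minimal plain-Perl sketch of what the RS => "\n>" special case described earlier hands to a worker: records that start with ">" and end in "\n". It does not use MCE and is not how MCE splits input internally; split_fasta_records is a hypothetical helper and the sample data is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MCE): split FASTA text held in memory
# into records that each begin with ">" and end in "\n".
sub split_fasta_records {
    my ($text) = @_;

    # Zero-width split point: right after a newline, right before ">".
    return split /(?<=\n)(?=>)/, $text;
}

# Made-up sample input with two records.
my $fasta   = ">seq1 first\nACGT\nTTGA\n>seq2\nGGCC\n";
my @records = split_fasta_records($fasta);

print scalar(@records), " records\n";
for my $rec (@records) {
    my ($first_line) = split /\n/, $rec;
    print "header: $first_line\n";
}
```

Each element of @records corresponds to one item a worker would loop over in @{ $chunk_ref }.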

The following demonstration is meant mainly as a template for extracting the seq_id, seq_desc, and sequence separately while keeping memory consumption low. The whole header line is removed from the record in place, leaving just the sequence in $_ without Perl making an extra copy.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

mce_open my $out_fh, '>', \*STDOUT or die "open error: $!\n";

mce_flow {
   max_workers => 4,
   chunk_size  => '2m',
   input_data  => "input_file.fasta",
   RS          => "\n>",
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
   my ( $pos, $hdr, $seq_id, $seq_desc, $output );

   for ( @{ $chunk_ref } ) {
      $pos = index($_, "\n") + 1;
      $hdr = substr($_, 0, $pos - 1);

      # skip the first record, e.g. comment at the top of the file
      next if ( $chunk_id == 1 && substr($hdr, 0, 1) ne '>' );

      # extract seq_id and seq_desc
      # (skip the record if the header does not match, so $1 and $2 are never stale)
      ( $seq_id, $seq_desc ) = $hdr =~ /^>(\w+)\s*([^\r\n]*)/
         or next;

      # $_ becomes sequence, without making an extra copy
      substr($_, 0, $pos, '');

      # trim any white space in sequence
      tr/\t\r\n //d;

      # for printing ">header\nsequence\n", uncomment the next 3 lines
      # ( length $seq_desc )
      #    ? print ">$seq_id $seq_desc\n$_\n"
      #    : print ">$seq_id\n$_\n";

      # loop through match patterns
      while ( /(?<=(.....))CCCC(.{10})AGA(?=(.....))/g ) {
         $output .= "$seq_id: $1, $2, $3\n";
      }
   }

   print $out_fh $output if length($output);
};
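The header handling in the template can be tried on a single record without MCE. The sketch below is only an illustration under stated assumptions: strip_header is a hypothetical helper, and the record is made-up data chosen so that the pattern matches once across the line break.

```perl
use strict;
use warnings;

# Hypothetical helper: takes one FASTA record by reference, removes the
# header line in place, and returns ( seq_id, seq_desc ). Afterwards the
# referenced string holds only the sequence, with white space stripped.
sub strip_header {
    my ($rec_ref) = @_;

    my $pos = index( $$rec_ref, "\n" ) + 1;
    return if $pos == 0;                        # no newline, no header line

    my $hdr = substr( $$rec_ref, 0, $pos - 1 );
    my ( $seq_id, $seq_desc ) = $hdr =~ /^>(\w+)\s*([^\r\n]*)/
        or return;

    substr( $$rec_ref, 0, $pos, '' );           # 4-arg substr edits in place
    $$rec_ref =~ tr/\t\r\n //d;                 # join the sequence lines

    return ( $seq_id, $seq_desc );
}

# Made-up record; the motif spans both sequence lines.
my $record = ">seq1 sample description\nAAAAACCCCGGG\nGGGGGGGAGATTTTT\n";
my ( $id, $desc ) = strip_header( \$record );

my @hits;
while ( $record =~ /(?<=(.....))CCCC(.{10})AGA(?=(.....))/g ) {
    push @hits, "$id: $1, $2, $3";
}
print "$_\n" for @hits;    # prints "seq1: AAAAA, GGGGGGGGGG, TTTTT"
```

Passing the record by reference mirrors how the template edits $_ directly: the four-argument substr shortens the original string rather than building a second copy of the sequence.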

Regards, Mario.

In reply to Re^2: Regular expressions across multiple lines by marioroy
in thread Regular expressions across multiple lines by abcd
