Bioinfocoder has asked for the wisdom of the Perl Monks concerning the following question:

I need to write a Perl script that reads a list of gzipped FASTQ file paths from a text file, concatenates those files, and writes the result to a new gzipped file. (It has to be Perl because it will be part of a pipeline.) I am not sure how to do the zcat and concatenation part. The files are several GB each, so I also need to keep storage use and run time under control.

Here is what I have so far:
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError) ;

#-------check the input file specified-------------#
$num_args = $#ARGV + 1;
if ($num_args != 1) {
    print "\nUsage: name.pl Filelist.txt \n";
    exit;
$file_list = $ARGV[0];

#-------------Read the file into arrray-------------#
my @fastqc_files; #Array that contains gzipped files
use File::Slurp;
my @fastqc_files = $file_list;

#-------use the zcat over the array contents
my $outputfile = "combined.txt"
open(my $combined_file, '>', $outputfile) or die "Could not open file '$outputfile' $!";
for my $fastqc_file (@fastqc_files) {
    open(IN, sprintf("zcat %s |", $fastqc_file))
        or die("Can't open pipe from command 'zcat $fastqc_file' : $!\n");
    while (<IN>) {
        while ( my $line = IN ) {
            print $outputfile $line ;
        }
    }
    close(IN);
my $Final_combied_zip = new IO::Compress::Gzip($combined_file);
    or die "gzip failed: $GzipError\n";
Somehow I cannot get it to run. Can anyone suggest a simpler or more correct way to do this? Thanks!

Replies are listed 'Best First'.
Re: concatenate/stitch multiple GZip fastq files and output combined gzip file
by BrowserUk (Patriarch) on Dec 02, 2015 at 16:31 UTC

    I'm not sure there is any real merit in using Perl for something like this. I'd anticipate that a simple shell pipeline would use less memory and run faster.

    In theory, you ought to be able to do something like:

    gunzip -c *.fastq.gz | gzip > combined.fastq.gz

    You'd need a little appropriate shell syntax to get the list of files from your input file, depending upon your chosen shell. ( $(cat filelist.txt) in place of the wildcard, in a POSIX shell?)



      Or if you have a truly absurd number of input files,

      xargs gunzip -c < filelist | gzip > combined.gz

        Or, you could just cat the compressed files together. The gunzip man page says:

        Multiple compressed files can be concatenated. In this case, gunzip will extract all members at once.
        ... and conversely, for gzip -c
        If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
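
As a rough illustration of the "just cat them together" idea, here is a minimal Perl sketch that appends the listed files byte for byte, with no decompression at all. The sub name `concat_gzip_members`, the 1 MiB chunk size, and the command-line shape are assumptions made for this example; none of them come from the thread. The result is a multi-member gzip file, which gunzip reads as one stream, as the quoted man page text says.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Append each gzip file named in $list_path to $out_path byte for byte.
# Nothing is decompressed; the output is a multi-member gzip file.
sub concat_gzip_members {
    my ($list_path, $out_path) = @_;

    open my $list, '<', $list_path or die "Cannot read '$list_path': $!";
    open my $out,  '>', $out_path  or die "Cannot write '$out_path': $!";
    binmode $out;

    while (my $path = <$list>) {
        chomp $path;
        next unless length $path;    # skip blank lines in the list

        open my $in, '<', $path or die "Cannot read '$path': $!";
        binmode $in;

        # Copy in 1 MiB chunks so memory use stays flat for multi-GB inputs.
        my $buffer;
        while (1) {
            my $got = read $in, $buffer, 1 << 20;
            die "Read error on '$path': $!" unless defined $got;
            last if $got == 0;
            print {$out} $buffer or die "Write error on '$out_path': $!";
        }
        close $in;
    }
    close $list;
    close $out or die "Cannot close '$out_path': $!";
    return;
}

# Hypothetical command line: concat_members.pl File_list.txt combined.fastq.gz
concat_gzip_members(@ARGV) if @ARGV == 2;
```

The trade-off is the one the man page hints at: the members stay independently compressed, so the output is about as large as the inputs combined, but nothing is recompressed and the run time is mostly disk I/O.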

      Thanks a lot for your replies. The script works for me now:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use File::Slurp;
      use IO::Compress::Gzip qw(gzip $GzipError);

      my @data = read_file('./File_list.txt');
      my $out = "./test.txt";
      foreach my $data_file (@data) {
          chomp($data_file);
          system("zcat $data_file >> $out");
      }
      my $outzip = "./test.gz";
      gzip $out => $outzip;
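
The script above writes the whole uncompressed data set to test.txt before compressing it, which can be costly when the inputs are several GB. A possible alternative, sketched here under assumptions, streams each input through IO::Uncompress::Gunzip straight into a single IO::Compress::Gzip output, so no intermediate file and no shell are involved. The sub name `merge_gzip_files` and the two-argument command line are hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use IO::Compress::Gzip     qw($GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Decompress every file named in $list_path and recompress the combined
# stream into $out_path, without an uncompressed temporary file.
sub merge_gzip_files {
    my ($list_path, $out_path) = @_;

    open my $list, '<', $list_path or die "Cannot read '$list_path': $!";
    my $out = IO::Compress::Gzip->new($out_path)
        or die "Cannot open '$out_path' for gzip output: $GzipError";

    while (my $path = <$list>) {
        chomp $path;
        next unless length $path;    # skip blank lines in the list

        # MultiStream => 1 keeps reading past the first gzip member, in
        # case an input file is itself a concatenation of members.
        my $in = IO::Uncompress::Gunzip->new($path, MultiStream => 1)
            or die "Cannot open '$path': $GunzipError";

        # read() returns the number of bytes placed in $buffer, 0 at end
        # of file, and a negative value on error.
        my ($buffer, $status);
        while (($status = $in->read($buffer)) > 0) {
            $out->print($buffer)
                or die "Error writing '$out_path': $GzipError";
        }
        die "Error reading '$path': $GunzipError" if $status < 0;
        $in->close;
    }
    close $list;
    $out->close or die "Cannot finish '$out_path': $GzipError";
    return;
}

# Hypothetical command line: merge_gz.pl File_list.txt combined.fastq.gz
merge_gzip_files(@ARGV) if @ARGV == 2;
```

Whether this is faster than piping through external zcat and gzip processes is not something the thread measures, so it is worth timing both on real data.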
Re: concatenate/stitch multiple GZip fastq files and output combined gzip file
by 1nickt (Canon) on Dec 02, 2015 at 16:05 UTC

    This:

    while (<IN>) { while ( my $line = IN ) { print $outputfile $line ; } }
    should be this:
    while ( my $line = <IN> ) { print $outputfile $line ; }

    But you should post the error you are getting here to give the Monks a chance to help without having to try to get your code to run.
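
One further detail in that loop: `$outputfile` holds the file name, while the handle opened in the original post is `$combined_file`, so the print needs the handle. A minimal sketch of the read loop with that change follows. The helper name `copy_lines` and the command-line usage are assumptions for illustration, and the list form of the pipe open is a swapped-in choice that avoids passing the file name through a shell.

```perl
use strict;
use warnings;

# Copy every line from an open input handle to an open output handle.
sub copy_lines {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} $line;    # print to the handle, not to the file name
    }
    return;
}

# Hypothetical usage: copy_lines.pl reads.fastq.gz combined.txt
if (@ARGV == 2) {
    my ($fastq_gz, $outputfile) = @ARGV;

    # List-form pipe open: zcat receives the file name as a single argument.
    open my $in, '-|', 'zcat', $fastq_gz
        or die "Cannot run zcat on '$fastq_gz': $!";
    open my $combined_file, '>', $outputfile
        or die "Could not open file '$outputfile': $!";

    copy_lines($in, $combined_file);

    close $in            or die "zcat failed on '$fastq_gz'\n";
    close $combined_file or die "Could not close '$outputfile': $!";
}
```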

    The way forward always starts with a minimal test.
Re: concatenate/stitch multiple GZip fastq files and output combined gzip file
by kcott (Archbishop) on Dec 03, 2015 at 01:03 UTC

    G'day Bioinfocoder,

    Welcome to the Monastery.

    Here's 'pm_1149165_zip_merge.pl', which appears to do what you want.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use autodie qw{:all};

    open my $gz_pipe, '|-', 'gzip >> pm_1149165_all.gz';

    while (<>) {
        open my $zcat_pipe, '-|', "zcat $_";
        print $gz_pipe $_ while <$zcat_pipe>;
    }

    Here's a sample run (with minimal test data):

    $ zcat pm_1149165_1.gz
    q
    w
    e
    $ zcat pm_1149165_2.gz
    a
    s
    d
    $ zcat pm_1149165_3.gz
    z
    x
    c
    $ cat pm_1149165_list
    pm_1149165_1.gz
    pm_1149165_2.gz
    pm_1149165_3.gz
    $ ls -l pm_1149165_all.gz
    ls: pm_1149165_all.gz: No such file or directory
    $ pm_1149165_zip_merge.pl pm_1149165_list
    $ ls -l pm_1149165_all.gz
    -rw-r--r-- 1 ken staff 38 3 Dec 11:44 pm_1149165_all.gz
    $ zcat pm_1149165_all.gz
    q
    w
    e
    a
    s
    d
    z
    x
    c

    I'll leave you to add in argument checking, usage message, and similar niceties.
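
One way those niceties might look is sketched below. It keeps the zcat and gzip pipes from the script above and adds an argument check and a usage message. Everything named here is an assumption for illustration: the subs `check_args` and `merge_listed_files`, the conservative `$SAFE_PATH` pattern (used because the paths are interpolated into shell commands), and the choice to refuse an existing output file instead of appending to it.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;

# Conservative pattern for paths that are safe to interpolate into a shell
# command. This is an assumption of the sketch, not a rule from the thread.
my $SAFE_PATH = qr{\A[\w./+-]+\z};

# Return an error message for unusable arguments, or undef if they are fine.
sub check_args {
    my @args = @_;
    return 'expected exactly two arguments' unless @args == 2;
    my ($list_path, $out_path) = @args;
    return "cannot read file list '$list_path'" unless -f $list_path && -r _;
    return "unsafe output name '$out_path'"     unless $out_path =~ $SAFE_PATH;
    return "output '$out_path' already exists"  if -e $out_path;
    return;
}

# Pipe every listed file through zcat into a single gzip process.
sub merge_listed_files {
    my ($list_path, $out_path) = @_;

    open my $list,    '<',  $list_path;
    open my $gz_pipe, '|-', "gzip > $out_path";

    while (my $path = <$list>) {
        chomp $path;
        next unless length $path;    # skip blank lines in the list
        die "Unsafe or unreadable input '$path'\n"
            unless $path =~ $SAFE_PATH && -r $path;

        open my $zcat_pipe, '-|', "zcat $path";
        while (my $line = <$zcat_pipe>) {
            print {$gz_pipe} $line;
        }
        close $zcat_pipe;    # autodie dies if zcat exited non-zero
    }
    close $gz_pipe;
    close $list;
    return;
}

if (@ARGV) {
    if (defined(my $error = check_args(@ARGV))) {
        die "$error\nUsage: $0 FILE_LIST OUTPUT.gz\n";
    }
    merge_listed_files(@ARGV);
}
```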

    — Ken