TJCooper has asked for the wisdom of the Perl Monks concerning the following question:

I'm currently processing large .txt files full of biological data and, in an effort to reduce the size of all resulting downstream files, I'm looking to split everything by chromosome (i.e. one file for each chromosome) as opposed to printing everything into single files.

I have a list of chromosome labels, such as 1 2 3 4 5 6 7 8 9 10 that can be supplied to the script and used to generate unique filenames e.g. 1_Sample_Info.txt and 2_Sample_Info.txt. Each line of the input data that is processed contains a chromosome label that ends up in a split array ($F[0]), allowing it to be diverted toward the correct output file.

However, I am struggling to figure out how to best open the files in the first place. As of right now, files are opened with:

open $OUT, '>', "$subDir/$outfile" or die "$!";

And printed to with:

print $OUT "Example\n";

The number of chromosomes and their labels will often differ, so the files cannot be explicitly specified. Therefore, I think the files may have to be generated within a loop that iterates over the supplied list - but - how can I generate a unique filehandle for each output file to then later use with print statements? Another issue is that the chromosome labels can often be numeric (as above) so if used directly as filehandles or variables, they clash with global variables e.g. $1.

Any suggestions or examples would be greatly appreciated. If at all possible I'd like to avoid use of any non-core modules.

Replies are listed 'Best First'.
Re: Opening multiple output files within a loop
by haukex (Archbishop) on Dec 19, 2017 at 11:52 UTC
    use warnings;
    use strict;

    my @labels = qw/ foo bar quz /;
    my %fh;
    for my $label (@labels) {
        my $filename = "out_${label}.txt";
        open $fh{$label}, '>', $filename or die "$filename: $!";
    }

    while (<DATA>) {    # or whatever condition you like, just a demo here
        if (/lbl_(foo|bar|quz)/) {
            print { $fh{$1} } $_;    # how to print to one of these handles
        }
        else { print $_ }    # just a default action, adjust as needed
    }

    close $_ for values %fh;

    __DATA__
    Hello
    lbl_foo
    lbl_bar
    World
    Test lbl_quz
    lbl_foo A
    No label here

Re: Opening multiple output files within a loop
by hippo (Archbishop) on Dec 19, 2017 at 11:40 UTC
    how can I generate a unique filehandle for each output file to then later use with print statements?

    Store all the filehandles as values in a single hash which is keyed on the unique names (whether you use the actual filenames as keys or something symbolic is up to you). This way you can (a) pick a filehandle by key at any time and (b) iterate over all of them should the need arise.
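Applied to the original question, this might look like the minimal sketch below. It assumes tab-separated input with the chromosome label in the first column and reuses the `<label>_Sample_Info.txt` naming from the question; the subroutine name `split_by_chromosome` and its parameters are made up for illustration. Because the labels are only ever used as hash keys (plain strings), numeric labels such as 1 do not clash with variables like $1.

```perl
use strict;
use warnings;

# Hypothetical helper: open one output file per label, then route each
# input line to the file matching its first tab-separated field.
sub split_by_chromosome {
    my ($in, $sub_dir, @labels) = @_;

    my %fh;    # chromosome label => output filehandle
    for my $label (@labels) {
        my $path = "$sub_dir/${label}_Sample_Info.txt";
        open $fh{$label}, '>', $path or die "$path: $!";
    }

    while (my $line = <$in>) {
        my @F = split /\t/, $line;
        # Assumption: lines whose label was not supplied are skipped.
        next unless exists $fh{ $F[0] };
        # (a) pick a filehandle by key
        print { $fh{ $F[0] } } $line;
    }

    # (b) iterate over all of the handles, here to close them
    for my $label (keys %fh) {
        close $fh{$label} or die "close failed for $label: $!";
    }
    return;
}

# Possible usage (file and directory names are placeholders):
# open my $in, '<', 'input.txt' or die "input.txt: $!";
# split_by_chromosome($in, 'output_dir', 1 .. 10);
```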

Re: Opening multiple output files within a loop
by Laurent_R (Canon) on Dec 19, 2017 at 18:15 UTC
    Using a hash (or possibly an array), as suggested by other monks above, for storing file handles is the easiest solution.

    Just remember that, in order to print to a filehandle stored in a hash, you need to put the handle within curly brackets. For example, if %fh is your hash of filehandles and your chromosome id is 5, then you need to do this:

    print {$fh{5}} "to be printed\n";
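The braces are needed because print only accepts a bareword or a plain, unsubscripted scalar variable as its filehandle; anything more complex, such as a hash element, has to be wrapped in a block. The sketch below shows the block form next to two alternatives that should behave the same way. It writes to an in-memory filehandle purely so that the snippet is self-contained; with real files only the open call would differ.

```perl
use strict;
use warnings;
use IO::Handle;    # core module; ensures ->print is available on older perls

my %fh;
my $buffer = '';

# An in-memory filehandle stands in for a real output file in this sketch.
open $fh{5}, '>', \$buffer or die "in-memory open failed: $!";

# 1. Block form: the block returns the handle to print to.
print { $fh{5} } "block form\n";

# 2. Copy the handle into a plain scalar first, then print as usual.
my $out = $fh{5};
print $out "scalar copy\n";

# 3. Call print as a method on the handle.
$fh{5}->print("method call\n");

close $fh{5} or die "close failed: $!";
```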

    Another possible option to solve your problem is to use a dispatch table, i.e. to store in a hash anonymous functions printing to their own file handle (I have no environment on my mobile device to test this code right now).

    my %dispatch;

    sub create_function {
        my $id = shift;
        open my $fh, ">", "file_nr_$id" or die "... $!";
        # the returned closure keeps its own $fh alive
        return sub {
            my $line = shift;
            print $fh $line;
        };
    }

    $dispatch{$_} = create_function($_) for @list_of_chromosomes;
    # ...
    # later when reading the file, assuming you have obtained a $line
    # and an id $id from the input:
    $dispatch{$id}->($line);

    The create_function subroutine is something that is sometimes called a function factory; it generates anonymous subroutines which close over their own filehandle. These anonymous subroutines are returned to the caller and stored in the %dispatch hash. Then, when reading the input, you just call the anonymous subroutine stored in the hash under the relevant id.
Re: Opening multiple output files within a loop
by Anonymous Monk on Dec 19, 2017 at 21:39 UTC
    What I would generally do in this case is to "start with a presumed-empty directory that has been specified by the user." Now, each time the chromosome changes, open the appropriate target file in append ">>" mode. Yes, the user will need to make sure in advance that the target directory is empty, but that should be fine. The program will now open each file positioned at the end-of-file if it already exists, or will create the file if it is new.
      Hm, I am not sure I understand. Do you mean that you want to run an open statement in append mode for each line with a different chromosome in the input file? That would be very inefficient, as opening a file takes some time. You might want to open files dynamically, only when you need them, but then you really want to open each file only if it has not been opened before. From the OP, however, it seems that the program receives the list of chromosomes used in the file as an argument.
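For the case where the labels are not known up front, the "open only if not opened before" idea can be sketched as follows. The helper name `handle_for` and the `out_<label>.txt` naming are assumptions made for illustration. Handles are cached in a hash, so each file is opened once, however often its label recurs in the input. The `//=` operator requires Perl 5.10 or later.

```perl
use strict;
use warnings;

my %fh;    # cache: label => filehandle, filled on first use

# Hypothetical helper: return the handle for $label, opening the file
# (in append mode) only the first time that label is seen.
sub handle_for {
    my ($dir, $label) = @_;
    return $fh{$label} //= do {
        my $path = "$dir/out_$label.txt";
        open my $h, '>>', $path or die "$path: $!";
        $h;
    };
}

# Possible usage, with the label in the first whitespace-separated field
# ($in and $out_dir are placeholders):
# while (my $line = <$in>) {
#     my ($label) = split ' ', $line;
#     print { handle_for($out_dir, $label) } $line;
# }
# close $_ for values %fh;
```

With a handful of chromosomes this keeps only a few files open at once; with thousands of distinct labels the operating system's limit on open files could become a concern.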