I'm having trouble understanding... So, the first data snippet (lots of columns, first column is always "0") is supposed to be read in via the "CG" file handle, and the second data snippet (3 columns, first column is always "chrX") is read in via "INTERVAL", right?

And, if the latter data snippet is "realistic", then your goal is to produce nine distinct text files, with lots of overlapping content across those nine files - for example, if a given line from the CG input has a value of 999999 in column 4, it will be included in all nine outputs, right? (Because "999999" falls within the range for all nine intervals in that data snippet.)

If I have that right, then I think you'll be better off if you read the interval data first, create a hash containing file handles for the intervals along with their min and max values. Then read the CG data; as you look at each CG record, loop over the hash of intervals and print to each of the file handles where it belongs. Something like:

# set up your path strings for the two input files... then: my %intervals; open( INTERVALS, $interval_path ) or die "$interval_path: $!\n"; while (<INTERVALS>) { chomp; my ( $str, $min, $max ) = split; next unless ( $min =~ /^\d+$/ and $max =~ /^\d+$/ ); my $out_path = "...."; # whatever makes a good name for this outp +ut... open( my $ofh, '>', $out_path ) or die "$out_path: $!\n"; $intervals{$out_path} = { 'min' => $min, 'max' => $max, 'fh' => $o +fh }; } open( CG, $cg_path ) or die "$cg_path: $!\n"; while (<CG>) { my $keyval = ( split )[3]; for my $output ( keys %intervals ) { if ( $keyval >= $intervals{$output}{'min'} and $keyval <= $intervals{$output}{'max'} ) { print { $intervals{$output}{'fh'} } $_; } } }
(not tested)

UPDATE: If your "intervals" list is really a lot longer than the 9-line snippet that you showed us, you might not be able to have that many output file handles open at once. In that case, move the open statement out of the first while loop (don't store an 'fh' element in the hash), and put it just before the print statement of the second loop (and change to append mode) - i.e.:

... while (<CG>) { my $keyval = ( split )[3]; for my $output ( keys %intervals ) { if ( $keyval >= $intervals{$output}{'min'} and $keyval <= $intervals{$output}{'max'} ) { open( my $ofh, '>>', $output ) or die "$out_path: $!\n"; print $ofh $_; } } }
By using a lexical scalar variable for the file handle, it will be closed automatically at each iteration, which is what you would want in this case.

In reply to Re: Sorting Data By Overlapping Intervals by graff
in thread Sorting Data By Overlapping Intervals by ccelt09

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.