Re: Using hash keys to separate data

My main goal is to just separate each chr from the input file (testReg.txt) into separate files. If you have any suggestions please let me know.

You are over-thinking this. You don't even need the file hashKey.txt. I think testReg.txt is the 15GB monster file. If this file is not already sorted, sort it with the system's command-line sort, which can handle files far bigger than the size of memory.
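As a rough sketch of that sorting step, the external sort can also be driven from Perl. This assumes a POSIX-style sort program is on the PATH; the sub name sort_by_chrom and the output file name are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort $in by its first column into $out using the external sort program.
# sort_by_chrom and both file names are illustrative, not from the post.
sub sort_by_chrom {
    my ($in, $out) = @_;
    local $ENV{LC_ALL} = 'C';   # plain byte-wise comparison
    # -k1,1 : use only the first field as the sort key
    # -o    : write the sorted result to $out
    system('sort', '-k1,1', '-o', $out, $in) == 0
        or die "sort failed on $in (exit status $?)\n";
    return $out;
}

# Example call (file names are assumptions):
# sort_by_chrom('testReg.txt', 'testReg.sorted.txt');
```

The resulting order is byte-wise rather than numeric (chr11 comes before chr2), which does not matter here: all that is needed is that lines with the same chromosome end up next to each other.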

Now all of the lines that have the same chromosome will be grouped together in the file. We just read the file and every time we switch to a new chromosome, we start a new file.

#!/usr/bin/perl -w
use strict;

my $curr_chrom = "";

while (<DATA>)
{
   my ($chrom) = split;  # $chrom is the first column
                         # parens on the left side are needed
                         # for list context
   if ($chrom ne $curr_chrom)
   {
       $curr_chrom = $chrom;
       open (OUT, '>', "$curr_chrom.out") 
          or die "unable to write $curr_chrom.out $!\n";
   }
   print OUT;
}
close OUT;


__DATA__
chr1    100 159 0
chr1    200 260 0
chr1    500 750 0
chr3    450 700 0
chr4    100 300 0
chr7    350 600 0
chr9    100 125 0
chr11   679 687 0
chr22   100 200 0
chr22   300 400 0

A few notes: If a file handle is open to one file and is then opened to another file, the first file is closed automatically (no need to close it explicitly). For your data, you normally want to split on any run of whitespace characters. That is what the default split does, as used in: my ($chrom) = split;. It behaves like split(/\s+/,$_), except that leading whitespace is also skipped. Splitting on \t is probably not what you want, and splitting on \n certainly is not.
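A small sketch of those split variants side by side may help. The sample line is made up, with deliberately messy spacing, and the variable names are only for illustration.

```perl
use strict;
use warnings;

my $line = "  chr11 \t 679  687\t0\n";   # invented line with stray spaces

# Default-style split: leading whitespace is skipped and any run of
# whitespace (including the trailing newline) acts as one separator.
my @default = split ' ', $line;

# split /\s+/ is nearly the same, but leading whitespace produces an
# empty first field.
my @regex = split /\s+/, $line;

# Splitting on single tabs keeps the stray spaces and the newline.
my @tabs = split /\t/, $line;

# List context: the parentheses make $chrom receive the first field.
my ($chrom) = split ' ', $line;

print "default: ", join('|', @default), "\n";   # chr11|679|687|0
print "regex:   ", join('|', @regex),   "\n";   # |chr11|679|687|0
print "first:   $chrom\n";                      # chr11
```

Without the parentheses, split is evaluated in scalar context and returns the number of fields rather than the first field.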

Update: From the wording of the post, I don't think that you are interested in only a subset of the chromosomes in the input file, but if you were, here's how. Make a hash table whose keys are the chromosomes that you want. In the above program, when the chromosome changes (the if statement), test whether the chromosome is on the "approved" list (its name exists in the hash table). If it is, open OUT to that name as above; if it is not, open OUT to "/dev/null". /dev/null is a special device that discards everything written to it (it is the "bit bucket"). That way you always execute the print OUT; statement. Sometimes the output goes somewhere useful and sometimes into the black hole of bits.

To make the hash, your code:

while (<KEY>) {

    chomp;
    @key_split = split("\n");
    $Chr{"$key_split[0]"} = $key_split[0];
}
## better written as: ##
while (<KEY>) {
    my ($chrom) = split;
    $Chr{$chrom}=1;
}
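Putting the "approved list" idea together with the main loop might look roughly like the sketch below. It assumes the input is already grouped by chromosome; the sub name split_wanted, its parameters, and the use of File::Spec->devnull (a portable spelling of /dev/null) are choices made for this example, not part of the original program.

```perl
use strict;
use warnings;
use File::Spec;

# Write lines for wanted chromosomes to "<dir>/<chrom>.out" and send
# everything else to the null device.
# $in is an open input handle, $wanted a hash ref of approved names,
# $dir the output directory (all hypothetical names).
sub split_wanted {
    my ($in, $wanted, $dir) = @_;
    my $curr_chrom = '';
    my $out;
    while (my $line = <$in>) {
        next if $line =~ /^\s*$/;           # skip blank lines
        my ($chrom) = split ' ', $line;     # first column
        if ($chrom ne $curr_chrom) {
            $curr_chrom = $chrom;
            my $target = exists $wanted->{$chrom}
                       ? File::Spec->catfile($dir, "$chrom.out")
                       : File::Spec->devnull;
            # Re-opening the handle closes the previous file automatically.
            open($out, '>', $target)
                or die "unable to write $target: $!\n";
        }
        print {$out} $line;                 # always print, wanted or not
    }
    close $out if $out;
    return;
}
```

A call could look like split_wanted($fh, \%Chr, '.'), where %Chr is the hash built from the key file as shown above.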

Re^2: Using hash keys to separate data by a217 (Novice) on Jun 29, 2011 at 15:51 UTC
Marshall, I suppose I was over-thinking it. Your method looks like it runs in linear time, and for the large input file that I'm working with I think that will be beneficial. The only reason I included the key list is that I thought it would be the easiest way to separate the input data into separate files. However, the input data is already well-sorted, so your method should work. One more general question: with my code and the suggestions everyone has given, there is still a warning message (despite the fact that the output is correct). Is there any way to get rid of this message, or is it just something I am going to have to deal with? The message refers to uninitialized values, and I was trying to fix this before. However, I suppose if the output is still correct that is the only thing that matters.
Re^3: Using hash keys to separate data by Marshall (Canon) on Jun 29, 2011 at 22:43 UTC
Yes, once the file is sorted, the algorithm just reads it once in a linear fashion. So this should be great for your humongous file. The warning message should give you a line number in the code, and often you also get the line number of the input file. One common way to get an uninitialized value is a blank line in the file: the split returns no results, so the first field is undefined. An extra carriage return is easy to miss since it is "invisible". I often put: next if /^\s*$/; which goes to the next input line if the current line contains nothing but whitespace. I think I already mentioned that you should normally split on whitespace, which is the default. Whitespace (\s) includes all of the following: the space of course, plus \n, \r, \f and \t. Any contiguous sequence of those is treated as a single separator. Splitting on just tab characters (\t) can cause problems if there are sometimes extra space characters in there that you cannot see with the editor. I think you are on the right track - keep at it!
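To illustrate the blank-line point, here is a minimal sketch using made-up in-memory data. The sample text and variable names are assumptions for the example only.

```perl
use strict;
use warnings;

# A blank (or whitespace-only) line splits into an empty list, so the
# first field is undef and a later string comparison would warn.
my ($empty) = split ' ', "   \r\n";
print "first field is undefined on a blank line\n" unless defined $empty;

# Invented sample input containing an empty line and a whitespace-only line.
my $sample = "chr1 100 159 0\n\nchr3 450 700 0\n \t\r\n";
open(my $fh, '<', \$sample) or die "cannot open in-memory file: $!\n";

my @chroms;
while (<$fh>) {
    next if /^\s*$/;        # skip lines holding only whitespace
    my ($chrom) = split;    # safe now: at least one field exists
    push @chroms, $chrom;
}
close $fh;

print join(',', @chroms), "\n";   # chr1,chr3
```

With the next if line removed, $chrom would be undef for the two blank lines, which is the kind of situation that produces an "uninitialized value" warning once $chrom is compared or printed.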