Using hash keys to separate data

a217 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perlmonks,

I have a file (hashKey.txt) that I would like to use as a list of hash keys in order to separate data from input files (e.g. testReg.txt).

hashKey.txt

chr10
chr10_random
chr11
chr11_gl000202_random
chr11_random
chr12
chr13
chr13_random
chr14
chr15
chr15_random
chr16
chr16_random
chr17_ctg5_hap1
chr17
chr17_gl000203_random
chr17_gl000204_random
chr17_gl000205_random
chr17_gl000206_random
chr17_random
chr18
chr18_gl000207_random
chr18_random
chr19
chr19_gl000208_random
chr19_gl000209_random
chr19_random
chr1
chr1_gl000191_random
chr1_gl000192_random
chr1_random
chr20
chr21
chr21_gl000210_random
chr21_random
chr22
chr22_h2_hap1
chr22_random
chr2
chr2_random
chr3
chr3_random
chr4_ctg9_hap1
chr4
chr4_gl000193_random
chr4_gl000194_random
chr4_random
chr5
chr5_h2_hap1
chr5_random
chr6_apd_hap1
chr6_cox_hap1
chr6_cox_hap2
chr6_dbb_hap3
chr6
chr6_mann_hap4
chr6_mcf_hap5
chr6_qbl_hap2
chr6_qbl_hap6
chr6_random
chr6_ssto_hap7
chr7
chr7_gl000195_random
chr7_random
chr8
chr8_gl000196_random
chr8_gl000197_random
chr8_random
chr9
chr9_gl000198_random
chr9_gl000199_random
chr9_gl000200_random
chr9_gl000201_random
chr9_random
chrM
chrUn_gl000211
chrUn_gl000212
chrUn_gl000213
chrUn_gl000214
chrUn_gl000215
chrUn_gl000216
chrUn_gl000217
chrUn_gl000218
chrUn_gl000219
chrUn_gl000220
chrUn_gl000221
chrUn_gl000222
chrUn_gl000223
chrUn_gl000224
chrUn_gl000225
chrUn_gl000226
chrUn_gl000227
chrUn_gl000228
chrUn_gl000229
chrUn_gl000230
chrUn_gl000231
chrUn_gl000232
chrUn_gl000233
chrUn_gl000234
chrUn_gl000235
chrUn_gl000236
chrUn_gl000237
chrUn_gl000238
chrUn_gl000239
chrUn_gl000240
chrUn_gl000241
chrUn_gl000242
chrUn_gl000243
chrUn_gl000244
chrUn_gl000245
chrUn_gl000246
chrUn_gl000247
chrUn_gl000248
chrUn_gl000249
chrX
chrX_random
chrY

hashKey.txt gives a list of all the possible chromosome values that could appear in a given input file.

testReg.txt

chr1    100    159    0
chr1    200    260    0
chr1    500    750    0
chr3    450    700    0
chr4    100    300    0
chr7    350    600    0
chr9    100    125    0
chr11    679    687    0
chr22    100    200    0
chr22    300    400    0

testReg.txt is simply a test file I use to test the code. It includes various chromosome values along with 3 other columns of data.

My code so far:

#!/usr/bin/perl
use warnings; use strict;

my (%Chr, %R);
my (@key_split, @reg_split);
my ($reg_line);

open(KEY, "<hashKey.txt") or die "error reading key list";
open(REG, "<testReg.txt") or die "error reading file";

while (<KEY>) {

    chomp;
    @key_split = split("\n");
    $Chr{"$key_split[0]"} = $key_split[0];
}

while (<REG>) {

    chomp;
    @reg_split = split("\t");
    #$R{"$reg_split[0]"} = ($reg_split[0], $reg_split[1], $reg_split[2], $reg_split[3]);
    $R{"$reg_split[0]"} = $reg_split[0];
}


foreach my $key (keys %Chr) {
    if(exists($R{$key})){
        print ("$R{$key}\n");
    }
}
close(KEY);
close(REG);

So far, my code prints out all of the chr values in common between hashKey.txt and testReg.txt. What I would like it to do is to print each line to a separate file designated by each chromosome. For example:

chr1.out

chr1    100    159    0
chr1    200    260    0
chr1    500    750    0

chr3.out

chr3    450    700    0

chr4.out

chr4    100    300    0

chr7.out

chr7    350    600    0

chr9.out

chr9    100    125    0

chr11.out

chr11    679    687    0

chr22.out

chr22    100    200    0
chr22    300    400    0

From there I can use each separated file to sort what I need to. I suppose my main problem is figuring out how to have the hash variable point to the unique line. Is what I am trying to accomplish even possible with a hash table, given that the same key could apply to multiple lines? My main goal is just to separate each chr from the input file (testReg.txt) into separate files. If you have any suggestions please let me know.

Re: Using hash keys to separate data by wfsp (Abbot) on Jun 29, 2011 at 06:04 UTC
Nearly there. :-)

#!/usr/bin/perl
use warnings; use strict;

open(KEY, "<hashKey.txt") or die "error reading key list";
open(REG, "<testReg.txt") or die "error reading file";

my %Chr;
while (my $key = <KEY>) {
    chomp $key;
    $Chr{$key} = undef;
}

my %R;
while (my $reg = <REG>) {
    chomp $reg;
    my @reg_split = split("\t", $reg);
    push @{$R{$reg_split[0]}}, $reg;
}

foreach my $key (sort keys %R) {
    next unless exists $Chr{$key};
    for my $out (@{$R{$key}}){
        print "$out\n";
    }
    print q{-} x 20, qq{\n};
}
close(KEY);
close(REG);

chr1    100    159    0
chr1    200    260    0
chr1    500    750    0
--------------------
chr11    679    687    0
--------------------
chr22    100    200    0
chr22    300    400    0
--------------------
chr3    450    700    0
--------------------
chr4    100    300    0
--------------------
chr7    350    600    0
--------------------
chr9    100    125    0
--------------------

The first `while` loop creates a lookup table (`%Chr`). The source file has only one field per record, so there is no need for the `split`.

The second `while` loop creates a hash of arrays (`%R`) from your input file. The key is the first field (the chromosome) and the value is an array of records. That's what the `push` is doing.

Finally, we print the records for each chromosome if it exists in the lookup table. In your case you want to print to a file rather than to STDOUT as we do here.

As an aside, you could rewrite the first `while` loop with `map`.

Hope that helps.

Update: Reading your question again, I see that hashKey.txt gives a list of all the possible chromosome values there could be in a given input file. If that is the case, why do you need the lookup table? I could see it being useful if there could be values in your input that you weren't interested in.
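The `map` rewrite mentioned in the aside, combined with printing to one file per chromosome instead of STDOUT, might look like the sketch below. It is only an illustration: the two arrays are made-up stand-ins for hashKey.txt and testReg.txt so the script runs on its own, and the `.out` file names follow the naming used in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for hashKey.txt and testReg.txt (assumed sample data).
my @key_lines = ("chr1\n", "chr3\n", "chr22\n");
my @reg_lines = ("chr1\t100\t159\t0\n", "chr3\t450\t700\t0\n",
                 "chr1\t200\t260\t0\n", "chrZ\t1\t2\t0\n");

# Lookup table built with map instead of a while loop.
chomp @key_lines;
my %Chr = map { $_ => undef } @key_lines;

# Hash of arrays: chromosome => list of complete records.
my %R;
for my $reg (@reg_lines) {
    chomp(my $line = $reg);
    my ($chrom) = split /\t/, $line;
    push @{ $R{$chrom} }, $line;
}

# One output file per chromosome that appears in the lookup table.
for my $key (sort keys %R) {
    next unless exists $Chr{$key};    # chrZ is not in the table, so it is skipped
    open my $fh, '>', "$key.out" or die "Cannot open $key.out: $!";
    print {$fh} "$_\n" for @{ $R{$key} };
    close $fh or die "Cannot close $key.out: $!";
}
```

Because every record is held in `%R` before anything is written, this approach only suits inputs that fit in memory.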
Re: Using hash keys to separate data by bart (Canon) on Jun 29, 2011 at 07:09 UTC
If you split each line into 2 parts, instead of into as many as the line contains, then you'll keep the rest of the row intact. Also, you can assign to a list of scalars, which is easier to handle than an array. And for the rest, as wfsp already said: push the data onto the anonymous array that forms the hash value (it is autovivified, so don't worry about the anonymous array not existing yet).

while (<REG>) {
    chomp;
    my($key, $data) = split "\t", $_, 2;
    push @{$R{$key}}, $data;
}

After that it's just a matter of looping through the keys and printing out the contents of each array.

foreach my $key (keys %R) {
    open my $fh, '>', "$key.out" or die "Cannot open file $key.out: $!";
    foreach my $row (@{$R{$key}}) {
        print $fh "$key\t$row\n";
    }
}
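To make the effect of the limit argument concrete, here is a small self-contained sketch comparing `split` with and without a limit of 2. The sample line is taken from testReg.txt in the question, assuming its columns are tab-separated.

```perl
use strict;
use warnings;

my $line = "chr1\t100\t159\t0";

# Limit of 2: the first field, then the rest of the line left untouched.
my ($key, $data) = split /\t/, $line, 2;

# No limit: every tab-separated field becomes its own element.
my @fields = split /\t/, $line;

print "key:    $key\n";                    # chr1
print "data:   $data\n";                   # 100, 159 and 0, still joined by tabs
print "fields: ", scalar(@fields), "\n";   # 4
```

Keeping the remainder as one string means the row can be written back out with a single concatenation, as in the `print $fh "$key\t$row\n"` line above.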
Re: Using hash keys to separate data by Marshall (Canon) on Jun 29, 2011 at 09:07 UTC
"My main goal is to just separate each chr from the input file (testReg.txt) into separate files. If you have any suggestions please let me know."

You are over-thinking this. You don't even need the file hashKey.txt. The file testReg.txt is, I think, the 15GB monster file. If this file is not already sorted, use the system command-line sort to do that. The command-line sort can sort things far bigger than the size of memory. Now all of the lines that have the same chromosome will be grouped together in the file. We just read the file, and every time we switch to a new chromosome, we start a new file.

#!/usr/bin/perl -w
use strict;

my $curr_chrom = "";
while (<DATA>) {
    my ($chrom) = split;  # $chrom is the first column
                          # parens on the left side are needed
                          # for list context
    if ($chrom ne $curr_chrom) {
        $curr_chrom = $chrom;
        open (OUT, '>', "$curr_chrom.out")
            or die "unable to write $curr_chrom.out $!\n";
    }
    print OUT;
}
close OUT;

__DATA__
chr1    100    159    0
chr1    200    260    0
chr1    500    750    0
chr3    450    700    0
chr4    100    300    0
chr7    350    600    0
chr9    100    125    0
chr11    679    687    0
chr22    100    200    0
chr22    300    400    0

A few notes:

If a file handle is open to one file and it is then opened again to another file, the first file is closed automatically (no need to close it explicitly).

For your data, you normally want to split on any run of whitespace characters. A bare split does that: it is equivalent to split(' ', $_), which behaves like split(/\s+/, $_) except that leading whitespace is ignored, and it is what my ($chrom) = split; uses. Splitting on \t is probably not what you want, and splitting on \n certainly is not.

Update: From the wording of the post, I don't think that you are interested in a subset of the chromosomes in the input file, but if you were, then here's how. Make a hash table whose keys are the chromosomes that you want. In the above program, when the chromosome changes (the if statement), test whether the chromosome is on the "approved" list (the name exists in the hash table) or not. If it does exist, then open OUT to that name as above; if it does not, then open OUT to "/dev/null". /dev/null is a special device that discards everything written to it (it is the "bit bucket"). That way you always execute the print OUT; statement. Sometimes the output goes somewhere useful and sometimes into the black hole of bits.

To make the hash, your code:

while (<KEY>) {
    chomp;
    @key_split = split("\n");
    $Chr{"$key_split[0]"} = $key_split[0];
}

## better written as: ##

while (<KEY>) {
    my ($chrom) = split;
    $Chr{$chrom}=1;
}
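The "approved list" variant described in the update could be sketched as follows. This is an illustration under stated assumptions, not the poster's code: the `%Chr` contents and the sample records are made up, the input is read from an in-memory string instead of the real sorted file, and `File::Spec->devnull` is used in place of a hard-coded "/dev/null" so that the sketch also works where the null device has a different name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical "approved" list; in practice it would be built from hashKey.txt.
my %Chr = map { $_ => 1 } qw(chr1 chr22);

# Stand-in for the sorted input file (assumed sample data).
my $input = join '',
    "chr1\t100\t159\t0\n", "chr1\t200\t260\t0\n",
    "chr7\t350\t600\t0\n",
    "chr22\t100\t200\t0\n", "chr22\t300\t400\t0\n";
open my $in, '<', \$input or die "cannot read in-memory input: $!";

my $curr_chrom = "";
my $out;    # current output handle, reopened whenever the chromosome changes

while (my $line = <$in>) {
    my ($chrom) = split ' ', $line;    # first column

    if ($chrom ne $curr_chrom) {
        $curr_chrom = $chrom;
        # Approved chromosomes get their own file; everything else is discarded.
        my $target = exists $Chr{$chrom} ? "$chrom.out" : File::Spec->devnull;
        open $out, '>', $target or die "unable to write $target: $!\n";
    }
    print {$out} $line;
}
close $out if $out;
close $in;
```

Reopening `$out` closes whatever it previously pointed at, so at most one output file is open at any time, however many chromosomes the input contains.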
Re^2: Using hash keys to separate data by a217 (Novice) on Jun 29, 2011 at 15:51 UTC
Marshall, I suppose I was over-thinking it. Your method looks to read in constant time, and for a large input file that I'm working with I think that may be beneficial. The only reason I included the key list is because I thought that would be the easiest way to separate the input data into separate files. However, the input data is already well-sorted so your method should work.

One more question in general: with my code and the suggestions everyone has given, there is still an error message (despite the fact that the output is correct). Is there any way to get rid of this error message or is it just something I am going to have to deal with? The message refers to uninitialized value errors, and I was trying to fix this before. However, I suppose if the output is still correct that is the only thing that matters.
Re^3: Using hash keys to separate data by Marshall (Canon) on Jun 29, 2011 at 22:43 UTC
Yes, once sorted, the algorithm just reads the file once in a linear fashion. So this should be great for your humongous file.

The warning message should give you a line number in the code, and often you also get the line number of the input file. One common way to get an uninitialized value is a blank line in the file: the split then returns no fields, so the variables it was meant to fill stay undefined. An extra carriage return is easy to miss since it is "invisible". I often put: next if /^\s*$/; which goes to the next input line if the current line contains nothing but whitespace.

I think that I already mentioned that you should normally be splitting on runs of whitespace (in effect the regex /\s+/), which is the default. Whitespace (\s) includes the space, of course, plus \n, \r, \f and \t; any contiguous sequence of those is treated as a single separator. Splitting on just tab characters (\t) can cause problems if there are sometimes extra space characters in there that you cannot see in the editor.

I think you are on the right track - keep at it!
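As a sketch of the two suggestions above, the loop below skips blank lines before splitting and splits on runs of whitespace rather than on a single tab. The sample lines are assumptions chosen to include a blank line, a whitespace-only line, and a record that mixes tabs and spaces.

```perl
use strict;
use warnings;

# Sample input with blank lines and mixed tabs/spaces (assumed data).
my @lines = ("chr1\t100\t159\t0\n", "\n", "chr3  450\t 700 0\n", "   \n");

my @chroms;
for my $line (@lines) {
    next if $line =~ /^\s*$/;    # blank or whitespace-only line: nothing to split

    # split ' ' treats any run of whitespace as one separator and
    # ignores leading whitespace, so stray spaces do not create empty fields.
    my ($chrom, $start, $end, $score) = split ' ', $line;

    push @chroms, $chrom;
    print "$chrom: $start-$end ($score)\n";
}
```

Without the `next` line, the two blank entries would leave `$chrom` and the other variables undefined, and the `print` would trigger the "uninitialized value" warnings under `use warnings`.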