in reply to Re^3: Query large tab delimited file by a list
in thread Query large tab delimited file by a list

The first file is really big. The second file is about 200 MB with a single column (id). Yes, the script should extract records from file 1 that match an id in file 2. The second output should be a kind of statistics, but that is not important yet. I have about 32 GB of RAM, and I would like to avoid using a database because I don't have any experience with them. I'm new to Perl; I am able to read the file formats into Perl, but I have no clue yet how to extract the ids. Would be glad if you can help.
Code:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Variables
    my $file1 = '/home//Desktop/file1.txt';
    my $file2 = '/home/Desktop/file2.txt';

    # Filehandles (distinct lexical handles -- the original reused $FH twice)
    open( my $FH1, '<', $file1 ) or die "Couldn't open file \"$file1\": $!\n";
    open( my $FH2, '<', $file2 ) or die "Couldn't open file \"$file2\": $!\n";

    # Program for reading dbsnp
    my @file1_rows = split( "\t", $file1 );
    ...

Replies are listed 'Best First'.
Re^5: Query large tab delimited file by a list
by Marshall (Canon) on Jul 03, 2016 at 17:55 UTC
    to continue on: (untested)

        open( my $FH1, '<', $file1 ) or die "Couldn't open file \"$file1\": $!";
        open( my $FH2, '<', $file2 ) or die "Couldn't open file \"$file2\": $!";

        # Build a lookup hash of all ids from file 2
        my %ids;
        while ( my $id = <$FH2> ) {
            chomp $id;          # remove line ending
            $ids{$id} = 1;
        }

        my $line = <$FH1>;      # throw away first header line
        while ( $line = <$FH1> ) {
            # get 'rs2342349' from: "chr1  11223  11224  rs2342349\n"
            my ($id) = ( split /\s+/, $line )[3];   # whitespace chars also include tabs
            print $line if exists $ids{$id};
        }
    Update: You will notice that I removed the "\n" from the "die" statement. If the message does not end in "\n", die appends the script name and line number for you. If you explicitly put in a "\n", that suppresses this and changes what "die" prints! Whoa! Here is a short demo:

        #open IN, '<', 'somename' or die "xxx $!\n";
        # prints: xxx No such file or directory

        open IN, '<', 'somename' or die "xxx $!";
        # prints: xxx No such file or directory at C:\Projects_Perl\testing\junk.pl line 4.
    Update 2: RE: the statistics

    If "Number_pos_1st_file" is just the line count, then that is easy. If these pos values are not unique, then I see problems: the file is so large that a hash to count them likely won't fit into memory. In that case, I would do a system sort on the file and then read through it once to count the unique pos values.
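    A minimal sketch of that system-sort approach, assuming (hypothetically) that pos is the second tab-separated column of file1.txt and that standard Unix tools are available; sort(1) does an external merge sort on disk, so memory is not a problem:

        # Count distinct pos values without building an in-memory hash.
        # cut extracts column 2, sort orders it on disk, uniq collapses
        # repeats, and wc -l reports how many distinct values remain.
        cut -f2 file1.txt | sort -n | uniq | wc -l

    The same pipeline with `uniq -c | sort -rn` would instead show how often each pos value repeats, which may help decide whether the duplicates matter.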

    Is "Number_pos_2nd_file" just scalar keys %ids? Or perhaps it is the line count?