in reply to Query large tab delimited file by a list

Hi Elninh05! It would be of enormous help if you used <code>...</code> tags for the column formats of these files. As written, your post is hard to understand. If you format it better, you will certainly get better answers.

Re^2: Query large tab delimited file by a list
by Elninh05 (Novice) on Jul 03, 2016 at 16:06 UTC
    Hi Marshall, thanks for the tip. I'm new to Perl and the community, but I hope to improve my Perl skills more and more.
      I see that file 1 is humongous (6 GB). How big is file 2? I guess output 1 is "extract records from file 1 that match an ID in file 2"? I am unclear as to the algorithm for output 2.

      Have you tried any code yet? If so, post it along with your thoughts on algorithms.

      Update: The size of file 2 matters because it determines whether its IDs can be kept in memory. If they can, output 1 is relatively easy. If not, some pre-sorting or a DB approach would be necessary.

        The first file is really big. The second file is about 200 MB with a single column (id). Yes, the script should extract records from file 1 that match an ID in file 2. The second output should be a kind of statistics, but it is not important yet. I have about 32 GB of RAM, and I would like to avoid using databases because I have no experience with them. I'm new to Perl: I am able to read the files into Perl, but I have no clue yet how to extract the IDs. I would be glad if you could help. Maybe using hash keys is the right approach?
        Code
        #!/usr/bin/env perl
        use strict;
        use warnings;

        # Variables
        my $file1 = '/home/Desktop/file1.txt';
        my $file2 = '/home/Desktop/file2.txt';

        # Filehandles (one distinct handle per file; reusing $FH would mask the first)
        open( my $FH1, '<', $file1 ) or die "Couldn't open file \"$file1\": $!\n";
        open( my $FH2, '<', $file2 ) or die "Couldn't open file \"$file2\": $!\n";

        # Program for comparison
        # Split a line read from the file, not the file name itself
        my $line = <$FH1>;
        chomp $line;
        my @file1_rows = split /\t/, $line;
        ...
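
A hash is a reasonable fit here. The sketch below loads the IDs from file 2 into a hash and then streams file 1 line by line, so the 6 GB file never has to fit in memory. It is only a minimal sketch under assumptions: the thread does not give the column layout, so the ID is assumed to be in the first column of file 1 (index 0), and the names `load_ids` and `filter_records` are made up for illustration. A Perl hash usually needs several times the raw size of the data it holds, so the 200 MB ID list may well take a few GB, which should still be comfortable with 32 GB of RAM.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Read one ID per line from a filehandle into a hash for fast lookups.
sub load_ids {
    my ($fh) = @_;
    my %ids;
    while ( my $id = <$fh> ) {
        chomp $id;
        $id =~ s/\r\z//;    # tolerate Windows line endings
        next if $id eq '';  # skip blank lines
        $ids{$id} = 1;
    }
    return \%ids;
}

# Stream tab-delimited records from $in and copy to $out every line whose
# ID column is a key in %$ids. $col is the 0-based index of the ID column
# (ASSUMPTION: the thread does not say which column holds the ID).
sub filter_records {
    my ( $in, $out, $ids, $col ) = @_;
    my $matched = 0;
    while ( my $line = <$in> ) {
        my $record = $line;
        chomp $record;
        $record =~ s/\r\z//;
        my @fields = split /\t/, $record, -1;
        next unless defined $fields[$col] && exists $ids->{ $fields[$col] };
        print {$out} $line;    # write the original line unchanged
        $matched++;
    }
    return $matched;
}

# Runs only when exactly two file names are given: file 1 (big), file 2 (IDs).
if ( @ARGV == 2 ) {
    my ( $big_file, $id_file ) = @ARGV;

    open( my $id_fh, '<', $id_file )
        or die "Couldn't open file \"$id_file\": $!\n";
    my $ids = load_ids($id_fh);
    close $id_fh;

    open( my $big_fh, '<', $big_file )
        or die "Couldn't open file \"$big_file\": $!\n";
    my $count = filter_records( $big_fh, \*STDOUT, $ids, 0 );
    close $big_fh;

    warn "$count matching records\n";
}
```

Saved under a hypothetical name such as `extract.pl`, it could be run as `perl extract.pl file1.txt file2.txt > output1.txt`. If the ID sits in a different column of file 1, change the last argument of `filter_records` accordingly.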