Query large tab delimited file by a list

Elninh05 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Query large tab delimited file by a list by choroba (Cardinal) on Jul 03, 2016 at 16:39 UTC
file1 is probably too large to hold in memory, but you haven't told us how large file2 is. I guess it's much smaller, so you could hash its ids and then read the large file line by line, checking whether each id was mentioned in file2:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Load the ids from the smaller file into a hash for fast lookups.
my %ids;
open my $IDS, '<', 'file2' or die $!;
while (<$IDS>) {
    chomp;
    $ids{$_} = 1;
}

# Stream the large file and keep only the rows whose id is listed.
open my $OUT1,  '>', 'output1' or die $!;
open my $LARGE, '<', 'file1'   or die $!;
my $count_matches = 0;
while (<$LARGE>) {
    my ($pos, $start, $end, $id) = split;
    ++$count_matches, print {$OUT1} $_ if $ids{$id};
}
close $OUT1 or die $!;

# Write the summary; input_line_number is the count of lines read so far.
open my $OUT2, '>', 'output2' or die $!;
print {$OUT2} "Number_pos_1st_file\t", $LARGE->input_line_number, "\n";
print {$OUT2} "Number_pos_2nd_file\t", $IDS->input_line_number, "\n";
print {$OUT2} "Nr_Matching\t", $count_matches, "\n";
print {$OUT2} "Nr_Non_matching\t",
    $IDS->input_line_number - $count_matches, "\n";
close $OUT2 or die $!;
```
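One caveat about the last line of that report: `$IDS->input_line_number - $count_matches` subtracts the matching rows of file1 from the number of lines in file2, so it only equals the number of unmatched ids if every id occurs at most once in each file. Below is a minimal sketch of a variation that counts distinct ids instead, assuming that is what "non-matching" is supposed to mean (the original question does not say). The sub name `filter_by_ids`, the keys of the returned hash, and the in-memory sample data are made up for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: reads ids (one per line) from $ids_fh, streams
# $large_fh, prints rows whose 4th column is a listed id to $out_fh,
# and returns a hash reference with the counts.
sub filter_by_ids {
    my ($ids_fh, $large_fh, $out_fh) = @_;

    my %seen;                        # id => number of rows matched so far
    while (my $id = <$ids_fh>) {
        chomp $id;
        next unless length $id;      # skip blank lines
        $seen{$id} = 0;
    }

    my ($rows, $matched_rows) = (0, 0);
    while (my $line = <$large_fh>) {
        ++$rows;
        my $id = (split ' ', $line)[3];
        next unless defined $id && exists $seen{$id};
        ++$seen{$id};
        ++$matched_rows;
        print {$out_fh} $line;
    }

    my $ids_total   = keys %seen;                     # distinct ids listed
    my $ids_matched = grep { $_ > 0 } values %seen;   # ids seen at least once
    return {
        rows         => $rows,
        ids          => $ids_total,
        matched_rows => $matched_rows,
        ids_matched  => $ids_matched,
        ids_missing  => $ids_total - $ids_matched,
    };
}

# Tiny demonstration on in-memory data (assumed sample, not the real files).
# The first row appears twice on purpose.
my $ids_text   = "rs2342349\nrs3274234\nrs2342344\n";
my $large_text = "chr1\t11223\t11224\trs2342349\n"
               . "chr2\t23423\t23424\trs6345435\n"
               . "chr1\t11223\t11224\trs2342349\n";

open my $IDS,   '<', \$ids_text   or die $!;
open my $LARGE, '<', \$large_text or die $!;
my $out_text = '';
open my $OUT, '>', \$out_text or die $!;

my $stats = filter_by_ids($IDS, $LARGE, $OUT);
close $OUT or die $!;

printf "%s\t%d\n", $_, $stats->{$_} for sort keys %$stats;
```

For the real data you would open 'file2', 'file1' and 'output1' in place of the in-memory strings. The memory cost is still one hash entry per id in file2, the same as in the script above.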
Re: Query large tab delimited file by a list by GrandFather (Saint) on Jul 03, 2016 at 21:37 UTC
How often does this sort of task need to be done? If the answer is more than once with similar data then you should seriously consider using a database to do the heavy lifting. To get your eye in take a look at Databases made easy.
Re: Query large tab delimited file by a list by BrowserUk (Patriarch) on Jul 03, 2016 at 17:17 UTC
This will take care of the first output file quickly and easily:

```perl
#! perl -slw
use strict;
use Inline::Files;

my $tally = '';
m[rs(\d{7})] and vec( $tally, $1 - 1e6, 1 ) = 1 while <FILE2>;
m[rs(\d{7})] and vec( $tally, $1 - 1e6, 1 ) and print while <FILE1>;

__DATA__
__FILE1__
chr1 11223 11224 rs2342349
chr2 23423 23424 rs6345435
chr3 64564 64565 rs3432456
chr4 56456 56457 rs7979979
__FILE2__
rs2342349
rs3274234
rs2342344
```

The second output file makes no sense. The position of what in the first file? The position of what in the second? Number of matching what? Number of non-matching what? Is that 4 lines (including all the boiler-plate text) for every line in file 1; or file 2? Or every line in file 2 that was matched? Or...
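For anyone puzzled by the `vec` calls: the string `$tally` is used as a bit vector with one bit per possible id, so the whole id list costs at most about 1.1 MB (9,000,000 bits) no matter how many ids file2 holds. Here is a minimal sketch of the same idea spelled out, assuming every id is exactly `rs` followed by seven digits with a non-zero first digit. The helper names `id_to_index`, `remember` and `is_known` are made up for illustration. Unlike the one-liner, this version rejects ids that do not fit the pattern, because an unanchored `rs(\d{7})` would treat a longer id such as rs12345678 as rs1234567, and an id below rs1000000 would produce a negative bit index.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $OFFSET = 1_000_000;   # smallest accepted id number, mapped to bit 0
my $bits   = '';          # bit string, grows on demand as bits are set

# Map "rs1234567" to a bit index, or undef if the id does not fit the
# assumed format (exactly 7 digits, first digit non-zero).
sub id_to_index {
    my ($id) = @_;
    return $id =~ /\Ars([1-9]\d{6})\z/ ? $1 - $OFFSET : undef;
}

# Set the bit that belongs to this id.
sub remember {
    my $i = id_to_index($_[0]);
    vec($bits, $i, 1) = 1 if defined $i;
    return;
}

# Return 1 if the bit for this id is set, otherwise 0.
sub is_known {
    my $i = id_to_index($_[0]);
    return 0 unless defined $i;
    return vec($bits, $i, 1);   # reading past the end of the string gives 0
}

# Ids taken from the sample data in the post above.
remember($_) for qw(rs2342349 rs3274234 rs2342344);

for my $id (qw(rs2342349 rs6345435)) {
    printf "%s %s\n", $id, is_known($id) ? 'listed' : 'not listed';
}
printf "bit string uses %d bytes\n", length $bits;
```

If the real ids vary in length, this scheme needs a different mapping from id to bit index, or a hash as in choroba's reply.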
Re: Query large tab delimited file by a list by hippo (Archbishop) on Jul 04, 2016 at 09:06 UTC
I see no reason to avoid the obvious tool: grep. `$ grep -F -f file2 file1 > outputfile` This should construct your first output file and likely do it faster than perl (benchmarking left as an exercise). If you can come up with an algorithm for your second output file then we can look at that in more detail.
Re: Query large tab delimited file by a list by Marshall (Canon) on Jul 03, 2016 at 16:00 UTC
Hi Elninh05! It would be of enormous help if you used `<code>...</code>` tags for the column formats of these files. As written, your post is hard to understand. If you format it better, you will certainly get better answers.
Re^2: Query large tab delimited file by a list by Elninh05 (Novice) on Jul 03, 2016 at 16:06 UTC
Hi Marshall, thanks for the tip. I'm new to Perl and to the community, but I hope to keep improving my Perl skills.
Re^3: Query large tab delimited file by a list by Marshall (Canon) on Jul 03, 2016 at 16:39 UTC
I see that file 1 is humongous (6 GB). How big is file 2? I guess output 1 is "extract the records from file 1 that match an id in file 2". I am unclear as to the algorithm for output 2. Have you tried any code yet? If so, post it along with your thoughts on algorithms. Update: The size of file 2 matters in terms of whether it can be kept in memory or not. If it can, output 1 is relatively easy. If not, then some pre-sorting or a DB approach would be necessary.
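To make the pre-sorting option concrete, here is a minimal sketch of a merge join. It assumes both files have already been sorted by id in plain byte order (for example with the system sort under `LC_ALL=C`), with file 1 sorted on its fourth column. Both files are then read once, in step, and only one line of each is held in memory. The sub name `merge_join` and the in-memory sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: both handles must be sorted by id in the byte order
# that Perl's lt/eq use (no "use locale"). Prints rows of $large_fh whose
# 4th column appears in $ids_fh and returns the number of rows printed.
sub merge_join {
    my ($ids_fh, $large_fh, $out_fh) = @_;
    my $matched = 0;

    my $want = <$ids_fh>;            # current id from the id list
    chomp $want if defined $want;

    while (defined $want and defined(my $line = <$large_fh>)) {
        my $id = (split ' ', $line)[3];
        next unless defined $id;     # skip rows with fewer than 4 columns

        # Advance the id list until it catches up with the current row.
        while (defined $want and $want lt $id) {
            $want = <$ids_fh>;
            chomp $want if defined $want;
        }
        last unless defined $want;   # id list exhausted, nothing can match

        if ($want eq $id) {
            print {$out_fh} $line;
            ++$matched;
        }
    }
    return $matched;
}

# Demonstration on small, already sorted in-memory data (assumed sample).
my $ids_text   = "rs2342344\nrs2342349\nrs3274234\n";
my $large_text = "chr1\t11223\t11224\trs2342349\n"
               . "chr3\t64564\t64565\trs3432456\n"
               . "chr2\t23423\t23424\trs6345435\n"
               . "chr4\t56456\t56457\trs7979979\n";

open my $IDS,   '<', \$ids_text   or die $!;
open my $LARGE, '<', \$large_text or die $!;
my $out_text = '';
open my $OUT, '>', \$out_text or die $!;

my $matched = merge_join($IDS, $LARGE, $OUT);
close $OUT or die $!;

print $out_text;
print "matched rows: $matched\n";
```

The trade-off is the cost of sorting a 6 GB file up front, so this only pays off when the id list really does not fit in memory as a hash.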
Re^4: Query large tab delimited file by a list by Elninh05 (Novice) on Jul 03, 2016 at 17:27 UTC
Re^5: Query large tab delimited file by a list by Marshall (Canon) on Jul 03, 2016 at 17:55 UTC
Re^4: Query large tab delimited file by a list by Elninh05 (Novice) on Jul 03, 2016 at 17:28 UTC
Re: Query large tab delimited file by a list by pme (Monsignor) on Jul 03, 2016 at 16:33 UTC
Hi Elninh05, how big is the other file, the one containing only ids in a single column?
Re^2: Query large tab delimited file by a list by Elninh05 (Novice) on Jul 03, 2016 at 18:11 UTC
Hi pme, the second file would be approx. 200 MB.
Re: Query large tab delimited file by a list by Elninh05 (Novice) on Jul 04, 2016 at 19:33 UTC
Dear Monks, I'm very grateful for your help! With choroba's script I was able to solve the problem in a computationally efficient way. When I only grep the two files, I run out of memory and the command dies. Thanks again to everyone for your help!