I'd like to write a Perl program that takes as input one file of genome loci (potentially tens of GB in size) and one region file (typically around 1 GB), and I'm trying to work out the best approach for filtering the loci file based on the coordinates in the region file.

The loci file looks like this (the lines are truncated here; the important columns are the first two, chromosome and position, although I'd keep the whole line for each match):

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    IN
1    874816    .    C    CT    10    FINAL_AMBG
1    1647893    .    C    CTTTCTT    30    FINAL_NOT_
1    7889972    rs57875989    GAGAATCCATCCCATCCTACTGCCAG
1    14106394    .    A    ACTC    100    FI
1    22332156    .    AGG    A    10    FI
1    22332161    .    T    TC    0    FI

The region file looks like this:

1    69091    69290
1    69291    69490
1    69491    69690
1    69691    70008
1    861321    861393
1    865535    865716
1    866418    866469

The first column is the chromosome, the second column is the start coordinate, and the third column is the end coordinate.

What I'd like to do is keep only the lines from the first (loci) file whose 'POS' falls between the start and end coordinates of a region on the same chromosome in the second (region) file.

A nested while loop would seem to be grossly inefficient, so I was thinking of building a hash of arrays. But before I do, I'd like to draw on the wisdom of the Monks on whether there's a better approach. Execution speed and memory efficiency are paramount. Suggestions much appreciated.
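For concreteness, here is a minimal sketch of the hash-of-arrays idea, not a tuned implementation. It loads the regions into a hash keyed by chromosome, merges overlapping intervals, and binary-searches the sorted start coordinates for each locus while streaming the loci file line by line. The sub names (`load_regions`, `in_region`, `filter_loci`) are made up for illustration. It assumes that start and end are both inclusive, that chromosome names match literally between the two files, and that `#` header lines in the loci file should be passed through.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read "chrom start end" lines and return
# { chrom => { s => [sorted starts], e => [matching ends] } },
# with overlapping intervals merged so a single lookup suffices.
sub load_regions {
    my ($fh) = @_;
    my %raw;
    while (my $line = <$fh>) {
        next if $line =~ /^\s*(?:#|$)/;          # skip comments and blank lines
        my ($chr, $start, $end) = split ' ', $line;
        push @{ $raw{$chr} }, [ $start, $end ];
    }
    my %idx;
    for my $chr (keys %raw) {
        my (@s, @e);
        for my $iv (sort { $a->[0] <=> $b->[0] } @{ $raw{$chr} }) {
            if (@s && $iv->[0] <= $e[-1]) {
                # Overlaps the previous interval: extend it if needed.
                $e[-1] = $iv->[1] if $iv->[1] > $e[-1];
            }
            else {
                push @s, $iv->[0];
                push @e, $iv->[1];
            }
        }
        $idx{$chr} = { s => \@s, e => \@e };
        delete $raw{$chr};                        # release the unmerged copy
    }
    return \%idx;
}

# True if $pos lies inside some region on $chr (inclusive bounds assumed).
sub in_region {
    my ($idx, $chr, $pos) = @_;
    my $r = $idx->{$chr} or return 0;
    my ($s, $e) = @{$r}{qw(s e)};
    my ($lo, $hi) = (0, $#$s);
    # Binary search for the last interval whose start is <= $pos.
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($s->[$mid] <= $pos) { $lo = $mid + 1 }
        else                    { $hi = $mid - 1 }
    }
    return ($hi >= 0 && $pos <= $e->[$hi]) ? 1 : 0;
}

# Stream the loci file; only one line is held in memory at a time.
sub filter_loci {
    my ($idx, $in, $out) = @_;
    while (my $line = <$in>) {
        if ($line =~ /^#/) { print {$out} $line; next }   # assumption: keep headers
        my ($chr, $pos) = split ' ', $line, 3;
        print {$out} $line if defined $pos && in_region($idx, $chr, $pos);
    }
}

# Usage: perl filter.pl regions.txt loci.txt > filtered.txt
if (@ARGV == 2) {
    my ($region_file, $loci_file) = @ARGV;
    open my $rfh, '<', $region_file or die "$region_file: $!";
    my $idx = load_regions($rfh);
    close $rfh;
    open my $lfh, '<', $loci_file or die "$loci_file: $!";
    filter_loci($idx, $lfh, \*STDOUT);
}
```

Each lookup costs O(log n) in the number of regions on that chromosome, and the large loci file is never held in memory. The region index is the memory cost: with a 1 GB region file, plain Perl arrays of numbers may still take several GB, so packing the coordinates into strings, or walking both files in step if they are sorted the same way, might be worth considering.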

In reply to Best approach for large-scale data processing by iangibson
