
Sorry, I don't completely understand what you want to do and what your data really looks like. Looking at file 1:

contig1    10037203    10038203    blah
contig1    10037203    10038203    blah
contig1    10037203    10038203    blah

When a range in file 1 starts with 10037203, is it the case that all lines of file 1 with that start value also have the same end-of-range value (10038203 in the example)? If so, start by reading file 1 and storing, in a hash, the start value (as the key) and the end value (as the value). At the same time, also store all the lines that share the same range start value (in another hash, for example, or in the same hash if you know how to use a hash of hashes).
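A minimal sketch of that first step, using a hash of hashes. It assumes whitespace-separated columns (name, start, end, then anything else) and that every line with a given start has the same end. It keys on the start value only, as described above, so the contig name is ignored. The sub name `read_file1` and the inner keys `end` and `lines` are made up for illustration.

```perl
use strict;
use warnings;

# Build a hash keyed on the range start. Each value is a small hash holding
# the range end and every file 1 line seen with that start.
# Assumed column layout: name, start, end, anything else.
sub read_file1 {
    my ($fh) = @_;
    my %ranges;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;                 # skip blank lines
        my (undef, $start, $end) = split ' ', $line;
        $ranges{$start}{end} //= $end;             # assumed identical for a given start
        push @{ $ranges{$start}{lines} }, $line;   # keep the whole line for printing later
    }
    return \%ranges;
}
```

You would call it with an open filehandle, for example `open my $fh, '<', 'file1.txt' or die $!; my $ranges = read_file1($fh);` (the file name is hypothetical).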

Then read file 2. For each line, get the current start and end values and look for all keys of the first hash created above:

- which are between the start and end values of the current line (cases 3 and 4 in the scheme below);

- where the start value (key) is below the current start and the end (value) above the current start value (cases 1 and 2 below);

- etc.
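The two conditions listed above could be written roughly as follows. This is only a sketch: it assumes the file 1 hash has the shape start => { end => ... }, it treats end points as inclusive (ranges that merely touch count as overlapping), and the sub name `overlapping_starts` is invented for the example.

```perl
use strict;
use warnings;

# Return the file 1 start keys whose range overlaps [$start2, $end2],
# where $start2 and $end2 come from the current line of file 2.
# $ranges is a hash ref of the assumed form { start => { end => ... } }.
sub overlapping_starts {
    my ($ranges, $start2, $end2) = @_;
    my @hits;
    for my $start1 (keys %$ranges) {
        my $end1 = $ranges->{$start1}{end};
        if ($start1 >= $start2 && $start1 <= $end2) {
            push @hits, $start1;    # file 1 start lies inside the current range
        }
        elsif ($start1 < $start2 && $end1 >= $start2) {
            push @hits, $start1;    # file 1 starts earlier but reaches the current start
        }
    }
    my @sorted = sort { $a <=> $b } @hits;
    return @sorted;
}
```

Note that this walks over every key for every line of file 2, which is simple but slow on large inputs.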

It is probably easier to get it right with a graphic representation. You basically have four possible cases of overlap (numbered 1 to 4 from top to bottom):

F1:   |-----|
F2:      |-----|

F1:   |-----|
F2:     |--|

F1:   |-----|
F2: |-----|

F1:   |-----|
F2:  |---------|
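As a side note, all four cases can be caught with a single test: two ranges overlap when each one starts no later than the other one ends. The sketch below assumes inclusive end points, and the numbers are invented to mirror the four pictures, with F1 fixed at 10..20.

```perl
use strict;
use warnings;

# True when [$s1, $e1] and [$s2, $e2] share at least one position.
# End points are treated as inclusive, which is an assumption.
sub ranges_overlap {
    my ($s1, $e1, $s2, $e2) = @_;
    return $s1 <= $e2 && $s2 <= $e1;
}

# The four cases from the diagram, in order, with made-up coordinates.
my @cases = ([13, 23], [12, 15], [8, 18], [9, 22]);
for my $f2 (@cases) {
    my $verdict = ranges_overlap(10, 20, @$f2) ? "overlap" : "no overlap";
    print "F1 10..20, F2 $f2->[0]..$f2->[1]: $verdict\n";
}
```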

If you find an overlap, print the stored lines from file 1 and then all lines of file 2 which have the same start point.

If your files are large, this might become time-consuming, but it could be optimized greatly if you know something more about the maximum possible size of the ranges.
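One way such knowledge might help, as a sketch only: if no file 1 range is longer than some known `$max_len`, then only the start keys between the current start minus `$max_len` and the current end can possibly overlap. With the keys sorted once into an array, a binary search finds that window quickly. The names `lower_bound`, `candidate_starts` and `$max_len` are assumptions made for this example, and the candidates that start before the current range still need their end value checked.

```perl
use strict;
use warnings;

# Index of the first element of the sorted array @$starts that is >= $target
# (returns the array length if there is none).
sub lower_bound {
    my ($starts, $target) = @_;
    my ($lo, $hi) = (0, scalar @$starts);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($starts->[$mid] < $target) { $lo = $mid + 1 }
        else                           { $hi = $mid }
    }
    return $lo;
}

# Candidate file 1 starts for a file 2 range [$start2, $end2], assuming
# end - start <= $max_len holds for every file 1 range.
sub candidate_starts {
    my ($starts, $max_len, $start2, $end2) = @_;
    my @found;
    for (my $i = lower_bound($starts, $start2 - $max_len); $i < @$starts; $i++) {
        last if $starts->[$i] > $end2;    # sorted, so nothing further can overlap
        push @found, $starts->[$i];
    }
    return @found;
}
```

The sorted array would be built once after reading file 1, for instance with `my @starts = sort { $a <=> $b } keys %$ranges;`.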

Well, these are quick guidelines for a possible solution. I haven't worked out all the details, but I think it should more or less work.

Another possible and completely different approach would be to sort both files by their start values and then read them in parallel. But I should warn you that this is not quite as easy as it might look at first glance: you'll find a number of special cases to take into account.
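For what it is worth, here is one possible shape for the sorted approach, written as a sweep over in-memory lists rather than over the files themselves. It is a sketch under assumptions: both inputs are arrays of [start, end] pairs already sorted by start, end points are inclusive, and the sub name `sweep_overlaps` is invented. It keeps a list of "still open" ranges per file, which is one way to deal with the special cases where a long range overlaps several later ones.

```perl
use strict;
use warnings;

# $f1 and $f2 are array refs of [start, end] pairs, each sorted by start.
# Returns an array ref of [file1_range, file2_range] overlapping pairs.
sub sweep_overlaps {
    my ($f1, $f2) = @_;
    my ($i, $j) = (0, 0);
    my (@active1, @active2, @pairs);
    while ($i < @$f1 || $j < @$f2) {
        # Take the next range from whichever list has the smaller start.
        my $take1 = $j >= @$f2 || ($i < @$f1 && $f1->[$i][0] <= $f2->[$j][0]);
        if ($take1) {
            my $r = $f1->[$i++];
            # Drop file 2 ranges that ended before this one starts.
            @active2 = grep { $_->[1] >= $r->[0] } @active2;
            push @pairs, [$_, undef] for ();          # no-op placeholder removed below
            push @pairs, map { [$r, $_] } @active2;   # every survivor overlaps $r
            push @active1, $r;
        }
        else {
            my $r = $f2->[$j++];
            # Drop file 1 ranges that ended before this one starts.
            @active1 = grep { $_->[1] >= $r->[0] } @active1;
            push @pairs, map { [$_, $r] } @active1;
            push @active2, $r;
        }
    }
    return \@pairs;
}
```

Each overlapping pair is reported once, at the moment the later-starting of the two ranges is read.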

In reply to Re: output the overlapping regions to new file by Laurent_R
in thread output the overlapping regions to new file by Anonymous Monk
