When you have a range starting with 10037203 in file 1, is it the case that all lines of file 1 whose range starts with 10037203 also finish with the same end-of-range value (10038203 in the example)?

    contig1 10037203 10038203 blah
    contig1 10037203 10038203 blah
    contig1 10037203 10038203 blah

If such is the case, you start by reading file 1 and store in a hash the start value (as the key) and the end value (as the value). At the same time, also store all the lines with the same range start value (in another hash, for example, or in the same hash if you know how to use a hash of hashes).
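Here is a minimal sketch of that first pass, written in Python, where a dict plays the role of the hash. The names `index_file1`, `end_for_start` and `lines_for_start` are made up for the illustration, and I am assuming whitespace-separated columns with the start and end in the second and third fields, as in the sample lines. Like the description above, it keys on the start value only and ignores the contig name.

```python
from collections import defaultdict

def index_file1(lines):
    """Map each range start to its end, and to all lines sharing that start."""
    end_for_start = {}                    # start -> end
    lines_for_start = defaultdict(list)   # start -> [line, line, ...]
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        start, end = int(fields[1]), int(fields[2])
        # Assumption from the question above: one start always has one end.
        if end_for_start.setdefault(start, end) != end:
            raise ValueError(f"start {start} appears with two different ends")
        lines_for_start[start].append(line.rstrip("\n"))
    return end_for_start, lines_for_start

sample = [
    "contig1 10037203 10038203 blah",
    "contig1 10037203 10038203 blah",
    "contig1 10037203 10038203 blah",
]
end_for_start, lines_for_start = index_file1(sample)
print(end_for_start)                    # {10037203: 10038203}
print(len(lines_for_start[10037203]))   # 3
```

If the assumption does not hold, the `ValueError` tells you so early, and you would need to store a list of (end, line) pairs per start instead.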
Then you read file 2. For each line, get the current start and end values and look for all keys of the first hash created above:
- which are between the start and end values of the current line (cases 3 and 4 in the scheme below);
- where the start value (key) is below the current start and the end (value) above the current start value (cases 1 and 2 below);
- etc.
It is probably easier to get it right with a graphic representation. You basically have four possible cases of overlap (cases 1 to 4, from top to bottom):

    F1:  |-----|
    F2:     |-----|

    F1:  |-----|
    F2:   |--|

    F1:     |-----|
    F2:  |-----|

    F1:    |-----|
    F2:  |---------|

If you find an overlap, print the stored lines from file 1 and then all lines of file 2 which have the same start point.
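The four cases can be folded into a single test: two ranges overlap when each one starts before the other one ends. Below is a small Python sketch of that check and of the scan over the keys of the first hash. The function names and the numbers are invented for the example, and I am assuming that ranges which merely touch at an endpoint count as overlapping; use `<` instead of `<=` if they should not.

```python
def overlaps(s1, e1, s2, e2):
    """True if the closed ranges [s1, e1] and [s2, e2] share at least one point."""
    return s1 <= e2 and s2 <= e1

def overlapping_starts(end_for_start, s2, e2):
    """Start values (keys) of the file 1 ranges that overlap [s2, e2]."""
    return [s1 for s1, e1 in end_for_start.items() if overlaps(s1, e1, s2, e2)]

# One file 1 range and one file 2 range for each of the four cases.
f1 = (100, 200)
cases = {
    1: (150, 250),  # F2 starts inside F1 and ends after it
    2: (120, 160),  # F2 lies entirely inside F1
    3: (50, 150),   # F2 starts before F1 and ends inside it
    4: (50, 250),   # F2 covers F1 entirely
}
results = {n: overlaps(*f1, *f2) for n, f2 in cases.items()}
print(results)                         # {1: True, 2: True, 3: True, 4: True}
print(overlaps(*f1, 300, 400))         # False

end_for_start = {100: 200, 500: 550}   # start -> end, as built from file 1
print(overlapping_starts(end_for_start, 180, 520))   # [100, 500]
```

Each start returned by `overlapping_starts` is then the key under which the matching file 1 lines were stored.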
If your files are large, this might become time-consuming, since every line of file 2 is checked against every key of the hash, but it might be optimized greatly if you know something more about the maximum possible ranges.
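One way to read that hint, and this is my assumption rather than a worked-out design, is that you know (or can compute) the longest range in file 1. Then a file 1 range can only overlap the current file 2 range if its start lies between the current start minus that maximum length and the current end, so a binary search over the sorted keys narrows the scan. A Python sketch with invented names and data:

```python
from bisect import bisect_left, bisect_right

def build_index(end_for_start):
    """Sorted list of starts plus the length of the longest file 1 range."""
    starts = sorted(end_for_start)
    max_len = max((end_for_start[s] - s for s in starts), default=0)
    return starts, max_len

def candidates(starts, max_len, end_for_start, s2, e2):
    """File 1 starts whose closed range overlaps [s2, e2]."""
    lo = bisect_left(starts, s2 - max_len)   # earlier starts end before s2
    hi = bisect_right(starts, e2)            # later starts begin after e2
    # Only the narrow slice still needs the end-value check.
    return [s1 for s1 in starts[lo:hi] if end_for_start[s1] >= s2]

end_for_start = {100: 200, 500: 550, 900: 1000, 5000: 5100}
starts, max_len = build_index(end_for_start)
print(max_len)                                                # 100
print(candidates(starts, max_len, end_for_start, 180, 520))   # [100, 500]
print(candidates(starts, max_len, end_for_start, 210, 300))   # []
print(candidates(starts, max_len, end_for_start, 950, 960))   # [900]
```

This only pays off when the ranges are short compared with the spread of the start values; with one very long range in file 1 the slice grows back towards the full key list.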
Well, these are quick guidelines on a possible solution. I haven't worked out all the details, but I think it should more or less work.
Another possible and completely different approach would be to sort both files by their start values and then read them in parallel. But I should warn you that this is not quite as easy as it might look at first glance: you'll find a number of special cases to take into account.
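To give an idea of what the parallel read could look like, here is a rough Python sketch working on (start, end) pairs already extracted from the two files. The function name and the data are invented, and closed ranges are assumed. The `active` list deals with one of the special cases: a file 1 range that stays open across several file 2 ranges.

```python
def sweep_overlaps(ranges1, ranges2):
    """Return (file1_range, file2_range) pairs of overlapping closed ranges."""
    a = sorted(ranges1)   # file 1 ranges, by start
    b = sorted(ranges2)   # file 2 ranges, by start
    pairs = []
    active = []           # file 1 ranges that may still overlap something
    i = 0
    for s2, e2 in b:
        # Pull in every file 1 range starting no later than this range ends.
        while i < len(a) and a[i][0] <= e2:
            active.append(a[i])
            i += 1
        # A file 1 range ending before s2 cannot overlap this range, nor any
        # later file 2 range, because those start at s2 or beyond.
        active = [r for r in active if r[1] >= s2]
        # A range pulled in for an earlier, longer file 2 range may start
        # after e2, hence the extra start check here.
        pairs.extend((r, (s2, e2)) for r in active if r[0] <= e2)
    return pairs

ranges1 = [(100, 200), (300, 400), (1000, 1100)]
ranges2 = [(150, 350), (160, 170), (390, 500), (600, 700)]
pairs = sweep_overlaps(ranges1, ranges2)
for r1, r2 in pairs:
    print(r1, r2)
```

On real files you would read the two sorted files line by line instead of loading everything into lists, which is where most of the remaining special cases show up.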
In reply to Re: output the overlapping regions to new file
by Laurent_R
in thread output the overlapping regions to new file
by Anonymous Monk