I have two bioinformatics problems involving the GFF3 and BED file formats, with no obvious off-the-shelf solution from BEDTools or BEDOPS. The problems are:
Problem 1. Merge coordinates between two BED files, but not within them, at less than or equal to a user-specified distance of separation
Problem 2. Chain coordinates between two BED files, at less than or equal to a user-specified distance of separation
I have not written any code yet, because I wonder whether there are Perl or even BioPerl modules for processing generic numerical intervals or, more specifically, genomic intervals. Are there? I could not find any. Below is an example of the expected solutions to problems 1 and 2. Thank you!
File 1 - Genomic or numerical coords of feature type A - A1, A2, ... An
File 2 - Genomic or numerical coords of feature type B - B1, B2, ... Bm
Let's say Files 1 and 2, concatenated and then sorted, look like this:
Chr1 1000 4000 A1
Chr1 12000 18500 B1
Chr1 15000 22000 A2
Chr1 28000 29000 B2
Chr1 30000 32000 A3
Chr1 42000 44000 A4
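For illustration, the concatenate-and-sort step could be sketched as follows. This is a minimal sketch in Python rather than Perl, chosen only to keep it short; the function names `parse_bed` and `combined_sorted` are hypothetical, and it assumes four-column BED-like lines (chrom, start, end, name) separated by whitespace.

```python
def parse_bed(lines, kind):
    """Yield (chrom, start, end, name, kind) from BED-like lines.

    `kind` is a label ("A" or "B") recording which file the feature came from.
    Assumes at least four whitespace-separated columns per data line.
    """
    for line in lines:
        line = line.strip()
        # Skip blank lines and the usual BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split()[:4]
        yield chrom, int(start), int(end), name, kind


def combined_sorted(lines_a, lines_b):
    """Concatenate features from both inputs and sort by chrom, start, end.

    Chromosome names sort lexicographically here (so "Chr10" < "Chr2").
    """
    records = list(parse_bed(lines_a, "A")) + list(parse_bed(lines_b, "B"))
    return sorted(records, key=lambda r: (r[0], r[1], r[2]))
```

Any iterable of lines works as input, so open file handles can be passed directly, e.g. `combined_sorted(open("file1.bed"), open("file2.bed"))` (file names here are placeholders).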
Problem 1 - Report merged coordinates for A+B pairs separated by no more than 10KB
Note that a merged interval cannot consist of consecutive features of ONLY As or ONLY Bs; it MUST be an A+B pair, i.e. exactly one A and one B. Features of either type that remain UNpaired are reported as singletons.
Chr1 1000 4000 A1
Chr1 12000 22000 B1,A2
Chr1 28000 32000 B2,A3
Chr1 42000 44000 A4
# In the last line of the expected solution shown above, A4 is separated from A3 by 10KB, which is within the user-specified limit, but A4 cannot be paired with A3, since pairs have to be strictly A+B
# A1 and B1 are not merged here because B1 and A2 overlap, i.e. have zero distance of separation between them, and therefore the pairing of B1 with A2 is prioritized over the pairing of B1 with A1
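One possible reading of the Problem 1 rules can be sketched as a greedy pairing. This is only a sketch under stated assumptions, written in Python for brevity: it assumes that only features adjacent in sorted order are pairing candidates, and that "prioritized" means the candidate pair with the smallest gap wins (overlap counts as gap 0). The names `Feature`, `gap`, and `pair_ab` are hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    kind: str  # "A" or "B": which file the feature came from


def gap(x, y):
    """Distance between two intervals on one chromosome; 0 if they overlap."""
    return max(0, max(x.start, y.start) - min(x.end, y.end))


def pair_ab(features, max_gap):
    """Greedily pair neighboring A/B features, smallest gap first.

    Returns (chrom, start, end, names) tuples for pairs and leftover singletons.
    Assumption: only neighbors in sorted order are considered for pairing.
    """
    feats = sorted(features, key=lambda f: (f.chrom, f.start, f.end))
    candidates = []
    for i in range(len(feats) - 1):
        x, y = feats[i], feats[i + 1]
        if x.chrom == y.chrom and x.kind != y.kind and gap(x, y) <= max_gap:
            candidates.append((gap(x, y), i))

    used = set()
    out = []
    for _, i in sorted(candidates):  # closest pairs get first pick
        if i in used or i + 1 in used:
            continue
        used.update((i, i + 1))
        x, y = feats[i], feats[i + 1]
        out.append((x.chrom, min(x.start, y.start), max(x.end, y.end),
                    f"{x.name},{y.name}"))
    # Anything not consumed by a pair is reported unchanged.
    for i, f in enumerate(feats):
        if i not in used:
            out.append((f.chrom, f.start, f.end, f.name))
    return sorted(out)
```

Called with the six example features and `max_gap=10_000`, this pairs B1 with A2 before A1 gets a chance at B1, and never considers A3 with A4 because they are the same type. Other tie-breaking rules (for example, left-to-right priority) would need a different selection step.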
Problem 2 - Report coordinates for collapsed or chained intervals containing As and Bs, in which neighboring components are separated by no more than a user-specified distance (7KB in the example result below).
Here the "chained" coords may be singletons that cannot be chained due to distance; when chained, they can include runs of the same type, e.g. A1,A2 or B1,B2,B3
Chr1 1000 4000 A1
Chr1 12000 32000 B1,A2,B2,A3
Chr1 42000 44000 A4
# In the last line of the expected solution shown above, A4 is separated from A3 by 10KB, which exceeds the 7KB limit; otherwise the result would have been a longer "chain" of B1,A2,B2,A3,A4
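The chaining described for Problem 2 can be sketched as a single pass over the sorted features. This is a minimal sketch in Python, not a definitive implementation: the function name `chain` is hypothetical, and it assumes the gap is measured from the furthest end seen so far in the current chain to the start of the next feature.

```python
def chain(records, max_gap):
    """Collapse sorted features into chains whose neighbors lie within max_gap.

    `records` is an iterable of (chrom, start, end, name) tuples.
    Feature type (A or B) is ignored, so runs of one type can chain too.
    """
    chains = []
    for chrom, start, end, name in sorted(records):
        last = chains[-1] if chains else None
        # Extend the current chain if same chromosome and close enough.
        if last and last[0] == chrom and start - last[2] <= max_gap:
            last[2] = max(last[2], end)
            last[3].append(name)
        else:
            chains.append([chrom, start, end, [name]])
    return [(c, s, e, ",".join(names)) for c, s, e, names in chains]
```

With the six example features and `max_gap=7000`, A1 stays alone because it ends 8KB before B1 starts, and A4 stays alone because it starts 10KB after A3 ends.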
In reply to Merging intervals; Chaining intervals by onlyIDleft