Hello Perl Monks!

I have two bioinformatics problems involving the GFF3 and BED file formats, with no obvious off-the-shelf solution in BedTools or BedOps. The problems are:

Problem 1. Merge coordinates between two BED files, but not within them, when the features are separated by no more than a user-specified distance

Problem 2. Chain coordinates between two BED files when consecutive features are separated by no more than a user-specified distance

I have not written any code yet, because I wonder whether there are Perl or even BioPerl modules for processing generic numerical intervals or, more specifically, genomic intervals. Are there? I could not find any. Below I give an example with the expected solutions to problems 1 and 2. Thank you!

File 1 - Genomic or numerical coords of feature type A - A1, A2, ... An

File 2 - Genomic or numerical coords of feature type B - B1, B2, ... Bm

Let's say Files 1 and 2, concatenated and then sorted, look like this:

Chr1 1000 4000 A1
Chr1 12000 18500 B1
Chr1 15000 22000 A2
Chr1 28000 29000 B2
Chr1 30000 32000 A3
Chr1 42000 44000 A4
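To pin down what "concatenate, then sort" means here, this is a minimal sketch in Python (used only to illustrate the logic; it should port directly to Perl). The helper name `parse_bed` and the two in-memory line lists are hypothetical stand-ins for reading the real files, and it assumes four whitespace-separated columns (chrom, start, end, name).

```python
def parse_bed(lines):
    """Parse BED-like lines into (chrom, start, end, name) tuples.

    Hypothetical helper: assumes at least four whitespace-separated columns
    and skips blank lines, comments, and track/browser header lines.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split()[:4]
        records.append((chrom, int(start), int(end), name))
    return records


# Stand-ins for the contents of File 1 (type A) and File 2 (type B).
file1 = [
    "Chr1 1000 4000 A1",
    "Chr1 15000 22000 A2",
    "Chr1 30000 32000 A3",
    "Chr1 42000 44000 A4",
]
file2 = [
    "Chr1 12000 18500 B1",
    "Chr1 28000 29000 B2",
]

# Concatenate, then sort by chromosome, start, end (tuple comparison).
combined = sorted(parse_bed(file1) + parse_bed(file2))

for chrom, start, end, name in combined:
    print(chrom, start, end, name)
```

Note that chromosome names are compared as plain strings in this sketch, so Chr10 would sort before Chr2; a real script would need a natural-order sort key if that matters.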

Problem 1 - Report merged coordinates for A+B pairs separated by no more than 10KB

Note that a merged interval cannot consist of consecutive features of ONLY As or ONLY Bs. It MUST be an A+B pair, i.e. exactly one A and one B. Features that remain UNpaired are reported as singletons.

Chr1 1000 4000 A1
Chr1 12000 22000 B1,A2
Chr1 28000 32000 B2,A3
Chr1 42000 44000 A4

# In the last line of the expected solution above, A4 is separated from A3 by exactly 10KB, which is within the user-specified limit, but A4 cannot be paired with A3, since pairs have to be strictly A+B

# A1 and B1 are not merged here because B1 and A2 overlap and so have no separation between them; therefore, pairing B1 with A2 takes priority over pairing B1 with A1
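One possible reading of the pairing rules above, as a minimal Python sketch (illustration only, not a known module or tool). It makes several assumptions of its own: the feature type is the first letter of the name, only features adjacent in the sorted order are candidate partners, the closest candidates are paired first, and ties go to the leftmost pair. The function names `gap` and `merge_ab_pairs` are hypothetical.

```python
def gap(a, b):
    """Separation between two intervals on one chromosome; 0 if they overlap or touch."""
    return max(0, max(a[1], b[1]) - min(a[2], b[2]))


def merge_ab_pairs(records, max_dist):
    """Greedily merge adjacent A+B pairs, closest pairs first.

    records: iterable of (chrom, start, end, name) tuples.
    Assumption: the feature type is the first character of the name.
    """
    recs = sorted(records)

    # Collect candidate pairs: neighbours in sorted order, same chromosome,
    # different type, and no further apart than max_dist.
    candidates = []
    for i in range(len(recs) - 1):
        left, right = recs[i], recs[i + 1]
        same_chrom = left[0] == right[0]
        different_type = left[3][0] != right[3][0]
        if same_chrom and different_type and gap(left, right) <= max_dist:
            candidates.append((gap(left, right), i))

    # Accept candidates from smallest gap upward; each feature is used once.
    used = set()
    partner = {}
    for _, i in sorted(candidates):
        if i not in used and i + 1 not in used:
            used.update((i, i + 1))
            partner[i] = i + 1

    # Emit merged pairs and leftover singletons in sorted order.
    merged = []
    for i, rec in enumerate(recs):
        if i in partner:
            other = recs[partner[i]]
            merged.append((rec[0], min(rec[1], other[1]),
                           max(rec[2], other[2]), rec[3] + "," + other[3]))
        elif i not in used:
            merged.append(rec)
    return merged


example = [
    ("Chr1", 1000, 4000, "A1"), ("Chr1", 12000, 18500, "B1"),
    ("Chr1", 15000, 22000, "A2"), ("Chr1", 28000, 29000, "B2"),
    ("Chr1", 30000, 32000, "A3"), ("Chr1", 42000, 44000, "A4"),
]
for row in merge_ab_pairs(example, 10000):
    print(*row)
```

With the 10KB limit this reproduces the four expected lines above. Whether "closest first" is the right priority rule for data with many overlapping features is an open question; the example only fixes the B1/A2 versus A1/B1 case.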

Problem 2 - Report coordinates for collapsed or chained intervals containing As and Bs whose consecutive members are separated by no more than 7KB, as in the example result below.

Here the "chained" coords may be singletons that cannot be chained because of distance; when features are chained, the chain may include runs of the same type, e.g. A1,A2 or B1,B2,B3

Chr1 1000 4000 A1
Chr1 12000 32000 B1,A2,B2,A3
Chr1 42000 44000 A4

# In the last line of the expected solution above, A4 is separated from A3 by 10KB, which exceeds the 7KB limit; otherwise the result would have been a longer "chain" of B1,A2,B2,A3,A4
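The chaining rule can be sketched as a single pass over the sorted features. This is a minimal Python illustration under my own assumptions: distance is measured from the running end of the current chain to the start of the next feature, and feature type is ignored. The function name `chain_intervals` is hypothetical.

```python
def chain_intervals(records, max_dist):
    """Chain sorted intervals whose separation from the current chain is <= max_dist.

    records: iterable of (chrom, start, end, name) tuples.
    Feature type is ignored, so runs of the same type may be chained.
    """
    chains = []
    for chrom, start, end, name in sorted(records):
        if chains and chains[-1][0] == chrom and start - chains[-1][2] <= max_dist:
            # Close enough to the current chain: extend it and append the name.
            last = chains[-1]
            chains[-1] = (chrom, last[1], max(last[2], end), last[3] + "," + name)
        else:
            # New chromosome or too far away: start a new chain.
            chains.append((chrom, start, end, name))
    return chains


example = [
    ("Chr1", 1000, 4000, "A1"), ("Chr1", 12000, 18500, "B1"),
    ("Chr1", 15000, 22000, "A2"), ("Chr1", 28000, 29000, "B2"),
    ("Chr1", 30000, 32000, "A3"), ("Chr1", 42000, 44000, "A4"),
]
for row in chain_intervals(example, 7000):
    print(*row)
```

With the 7KB limit this yields the three expected lines above; raising the limit to 10KB would fold everything into one chain, because the A1 to B1 gap is 8KB and the A3 to A4 gap is exactly 10KB.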


In reply to Merging intervals; Chaining intervals by onlyIDleft
