I have a tab-separated, two-column input file with clustering information, where column 1 contains the ID of the cluster representative and column 2 contains the ID of the cluster member.

Put differently, if there are 1000 elements clustered into 50 clusters, my input file will have 1000 lines, with the ID of the cluster member in column 2 and the ID of the cluster representative in column 1.

Therefore, the first line for each cluster will necessarily contain two identical columns, i.e. the cluster representative and the cluster member are identical.

If a cluster has more than one member, then in the next row(s) column 1 still contains the same cluster representative ID, but column 2 contains the ID of a different cluster member.

Please see example below:

Osat_a	Osat_a # just one cluster member
Atha_b	Atha_b # >1 cluster member, this & next line = 2 members
Atha_b	Mtru_c
Fves_d	Fves_d # this & next 2 lines = 3 cluster members
Fves_d	Osat_e
Fves_d	Atha_f
Atha_g	Atha_g # just 1 cluster member
Osat_h	Osat_h
Osat_h	Atha_i
Mtru_j	Mtru_j # just 1 cluster member

The input file is very large (~20GB), which is much more than my machine's RAM. I suppose one way to process such a large input is to break the file into pieces that can be held in RAM, right? The other way, which I am hoping to get help with here, is to process the input straight away while reading it from the file handle, without writing to some large hash or array that crashes my machine! Usually I save the input to a hash or array, so processing while reading in lines would be new to me, hence this request for help.

The output I need to generate from this input should be as follows:

Osat_a
Atha_b, Mtru_c
Fves_d, Osat_e, Atha_f
Atha_g
Osat_h, Atha_i
Mtru_j
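
To make the "process while reading" idea concrete, here is a minimal sketch of how the output above could be produced one line at a time. It is written in Python purely for illustration and is not a definitive implementation. The function name `stream_clusters` is hypothetical, and the sketch assumes that all rows of a cluster are contiguous in the file (as in the example), that the real file holds only the two tab-separated columns, and that the `#` annotations are not part of the data. Only the members of the cluster currently being read are kept in memory.

```python
import io


def stream_clusters(lines):
    """Yield one comma-joined line per cluster.

    Assumes rows belonging to the same cluster are adjacent, so only the
    current cluster's members are ever held in memory.
    """
    current_rep = None   # representative ID of the cluster being collected
    members = []         # member IDs seen so far for that cluster
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue     # skip blank or malformed lines
        rep, member = fields[0], fields[1]
        if rep != current_rep:
            # A new representative starts a new cluster: emit the old one.
            if members:
                yield ", ".join(members)
            current_rep, members = rep, []
        members.append(member)
    if members:
        yield ", ".join(members)  # emit the final cluster


# Small in-memory stand-in for the real file, using the first rows of the example.
sample = io.StringIO(
    "Osat_a\tOsat_a\n"
    "Atha_b\tAtha_b\n"
    "Atha_b\tMtru_c\n"
    "Fves_d\tFves_d\n"
    "Fves_d\tOsat_e\n"
    "Fves_d\tAtha_f\n"
)
for cluster in stream_clusters(sample):
    print(cluster)
```

For the real input, an open file handle can be passed in place of `sample`, since iterating over a file handle reads one line at a time. The same pattern carries over to Perl: loop with `while (my $line = <$fh>)`, remember the previous representative, and print the collected members whenever column 1 changes.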

Thanks, in advance, for your algorithm advice


In reply to Processing while reading in input by onlyIDleft
