I have a tab-separated, 2-column input file with clustering information, where column 1 contains the ID of the cluster representative and column 2 contains the ID of the cluster member.

Put differently, if there are 1000 elements clustered into 50 clusters, my input file will have 1000 lines, each with the ID of the cluster member in column 2 and the ID of its cluster representative in column 1.

Therefore, the 1st line for each cluster will necessarily contain 2 identical columns, i.e. the cluster representative and the cluster member are identical.

If a cluster has more than one member, then in the next row(s) column 1 still contains the same cluster representative ID, but column 2 contains the ID of a different cluster member.

Please see example below:

 
Osat_a    Osat_a # just one cluster member
Atha_b    Atha_b # >1 cluster member, this & next line = 2 members
Atha_b    Mtru_c 
Fves_d    Fves_d # this & next 2 lines = 3 cluster members
Fves_d    Osat_e
Fves_d    Atha_f
Atha_g    Atha_g # just 1 cluster member
Osat_h    Osat_h
Osat_h    Atha_i
Mtru_j    Mtru_j # just 1 cluster member

The input file is very large (~20 GB), which is much more than my machine's RAM. I suppose one way to process such a large input is to break the file into pieces that fit in RAM, right? The other way, which I am hoping to get help with here, is to process the input straight away while reading it from the file handle, without writing to some large hash or array that crashes my machine! Usually I save the input to a hash or array, so processing while reading in lines would be new to me, hence this request for help.

The output I need to generate from this input should be as follows:

Osat_a
Atha_b, Mtru_c
Fves_d, Osat_e, Atha_f
Atha_g
Osat_h, Atha_i
Mtru_j
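
For illustration, here is a minimal sketch of the read-and-process-as-you-go idea, written in Python (the same pattern should carry over to a Perl while loop over the file handle). It rests on two assumptions that the example suggests but the description does not guarantee: all rows of a cluster are adjacent in the file, and IDs contain no whitespace. The names `stream_clusters`, `main` and the `path` argument are hypothetical, not part of any existing tool.

```python
from itertools import groupby


def stream_clusters(lines):
    """Yield one comma-separated output line per cluster.

    Assumption: all rows belonging to a cluster are adjacent in the input,
    as in the example, so only one cluster is held in memory at a time.
    """
    # Split each line on whitespace; trailing annotations are ignored
    # because only the first two fields are used.
    rows = (line.split() for line in lines)
    # Skip blank or malformed lines that have fewer than two fields.
    pairs = ((fields[0], fields[1]) for fields in rows if len(fields) >= 2)
    # groupby collects consecutive rows that share the same representative ID.
    for _rep, group in groupby(pairs, key=lambda pair: pair[0]):
        yield ", ".join(member for _rep_id, member in group)


def main(path):
    # Iterating over a file object reads it line by line, so the whole
    # file is never loaded into memory at once.
    with open(path) as handle:
        for out_line in stream_clusters(handle):
            print(out_line)
```

Because the first row of each cluster lists the representative as its own member, the representative comes first on every output line without special handling. If the rows of a cluster were not adjacent, this approach would split that cluster across several output lines, and the file would need to be sorted on column 1 first.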

Thanks in advance for your algorithm advice.

In reply to Processing while reading in input by onlyIDleft
