Re: concatenating identical sequences

$new_guy,
How big are the files you need to work with? It should be fairly trivial to do this using a hash of arrays if everything fits in memory. Assume for a second that you had a function that could fetch the next record (multi-line or not) along with its ID. It would look something like this:

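# fetch_record() is assumed to exist: it returns a hash ref with
# 'id' and 'sequence' keys, or a false value at end of file.
my %data;
while (my $rec = fetch_record($fh)) {
    my $id = $rec->{id};
    # Group every sequence under its ID (hash of arrays)
    push @{$data{$id}}, $rec->{sequence};
}
# Print each ID, then every sequence collected for it, newline-terminated
for my $id (keys %data) {
    print "$id ";
    print "$_\n" for @{$data{$id}};
}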
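Since fetch_record() is left to the imagination above, here is a minimal sketch of what it might look like. It assumes the simplest possible layout, one record per line in the form "ID<whitespace>SEQUENCE", which may well not match your real data. The sample input and the joined output format are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical fetch_record(): assumes one record per line,
# formatted as "ID<whitespace>SEQUENCE". Adjust for your real format.
sub fetch_record {
    my ($fh) = @_;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        next unless $line =~ /\S/;    # skip blank lines
        my ($id, $sequence) = split ' ', $line, 2;
        return { id => $id, sequence => defined $sequence ? $sequence : '' };
    }
    return;    # end of file
}

# Made-up sample data, read through an in-memory filehandle
my $input = "seq1 ACGT\nseq2 TTGA\nseq1 GGCC\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

my %data;
while (my $rec = fetch_record($fh)) {
    push @{ $data{ $rec->{id} } }, $rec->{sequence};
}
close $fh;

# Join the pieces for each ID onto a single line
for my $id (sort keys %data) {
    print "$id ", join('', @{ $data{$id} }), "\n";
}
```

With a multi-line record format, only fetch_record() would need to change; the hash of arrays stays the same.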

Alternatively, if you can't afford to fit the entire file in memory, you could still use this technique by storing the file offset of each record rather than the actual sequence. This will require more I/O with tell and seek but should allow the same simplicity in the code.
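A rough sketch of the offset idea, again assuming one record per line as "ID<whitespace>SEQUENCE" (an assumption on my part). The sub name merge_by_offset and the sample data are hypothetical. Only the offsets are kept in the hash; the sequences are re-read from the file when it is time to print.

```perl
use strict;
use warnings;

# Hypothetical helper: assumes one record per line, "ID<whitespace>SEQUENCE".
sub merge_by_offset {
    my ($in, $out) = @_;

    # Pass 1: remember where each record starts, not what it contains.
    my %offsets;
    while (1) {
        my $pos  = tell $in;    # offset of the line about to be read
        my $line = <$in>;
        last unless defined $line;
        next unless $line =~ /\S/;
        my ($id) = split ' ', $line;
        push @{ $offsets{$id} }, $pos;
    }

    # Pass 2: seek back to each stored offset and write the pieces out.
    for my $id (sort keys %offsets) {
        print {$out} "$id ";
        for my $pos (@{ $offsets{$id} }) {
            seek $in, $pos, 0 or die "seek failed: $!";    # 0 = SEEK_SET
            my $line = <$in>;
            chomp $line;
            my (undef, $sequence) = split ' ', $line, 2;
            print {$out} $sequence;
        }
        print {$out} "\n";
    }
}

# Made-up sample data; in-memory handles stand in for real files
my $input = "seq1 ACGT\nseq2 TTGA\nseq1 GGCC\n";
open my $in, '<', \$input or die "Cannot open input: $!";
my $output = '';
open my $out, '>', \$output or die "Cannot open output: $!";

merge_by_offset($in, $out);
close $out;
close $in;
print $output;
```

The hash still grows with the number of records, so this helps most when the sequences are long relative to the IDs.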

One last alternative would be to rewrite the file, merging all the rows for a record onto one line. Next, sort the file so duplicate IDs are adjacent, and then it should be straightforward to merge them. Since it appears each row is fixed length, recreating the original structure from a single line should be just as simple.
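To illustrate the merge step, here is a sketch that assumes the rewrite has already happened, so every row is a single line of the form "ID<whitespace>SEQUENCE" (my assumption). The sub name merge_adjacent and the sample rows are hypothetical. The sort here is done in memory purely for the demo; for a file too large for that, an external sort would take its place, and it would need to keep rows with the same ID in their original order.

```perl
use strict;
use warnings;

# Hypothetical helper: reads rows already sorted by ID and merges
# adjacent rows that share an ID, holding only one row at a time.
sub merge_adjacent {
    my ($in, $out) = @_;
    my $current;    # ID of the run currently being written
    while (defined(my $line = <$in>)) {
        chomp $line;
        next unless $line =~ /\S/;
        my ($id, $sequence) = split ' ', $line, 2;
        if (!defined($current) || $id ne $current) {
            print {$out} "\n" if defined $current;    # finish previous run
            print {$out} "$id ";
            $current = $id;
        }
        print {$out} $sequence;
    }
    print {$out} "\n" if defined $current;
}

# Made-up rows, one per line, in arbitrary order
my @rows = ("seq2 TTGA", "seq1 ACGT", "seq1 GGCC");

# Sort by ID, breaking ties by original position so pieces stay in order
my @ids    = map { my ($id) = split ' ', $_; $id } @rows;
my @sorted = map { $rows[$_] }
             sort { $ids[$a] cmp $ids[$b] || $a <=> $b } 0 .. $#rows;

my $input = join("\n", @sorted) . "\n";
open my $in, '<', \$input or die "Cannot open input: $!";
my $output = '';
open my $out, '>', \$output or die "Cannot open output: $!";

merge_adjacent($in, $out);
close $out;
close $in;
print $output;
```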

Cheers - L~R
