in reply to Concatenating text for a hash problem

use strict;
use Data::Dumper;

my %hash = ();
my ($key, $val);
while (<DATA>) {
    chomp;
    if (/^>(\w+)/) {
        if ($key) { $hash{$key} = $val };
        $key = $1;
        $val = "";
    }
    else {
        $val .= $_;
    }
}
$hash{$key} = $val;
print Dumper \%hash;

__DATA__
>EP11110 (-)
TGCAATCACTAGCAAGCTCTC
GCTGCCGTCACTAGCCTGTGG
>EP40005 (+)
GGGGCTAGGGTTAGTTCTGGA
NNNNNNNNNNNNNNNNNNNNN

Re^2: Concatenating text for a hash problem
by bobf (Monsignor) on Oct 15, 2004 at 06:04 UTC

    We came up with nearly the exact same solution. I believe you can eliminate $val, however, by operating directly on the hash value.

    use strict;
    use warnings;

    my %hash;
    my $id;
    while( my $line = <DATA> ) {
        chomp $line;
        # if( $line =~ m/^>(.+?) / )   # changed to \S
        if( $line =~ m/^>(\S+)/ ) {
            $id = $1;
        }
        else {
            $hash{$id} .= $line;
        }
    }
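For anyone who wants to try the "append directly to the hash value" idea outside a __DATA__ block, here is a minimal sketch. The helper name parse_fasta and the in-memory filehandle are illustrative assumptions, not part of the posts above; the guard against sequence lines that appear before any header is also an addition.

```perl
use strict;
use warnings;

# parse_fasta is a hypothetical helper name used only for this sketch.
sub parse_fasta {
    my ($fh) = @_;
    my (%seq, $id);
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*$/;              # skip blank lines
        if ( $line =~ m/^>(\S+)/ ) {
            $id = $1;
            $seq{$id} = '' unless exists $seq{$id};
        }
        elsif ( defined $id ) {
            $seq{$id} .= $line;                # append directly to the hash value
        }
        else {
            die "Sequence data before first header at line $.\n";
        }
    }
    return \%seq;
}

# Usage with an in-memory filehandle holding the sample data from the thread.
my $text = ">EP11110 (-)\nTGCAATCACTAGCAAGCTCTC\nGCTGCCGTCACTAGCCTGTGG\n"
         . ">EP40005 (+)\nGGGGCTAGGGTTAGTTCTGGA\nNNNNNNNNNNNNNNNNNNNNN\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
my $seq = parse_fasta($fh);
print "$_ => $seq->{$_}\n" for sort keys %$seq;
```

Dying on orphaned sequence lines is one possible design choice; silently skipping them would work too, depending on how much you trust the input.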

    ewijaya, if you are trying to read sequence files in fasta format, you might want to look at bioperl's Bio::SeqIO class.

    Update: ihb made a good point about the regex. I had assumed (based on the example data) that a space will always follow the sequence ID, but that may not always be true. ihb's regex is therefore a bit safer (although you should keep in mind that \w will not match '.' or '-', so \S is probably better).
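To make the difference between the three patterns concrete, here is a small sketch. The helper name extract_ids and the second and third header strings are made up for illustration; only the first header form appears in the original data.

```perl
use strict;
use warnings;

# extract_ids is a hypothetical helper; it applies each candidate
# pattern from the thread to one header line.
sub extract_ids {
    my ($header) = @_;
    my ($lazy)  = $header =~ m/^>(.+?) /;   # needs a space after the ID
    my ($word)  = $header =~ m/^>(\w+)/;    # stops at '.' or '-'
    my ($nonsp) = $header =~ m/^>(\S+)/;    # runs up to the first whitespace
    return { lazy => $lazy, word => $word, nonspace => $nonsp };
}

# Invented headers: one without a trailing space, one with '-' and '.' in the ID.
for my $header ('>EP11110 (-)', '>EP40007', '>EP-40.5 (+)') {
    my $ids = extract_ids($header);
    my @shown = map { defined $ids->{$_} ? "$_=$ids->{$_}" : "$_=(no match)" }
                qw(lazy word nonspace);
    print "$header: @shown\n";
}
```

On the header with no trailing space, the lazy pattern fails to match at all; on the ID containing '-' and '.', \w+ captures only the part before the first such character.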

      Your regex change may be a bit too clever. We don't know for sure that the space will always be there. Looking at the OP, it's fair to assume that there will always be something, though. Perhaps the next line is just ">EP40007". A better way to express what you want without being as restrictive is /^>(\S+)/, which does what you want: it captures the leading run of non-space characters. Personally, I'd probably do the check in two steps: one to see if there's a '>' there (assuming that /^>/ means a header line), and the next to see if the rest of the line holds a valid format. I habitually verify foreign input.
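A minimal sketch of such a two-step check follows. The helper name header_id and the "two capital letters plus digits" ID format are assumptions read off the sample data, not something the thread specifies; adjust the second pattern to whatever your real IDs look like.

```perl
use strict;
use warnings;

# header_id is a hypothetical helper. It returns the ID for a valid header,
# undef (in scalar context) for a non-header line, and dies on a header
# whose ID does not fit the assumed format.
sub header_id {
    my ($line) = @_;
    return unless $line =~ /^>/;              # step 1: is this a header at all?
    $line =~ /^>([A-Z]{2}\d+)(?:\s|$)/        # step 2: assumed ID format
        or die "Malformed header: $line\n";
    return $1;
}

# Note on quantifier placement: with the + outside the parentheses,
# the group keeps only its final repetition.
my ($last_char) = '>EP40007' =~ /^>(\S)+/;    # captures '7'
my ($whole_id)  = '>EP40007' =~ /^>(\S+)/;    # captures 'EP40007'

my $id = header_id('>EP40007');
print "id=$id last_char=$last_char whole_id=$whole_id\n";
```

Separating the two steps means a line that starts with '>' but carries a malformed ID is reported loudly instead of being appended to the previous sequence.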

      ihb

      Read argumentation in its context!