monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have this data,
__DATA__
>EP11110 (-)
TGCAATCACTAGCAAGCTCTC
GCTGCCGTCACTAGCCTGTGG
>EP40005 (+)
GGGGCTAGGGTTAGTTCTGGA
NNNNNNNNNNNNNNNNNNNNN

and I would like to create a hash based on the '>' header text as key. The desired hash is:
$VAR = {
    'EP11110' => 'TGCAATCACTAGCAAGCTCTCGCTGCCGTCACTAGCCTGTGG',
    'EP40005' => 'GGGGCTAGGGTTAGTTCTGGANNNNNNNNNNNNNNNNNNNNN',
};
The following code fails to give me what I want.
#!/usr/bin/perl -w
use strict;
use Data::Dumper;

my %hash=();
my $str;
while (<DATA>) {
    chomp;
    /^>(\w+)/;
    $hash{$1} = $str;
    $str .= $_;
    #next;
}
print Dumper \%hash;

Can anybody suggest where I went wrong?
Regards,
Edward

Replies are listed 'Best First'.
Re: Concatenating text for a hash problem
by ikegami (Patriarch) on Oct 15, 2004 at 06:54 UTC

    And now, for a totally different solution!

    while (<DATA>) {
        my ($key) = /^>?(\S+)/;
        local $/ = '>';
        chomp(my $val = <DATA>);
        $val =~ s/\n//g;
        $hash{$key} = $val;
    }

    Redefining $/ is fun!
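
For anyone puzzled by why the snippet above works, here is a minimal self-contained sketch of the same idea. The sub name parse_records and the in-memory filehandle are my own assumptions for illustration, not part of the original post. The point it tries to show is that `local $/` only lasts until the end of the loop body, so the header is read line by line while the sequence is read up to the next '>', and that chomp removes whatever $/ currently is.

```perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA-like text held in a string into a hash.
sub parse_records {
    my ($text) = @_;
    my %hash;

    # In-memory filehandle, so the sketch needs no external file.
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

    while ( my $header = <$fh> ) {       # $/ is "\n" here: reads one header line
        my ($key) = $header =~ /^>?(\S+)/;
        next unless defined $key;        # skip blank lines

        local $/ = '>';                  # in effect only until this loop body ends
        my $val = <$fh>;                 # everything up to and including the next '>'
        $val = '' unless defined $val;   # header with nothing after it
        chomp $val;                      # chomp strips the current $/, i.e. '>'
        $val =~ s/\n//g;                 # join the sequence lines
        $hash{$key} = $val;
    }
    close $fh;
    return %hash;
}

my %h = parse_records(
    ">EP11110 (-)\nTGCAATCACTAGCAAGCTCTC\nGCTGCCGTCACTAGCCTGTGG\n"
  . ">EP40005 (+)\nGGGGCTAGGGTTAGTTCTGGA\nNNNNNNNNNNNNNNNNNNNNN\n"
);
print "$_ => $h{$_}\n" for sort keys %h;
```

Because the '>' of the second header is consumed by the sequence read, the next header arrives without it, which is why the key regex makes the '>' optional.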

      I have to wake up earlier =). I came up with a similar solution that also redefines $/, but this time using split. Here it is:

      #!/usr/local/bin/perl
      use strict;
      use warnings;
      use Data::Dumper;

      my %hash;
      local $/ = ">";
      <DATA>; #Skip first ">"
      while(<DATA>){
          my ($key,$value) = split /\n/, $_ , 2;
          $key =~ s/(\S+).*?$/$1/;
          $value =~ s/\n\>?//g;
          $hash{$key} = $value;
      }
      print Dumper \%hash;
      __DATA__
      >EP11110 (-)
      TGCAATCACTAGCAAGCTCTC
      GCTGCCGTCACTAGCCTGTGG
      >EP40005 (+)
      GGGGCTAGGGTTAGTTCTGGA
      NNNNNNNNNNNNNNNNNNNNN
      __OUTPUT__
      $VAR1 = {
          'EP40005' => 'GGGGCTAGGGTTAGTTCTGGANNNNNNNNNNNNNNNNNNNNN',
          'EP11110' => 'TGCAATCACTAGCAAGCTCTCGCTGCCGTCACTAGCCTGTGG'
      };

      Regards,
      deibyz

Re: Concatenating text for a hash problem
by pg (Canon) on Oct 15, 2004 at 05:53 UTC
    use strict;
    use Data::Dumper;

    my %hash=();
    my ($key, $val);
    while (<DATA>) {
        chomp;
        if (/^>(\w+)/) {
            if ($key) {$hash{$key} = $val};
            $key = $1;
            $val = "";
        }
        else {
            $val .= $_;
        }
    }
    $hash{$key} = $val;
    print Dumper \%hash;
    __DATA__
    >EP11110 (-)
    TGCAATCACTAGCAAGCTCTC
    GCTGCCGTCACTAGCCTGTGG
    >EP40005 (+)
    GGGGCTAGGGTTAGTTCTGGA
    NNNNNNNNNNNNNNNNNNNNN

      We came up with nearly the exact same solution. I believe you can eliminate $val, however, by operating directly on the hash value.

      use strict;
      use warnings;

      my %hash;
      my $id;
      while( my $line = <DATA> ) {
          chomp $line;
          # if( $line =~ m/^>(.+?) / ) # changed to \S
          if( $line =~ m/^>(\S+)/ ) {
              $id = $1;
          }
          else {
              $hash{$id} .= $line;
          }
      }

      ewijaya, if you are trying to read sequence files in FASTA format, you might want to look at BioPerl's Bio::SeqIO class.

      Update: ihb made a good point about the regex. I made the assumption (based on the example data) that a space will always follow the sequence ID, but that may not always be true. Therefore, ihb's regex is a bit safer for that reason (although you should keep in mind that \w will not match '.' or '-', so \S is probably better).

        Your regex change may be a bit too clever. We don't know for sure that the space will always be there. Looking at the OP, it's fair to assume that there will always be something, though. Perhaps the next line is just ">EP40007". A better way to express what you want, without being as restrictive, is /^>(\S+)/, which does what you want: it gets the leading non-space characters. Personally I'd probably do the check in two steps: one to see if there's a '>' there (assuming that /^>/ means a header line), and the next to see if the rest of the line holds a valid format. I habitually verify foreign input.

        ihb

        Read argumentation in its context!
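
The two-step check described above could be sketched roughly as follows. The sub name parse_header is hypothetical, and the "valid format" pattern (an ID of non-space characters, optionally followed by a strand marker such as "(-)" or "(+)") is only an assumption inferred from the two sample headers in the question, not a rule of the file format.

```perl
use strict;
use warnings;

# Hypothetical helper. Returns an empty list for non-header lines,
# ($id, $strand) for well-formed headers, and dies on malformed headers.
sub parse_header {
    my ($line) = @_;

    # Step 1: is this a header line at all?
    return unless $line =~ /^>/;

    # Step 2: does the rest of the line have the expected shape?
    # (Assumed shape: ID, then an optional "(-)" or "(+)".)
    my ($id, $strand) = $line =~ /^>(\S+)(?:\s+\(([-+])\))?\s*$/
        or die "Malformed header line: $line\n";

    return ($id, $strand);
}

for my $line ('>EP11110 (-)', '>EP40007', 'TGCAATCACTAGCAAGCTCTC') {
    my ($id, $strand) = parse_header($line);
    if (defined $id) {
        my $shown = defined $strand ? $strand : 'none';
        print "header: id=$id strand=$shown\n";
    }
    else {
        print "data:   $line\n";
    }
}
```

Splitting the test this way means a line that starts with '>' but does not look like a header is reported loudly instead of being silently treated as sequence data.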

Re: Concatenating text for a hash problem
by ihb (Deacon) on Oct 15, 2004 at 06:05 UTC

    A slightly different take than pg's, using references:

    my %hash;
    my $buf;
    local $_;
    while (<$fh>) {
        chomp;
        if (my ($key) = /^>(\w+)/) {
            $hash{$key} = '';
            $buf = \$hash{$key};
        }
        else {
            $$buf .= $_;
        }
    }
    The key difference is that you assign the value before you've filled it up rather than using a temp variable and assigning it after. This means you only have to do assignment to the hash once (getting rid of that last special assignment outside the loop), and you don't need to copy the (potentially large) value. There's no real reason for using a reference to the value rather than keeping the key around and using $hash{$key} instead of $$buf - it just came out that way and it feels nice because I've isolated the hash business to one place. It well expresses how I thought about the problem. The loop isn't about the hash really, it's about buffering up data spread over different lines. The data could just as well have been stored in an array.
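
To illustrate the last remark (that the data could just as well live in an array), here is a minimal sketch of the same buffering loop filling an array of records instead of a hash. The sub name read_records, the record layout with id and seq fields, and the choice to skip lines that appear before the first header are all my assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: same buffering idea, but records are pushed onto an
# array (which keeps input order) rather than stored in a hash.
sub read_records {
    my ($fh) = @_;
    my @records;
    my $buf;    # reference to the sequence string currently being filled

    while ( my $line = <$fh> ) {
        chomp $line;
        if ( my ($id) = $line =~ /^>(\S+)/ ) {
            push @records, { id => $id, seq => '' };
            $buf = \$records[-1]{seq};    # point the buffer at the new record
        }
        elsif ($buf) {
            $$buf .= $line;               # append in place, no temporary copy
        }
        # Lines before the first header are skipped (an assumption).
    }
    return @records;
}

my $data = ">EP11110 (-)\nTGCAATCACTAGCAAGCTCTC\nGCTGCCGTCACTAGCCTGTGG\n"
         . ">EP40005 (+)\nGGGGCTAGGGTTAGTTCTGGA\nNNNNNNNNNNNNNNNNNNNNN\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
print "$_->{id} => $_->{seq}\n" for read_records($fh);
```

Only the two lines that create the record and take the reference know where the data is stored; the appending branch is identical to the hash version.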

    ihb

    Read argumentation in its context!

Re: Concatenating text for a hash problem
by TedPride (Priest) on Oct 15, 2004 at 06:31 UTC
    Your code has a number of problems. First of all, $str is never set back to nothing, so it eventually contains all data lines - not just the current section of data. Next, you seem to be assigning $str contents to $hash{$1} before actually putting anything into $str. Did you want to assign a reference instead? That would be $hash{$1} = \$str, and for that to work properly, $str would have to be a variable created fresh for each assignment to a new $hash{$1} - probably by doing an if statement with my $str inside it.

    Basically, the code needs to be rewritten somewhat. Here's my version for what you're trying to do:

    use strict;
    use warnings;

    my %hash;
    while (<DATA>) {
        if (/>([\w]+) /) {
        }
        else {
            chomp;
            $hash{$1} .= $_;
        }
    }
    for (sort keys %hash) {
        print "$_ => $hash{$_}\n";
    }
    __DATA__
    >EP11110 (-)
    TGCAATCACTAGCAAGCTCTC
    GCTGCCGTCACTAGCCTGTGG
    >EP40005 (+)
    GGGGCTAGGGTTAGTTCTGGA
    NNNNNNNNNNNNNNNNNNNNN
Re: Concatenating text for a hash problem
by pizza_milkshake (Monk) on Oct 15, 2004 at 05:56 UTC
    sniff sniff, do i smell homework?

    your first problem is that you don't test to see if the regex matches, even though you want to treat lines differently (> lines become hash keys, lines after them become their data). the next two lines make no sense: you append every $_ to $str but never clear it. at the end of the loop it will contain a concatenated copy of every line of data post-chomp.

    i'm not going to post working code because this is very basic logic; i'm guessing you were given an assignment to do this and got frustrated. throw an if (/regex/) { this is a key } else { this is a data line } and you'll start going somewhere

    perl -e"\$_=qq/nwdd\x7F^n\x7Flm{{llql0}qs\x14/;s/./chr(ord$&^30)/ge;print"

Re: Concatenating text for a hash problem
by Roger (Parson) on Oct 15, 2004 at 12:39 UTC
    Just another method -

    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    local $/ = '>';
    while (<DATA>) {
        next if $_ eq '>';
        my ($t, $v) = $_ =~ /^(\w+).*?\n([^>]*)/gs;
        $v =~ s/\n//g;
        print "$t, $v\n";
    }
    __DATA__
    >EP11110 (-)
    TGCAATCACTAGCAAGCTCTC
    GCTGCCGTCACTAGCCTGTGG
    >EP40005 (+)
    GGGGCTAGGGTTAGTTCTGGA
    NNNNNNNNNNNNNNNNNNNNN
      You neither need nor support /g on the first regexp. You don't need the /s either, but /s is forgivable.
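
As a rough illustration of that remark, here is a sketch of the same chunk-per-'>' approach with the regex written without /g or /s, and with the result stored in a hash as the original question asked. The sub name parse_chunks and the in-memory filehandle are assumptions of mine, not part of Roger's post.

```perl
use strict;
use warnings;

# Hypothetical helper: read '>'-terminated chunks and build the hash.
sub parse_chunks {
    my ($fh) = @_;
    my %hash;
    local $/ = '>';    # each read now ends at the next '>'

    while ( my $chunk = <$fh> ) {
        chomp $chunk;  # removes the trailing '>' if there is one

        # No /g: one match per chunk is enough.
        # No /s: '.' only has to cover the rest of the header line,
        # and the negated class [^>] matches newlines anyway.
        my ($t, $v) = $chunk =~ /^(\w+).*\n([^>]*)/ or next;

        $v =~ tr/\n//d;    # join the sequence lines
        $hash{$t} = $v;
    }
    return %hash;
}

my $data = ">EP11110 (-)\nTGCAATCACTAGCAAGCTCTC\nGCTGCCGTCACTAGCCTGTGG\n"
         . ">EP40005 (+)\nGGGGCTAGGGTTAGTTCTGGA\nNNNNNNNNNNNNNNNNNNNNN\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my %h = parse_chunks($fh);
print "$_ => $h{$_}\n" for sort keys %h;
```

The very first read returns just the leading '>', which is empty after chomp and fails the match, so the `or next` skips it without a separate test.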