in reply to Re^4: Compare hash with arrays and print
in thread Compare hash with arrays and print

I want to write the complete header line (starting with '>') in my output files.  Can I use the split function ...?

Sure you can use the split function, but if you just want to print the header line as is (i.e. copy it from the input), you wouldn't need to split up the record. As you have it, the header would be printed already without further ado.  Think of it this way: $_ holds an entire record, with the leading '>' removed (to more easily handle the edge cases that result from the way the input is being split by $/).  For example, for the first record, $_ would hold the string (including the newlines)

"aw1.a1 bhi|tn|56564 pairs:40098 ATGCTAGATGCTAGCTAGCTAGCACTGAT CGATGCTAGCGTAGTCAGCTGATGCTGTA CGATGCTAGTCGTACG "

You can do with it whatever you like before you print it out, e.g. take it apart using split or via regex captures, perform regex substitutions on it, etc.


Some more notes: The key for the hash is extracted via regex capture

my ($name) = /^(\w+)/;

which would extract "aw1" in this case, because \w+ stops matching at the dot. In case you'd need to extract keys as "aw1.a1" (or some such — I'm no fasta expert), you could modify the regex to also capture the dots

my ($name) = /^([\w.]+)/;

or up until the first whitespace char in the line

my ($name) = /^(\S+)/;

Or in case you'd want to print the headers only (which I'm not quite sure from your description), you could extract it similarly with a regex

my ($header) = /^([^\n]+)/;

or by splitting on newlines

my ($header) = split /\n/;

And so on...
Hope this gives you some starting points to tailor it to your specific requirements.

Replies are listed 'Best First'.
Re^6: Compare hash with arrays and print
by ad23 (Acolyte) on Jul 13, 2010 at 17:56 UTC

    Thanks again almut!

    I really appreciate your inputs.

    I am trying to manipulate $_ to obtain the desired results. But when I split it with a newline the data-lines also split up. In this case, if a record holds more than one sequence data, the remaining records are not getting printed!

    while (<FASTA>) { s/^>//mg; # print "$_"; my ($name) = /^([^\n]+)/; #my ($name) = /^([\w.]+)/; #print "$name\n"; if($hash{$name} == 10){ select FILE1; } if($hash{$name} == 20){ select FILE2; } if($hash{$name} == 30){ select FILE3; } #print ">$_"; my $output = $_; my (@title) = split(/\n/,$output); print ">".$title[0]."\n".$title[1]."\n"; }

    I am splitting it with a new line character, in order to print the complete header line in my output files (and I am going wrong here).

    Please suggest?

    Thanks!

      when I split it with a newline the data-lines also split up.

      Instead of splitting on all newlines (as split would do by default)

      my (@title) = split(/\n/,$output);

      you could tell split to split into two parts only:

      my ($title, $data) = split /\n/, $output, 2;

      In this case, $data would hold all the remaining lines of sequence data.

      That said, I'm not really sure why you want to split the record in the first place, if you then proceed to print out the parts with the same newline added in between them :)  I.e.

      ... print ">".$title."\n".$data."\n";

      would result in the same output as

      print ">$output\n";

        Thanks once again for your input!

        When I try running my script with the following test set , it does not print anything in the output files, instead prints the data onto the screen!

        >001.b1 gnl|ti|10009 GCTAGTGCTAGCTAGCTAGCATCGATCGAT >002.b1 gnl|ti|10010 CAGTCAGTCGTAGTGCTAGCTGATGCTCGT >003.b1 gnl|ti|10011 CGATCGTAGTCGTATCGATGCTGACGTAGG >004.b1 gnl|ti|10012 CGATGCTAGTCGTAGTGCTAGTGCTATGTC >005.b1 gnl|ti|10013 CAGTCGTAGTCGATGCTGTATCATAGCGTA >006.b1 gnl|ti|10014 ACAGTGCTAGCTGATCGTAGCTGAGCGGAG >007.b1 gnl|ti|10015 AGCTAGCTGATGTCGATGCTGATCGTGATG >008.b1 gnl|ti|10016 CGATGCTGATGCTGATGCTGTAGCTATACG >008.g1 gnl|ti|10017 GCTAGCTAGTCGTAGTCGTAGTGTCGTAGG >009.b1 gnl|ti|10018 CGATGCTAGTCGTAGTCGTAGCTGATGCGT >010.b1 gnl|ti|10019 CGATCGTAGTCGTAGCTGATGCTGTAGCTG >010.g1 gnl|ti|10020 CGTAGCTGATCGTAGCGTGACTGTAGCTGG >011.b1 gnl|ti|10021 CGTAGCTGATGCTGATCGTAGCTAGTCGAT >011.g1 gnl|ti|10022 CAGCTGATCGTAGCTGATGCTGATGTGTGT

        I thought maybe splitting the header and data might help. But as it turns out, I am not going the right way.

        The output I get from the script is:

        >001.b1 gnl|ti|10009 GCTAGTGCTAGCTAGCTAGCATCGATCGAT >002.b1 gnl|ti|10010 CAGTCAGTCGTAGTGCTAGCTGATGCTCGT >003.b1 gnl|ti|10011 CGATCGTAGTCGTATCGATGCTGACGTAGG >004.b1 gnl|ti|10012 CGATGCTAGTCGTAGTGCTAGTGCTATGTC >005.b1 gnl|ti|10013 CAGTCGTAGTCGATGCTGTATCATAGCGTA >006.b1 gnl|ti|10014 ACAGTGCTAGCTGATCGTAGCTGAGCGGAG >007.b1 gnl|ti|10015 AGCTAGCTGATGTCGATGCTGATCGTGATG >008.b1 gnl|ti|10016 CGATGCTGATGCTGATGCTGTAGCTATACG >008.g1 gnl|ti|10017 GCTAGCTAGTCGTAGTCGTAGTGTCGTAGG >009.b1 gnl|ti|10018 CGATGCTAGTCGTAGTCGTAGCTGATGCGT >010.b1 gnl|ti|10019 CGATCGTAGTCGTAGCTGATGCTGTAGCTG >010.g1 gnl|ti|10020 CGTAGCTGATCGTAGCGTGACTGTAGCTGG >011.b1 gnl|ti|10021 CGTAGCTGATGCTGATCGTAGCTAGTCGAT >011.g1 gnl|ti|10022 CAGCTGATCGTAGCTGATGCTGATGTGTGT

        My script for the same is:

        ...... s/^>//mg; my ($name) = /^([^\n]+)/; if($hash{$name} == 10){ select FILE1; } if($hash{$name} == 20){ select FILE2; } if($hash{$name} == 30){ select FILE3; } } #print ">$_"; my $output = $_; print ">$output\n";

        Thanks again!