andyBio has asked for the wisdom of the Perl Monks concerning the following question:

Hello, wise ones! I am trying to read in a file with content like this:

TATTATGAGAATAGTGTGCATTTT	3
ATAGAGCAAAAGGGCAAATGCTGA	6
TACGAGTAGGATATCGATCTGGTGG	2
ATCCCCGGCATCTCCGCCA	1
TGAGAATAGTGTGCATTT	52
CGCATTACATTTGGAGCC	1
ACTCCAGGCAGCGTAGAGTT	1
ATCAACGTTGCTGCATCGG	1

It is a tab-delimited file. I want to manipulate the file to have it look like this:

>dme0_count=3
TATTATGAGAATAGTGTGCATTTT
>dme1_count=6
ATAGAGCAAAAGGGCAAATGCTGA
>dme2_count=52
TGAGAATAGTGTGCATTT

(i) Sequences (column 0) with length less than 15 or greater than 30 are omitted.
(ii) Sequences with counts (column 1) < 2 are omitted.
(iii) The print statement uses an argument and the count for the header line.

Here is my code:

#!/usr/bin/perl -w
use strict;
use warnings;

my $species=$ARGV[0];
my $input=$ARGV[1];
my @fields;
my $n = 0;

open my $tabdata, '<', "$input" or die ("Can't open $input\n");
while (my $line = <$tabdata>) {
    foreach my $line ($tabdata){
        my @fields = split("\t",$line);
        if(($fields[1] > 2) && (length($fields[0]) > 14 && length($fields[0]) < 31))
        {print ">$species" . $n++ . "_count=$fields[1]\n$fields[0]\n"};
    }
}
close ($tabdata);

Here is the error:

a> line 1.
Use of uninitialized value $fields[1] in numeric gt (>) at tab.pl line 19, <$tabdata> line 2.
Use of uninitialized value $fields[1] in numeric gt (>) at tab.pl line 19, <$tabdata> line 3.
.
.
.
Use of uninitialized value $fields[1] in numeric gt (>) at tab.pl line 19, <$tabdata> line 21.

I'd appreciate some assistance on how to get this to work! Thank you.

Re: Manipulating tab delimited file
by davido (Cardinal) on Apr 29, 2016 at 02:29 UTC

    (Update: added $species detection from @ARGV)

    Here's the code...

    my ($c, $species) = (0, shift());
    while(<>) {
        chomp;
        next unless length;
        my ($seq, $scount) = split /\s+/;
        next if $scount < 2 || length $seq < 15 || length $seq > 30;
        print ">$species" . $c++ . "_count=$scount\n$seq\n";
    }

    If the script is named process-sequences, then it could be invoked as:

    process-sequences speciesname inputfilename >outputfilename
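
A quick way to check the filter is to run it against the sample lines from the question. The sketch below wraps the same test in a hypothetical helper, format_records (not part of the script above), and assumes the species prefix is dme. Note that a cutoff of $scount < 2 keeps the count-2 sequence, which the question's expected output leaves out; the question's own code tests for > 2 and would drop it.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): applies the same
# filter as the loop above to a list of input lines and returns the
# FASTA-style records as strings.
sub format_records {
    my ($species, @lines) = @_;
    my $c = 0;
    my @out;
    for my $line (@lines) {
        chomp $line;
        next unless length $line;
        my ($seq, $scount) = split /\s+/, $line;
        next if $scount < 2 || length($seq) < 15 || length($seq) > 30;
        push @out, ">$species" . $c++ . "_count=$scount\n$seq\n";
    }
    return @out;
}

# Sample data copied from the question (tab-separated).
my @sample = (
    "TATTATGAGAATAGTGTGCATTTT\t3\n",
    "ATAGAGCAAAAGGGCAAATGCTGA\t6\n",
    "TACGAGTAGGATATCGATCTGGTGG\t2\n",
    "ATCCCCGGCATCTCCGCCA\t1\n",
    "TGAGAATAGTGTGCATTT\t52\n",
    "CGCATTACATTTGGAGCC\t1\n",
    "ACTCCAGGCAGCGTAGAGTT\t1\n",
    "ATCAACGTTGCTGCATCGG\t1\n",
);

# 'dme' is an assumed species prefix, taken from the desired output.
print format_records('dme', @sample);
```

Changing the first test to $scount <= 2 would reproduce the three records shown in the question.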

    It's too bad the filenames aren't named for the species they represent, because then you could do something like this:

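    my ($c, $species) = (0, $ARGV[0]);
    # "or" rather than "||": "||" would bind to the filename string, so die could never fire
    open my $outfh, '>', $species . ".new" or die $!;
    while(<>) {
        chomp;
        if(length) {
            my ($seq, $scount) = split /\s+/;
            if($scount >= 2 && length $seq >= 15 && length $seq <= 30) {
                print $outfh ">$species" . $c++ . "_count=$scount\n$seq\n";
            }
        }
        if (eof) {    # bare eof: end of the current file; eof() is true only after the last file
            close $outfh or die $!;
            last unless @ARGV;    # no more input files
            ($species, $c) = ($ARGV[0], 0);    # <> has already shifted the current name off @ARGV
            open $outfh, '>', $species . ".new" or die $!;
        }
    }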

    And that would be invoked with a list of filenames, each named for the species:

    process-sequences bird bee cat dog

    Dave

      Thanks for the help, Dave! I appreciate your wisdom!
Re: Manipulating tab delimited file
by kevbot (Vicar) on Apr 29, 2016 at 02:44 UTC

    The while loop will iterate through all the lines of your input file, so the foreach loop is unnecessary. Also, a chomp is needed in order to avoid extraneous newlines in your output. This edited version of your code seems to work.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $species=$ARGV[0];
    my $input=$ARGV[1];
    my @fields;
    my $n = 0;

    open my $tabdata, '<', "$input" or die ("Can't open $input\n");
    while (my $line = <$tabdata>) {
        chomp $line;
        my @fields = split("\t",$line);
        if(($fields[1] > 2) && (length($fields[0]) > 14 && length($fields[0]) < 31)) {
            print ">$species" . $n++ . "_count=$fields[1]\n$fields[0]\n";
        }
    }
    close ($tabdata);
    exit;

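
To see where the original warnings came from: foreach my $line ($tabdata) loops over a one-element list holding the filehandle itself, so the inner $line is a GLOB reference rather than a line of text, and split never produces a second field. The following sketch uses an in-memory filehandle and a made-up helper name, fields_from_handle, to show what split receives in that case.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: returns the fields that split
# produces when the loop variable is the filehandle instead of a line.
sub fields_from_handle {
    my ($fh) = @_;
    my @fields;
    foreach my $line ($fh) {            # one iteration; $line is the handle
        @fields = split "\t", $line;    # the handle stringifies to GLOB(0x...)
    }
    return @fields;
}

# In-memory file standing in for the real input.
my $data = "TATTATGAGAATAGTGTGCATTTT\t3\n";
open my $tabdata, '<', \$data or die "Can't open in-memory file: $!";

my @fields = fields_from_handle($tabdata);
print scalar(@fields), " field(s): $fields[0]\n";
close $tabdata;
```

Because only one field comes back, $fields[1] is undefined, which matches the "uninitialized value" warning in the question.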
      Hello, Monks! Please, one more question. I didn't think I should start a new thread for this.
      while( defined(my $head    = <FASTQIN>) &&
             defined(my $seq     = <FASTQIN>) &&
             defined(my $qhead   = <FASTQIN>) &&
             defined(my $quality = <FASTQIN>) ){
          substr($head, 0, 1, '>');
          my $temp = print $head, $seq;
      }
      close (FASTQIN);

      my %count_seq;
      open my $fh, '<:encoding(UTF-8)', $temp or die "Cannot open $temp $!";
      while (<$fh>) {
          chomp;
          next if /^>/;
          next if length($_) > 30 or length($_) < 15;
          $count_seq{$_}++;
      }

      This is a section of code I am trying to write. Is there a way to make the variable my $temp, which is set in the while loop, also available in the open statement? The code works OK up to close (FASTQIN);. The error reads:
      Use of uninitialized value $temp in open at test.pl line 44.
      Use of uninitialized value $temp in concatenation (.) or string at test.pl line 44.
      Cannot open No such file or directory at test.pl line 44.

      Line 44 is this:
      open my $fh, '<:encoding(UTF-8)', $temp or die "Cannot open $temp $!";
      Kindly advise, monks! Thanks!
        We don't know what your code looks like now, so we don't know where and how you're using the $temp variable, but if it was declared within the while loop, then it no longer exists once the while loop is completed.

        So you could just declare a new my $temp; variable there (although, since you ask the question, these things are probably not entirely clear to you, you might as well use another variable name for clarity).

        Hello andyBio,

        It looks as though your first while loop constructs filename strings and prints them out; then, immediately before the second while loop, you want to open the last constructed filename for reading. If this is what you are doing, there are three problems:

        (1) As Laurent_R says, $temp is out of scope by the time you try to use it in the open statement. To fix this, you need to give it a wider scope:

        ...
        my $temp;    # declare $temp before the loop
        while (...)
        {
            ...
        }
        close FASTQIN;
        my %count_seq;
        open my $fh, ...    # $temp is still in scope
        ...

        (2) The statement my $temp = print $head, $seq; doesn’t do what you want it to, because print returns a boolean value indicating whether the print operation was successful or not. You need something like this:

        $temp = $head . $seq;
        print $temp;
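
As a small illustration of point (2), the sketch below captures what print returns and builds the string separately. The record contents are made up, and the output goes to an in-memory filehandle only so that the example is self-contained.

```perl
use strict;
use warnings;

# Made-up record, standing in for two lines read from the FASTQ file.
my ($head, $seq) = (">read1\n", "TATTATGAGAATAGTGTGCATTTT\n");

# Print into an in-memory buffer so the result can be inspected.
my $buffer = '';
open my $out, '>', \$buffer or die "Can't open in-memory file: $!";

my $status = print {$out} $head, $seq;    # a true value on success, not the text
my $record = $head . $seq;                # the text itself comes from concatenation

close $out or die "close failed: $!";

print "print returned: $status\n";
print "record is:\n$record";
```

So a variable assigned from print holds only a success flag, which is why it cannot serve as a filename or as data later on.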

        (3) The first while loop’s condition may never be true, in which case $temp will never be initialized; so you should test for this possibility:

        close FASTQIN;
        my %count_seq;
        if (defined $temp)
        {
            open my $fh, '<:encoding(UTF-8)', $temp or die "Cannot open $temp $!";
            while (<$fh>)
            {
                ...
            }
        }
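
If the goal is only to count the sequences, one alternative (an assumption about the intent, since the full script is not shown) is to tally them while reading the FASTQ records, so that nothing has to be reopened afterwards. This sketch reads four lines per record from an in-memory filehandle holding made-up data; count_sequences is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: reads FASTQ records (four lines each) from an open
# filehandle and counts sequences whose length is between 15 and 30.
sub count_sequences {
    my ($fh) = @_;
    my %count_seq;
    while (   defined(my $head    = <$fh>)
           && defined(my $seq     = <$fh>)
           && defined(my $qhead   = <$fh>)
           && defined(my $quality = <$fh>))
    {
        chomp $seq;
        next if length($seq) > 30 or length($seq) < 15;
        $count_seq{$seq}++;
    }
    return %count_seq;
}

# Made-up FASTQ data: two identical reads and one that is too short.
my $fastq = <<'END';
@read1
TATTATGAGAATAGTGTGCATTTT
+
IIIIIIIIIIIIIIIIIIIIIIII
@read2
TATTATGAGAATAGTGTGCATTTT
+
IIIIIIIIIIIIIIIIIIIIIIII
@read3
ACGT
+
IIII
END

open my $in, '<', \$fastq or die "Can't open in-memory file: $!";
my %count_seq = count_sequences($in);
close $in;

print "$_\t$count_seq{$_}\n" for sort keys %count_seq;
```

With a real file, the in-memory open would be replaced by an ordinary open on the FASTQ filename.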

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Oh, yes, it works! Thank you so much!