matth has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm coming back with a bit of code similar to what I posted yesterday. I am at a loss as to why variables are not passing into the extractseq subroutine.

The code is:

#!/usr/bin/perl -w
#########
### Programed 15_12_02_by Matthew Redden
#########

sub check_gene($);
sub check_gene_seq($);
sub extractseq($$$$);

#use strict;

mkdir Output;

$input = "new_out_again_12_12_02_B_hand_edit_15_12_02.txt";
open (INPUT, "<$input");

while (<INPUT>){
    if ($_ =~ /\s{3}\<gene\sid\s\=\s\"(\d{1,6})\"\slabel\s\=\s\"([.|\.]{1,40})\"\>/){
        check_gene ($_);
    }
    if ($_ =~ /\s{4}\<gene_seq\sid\s\=\s\"(\d{0,6})\"\sstatus\s{0,2}\=\s{0,2}\"(.{0,50})\"\s{0,2}CDS_number\s{0,2}\=\s{0,2}\"(\d{1,3})\"\s{0,2}number_of_CDSs\s{0,2}\=\s{0,2}\"(\d{0,5})\"\s{0,2}sequence_source\s{0,2}\=\s{0,2}\"(.{0,300})\"\s{0,2}startpos\s{0,2}\=\s{0,2}\"(\d{0,9})\"\s{0,2}endpos\s{0,2}\=\s{0,2}\"(\d{0,9})\"\s{0,2}startopen\s{0,2}\=\s{0,2}\"(\d{0,1})\"\sendopen\s{0,2}\=\s{0,2}\"(\d{0,1})\"\s{0,2}complement\s{0,2}\=\s{0,2}\"(.{0,1})\"\>/){
        check_gene_seq ($_);
        $label = "$gene_seq_id.$gene_seq_status.$gene_seq_CDS_number.$gene_seq_number_of_CDSs.$gene_seq_sequence_source.$gene_seq_startpos.$gene_seq_endpos.$gene_seq_startopen.$gene_seq_endopen.$gene_seq_complement";
    }
    extractseq ($label,$gene_seq_sequence_source,$gene_seq_startpos,$gene_seq_endpos);
}

sub check_gene($) {
    # <gene id = "242" label = "SPCC576.16">
    my $line = $_;
    if ($line =~ /\s{3}\<gene\sid\s\=\s\"(\d{1,6})\"\slabel\s\=\s\"([.|\.]{1,40})\"\>/){
        $gene_id = $1;
        $gene_label = $2;
    }
    else {
        print "ERROR. Somebody has altered the regular expression";
    }
    return $gene_id;
    return $gene_label;
}

sub check_gene_seq($){
    # <gene_seq id = "311" status = "Sanger source DNA code" CDS_number = "3" number_of_CDSs = "" sequence_source = "" startpos = "2110545" endpos = "2110823" startopen = "1" endopen = "1" complement = "C"/>
    if ($_ =~ /\s{4}\<gene_seq\sid\s\=\s\"(\d{0,6})\"\sstatus\s{0,2}\=\s{0,2}\"(.{0,50})\"\s{0,2}CDS_number\s{0,2}\=\s{0,2}\"(\d{1,3})\"\s{0,2}number_of_CDSs\s{0,2}\=\s{0,2}\"(\d{0,5})\"\s{0,2}sequence_source\s{0,2}\=\s{0,2}\"(.{0,300})\"\s{0,2}startpos\s{0,2}\=\s{0,2}\"(\d{0,9})\"\s{0,2}endpos\s{0,2}\=\s{0,2}\"(\d{0,9})\"\s{0,2}startopen\s{0,2}\=\s{0,2}\"(\d{0,1})\"\sendopen\s{0,2}\=\s{0,2}\"(\d{0,1})\"\s{0,2}complement\s{0,2}\=\s{0,2}\"(.{0,1})\"\>/){
        $gene_seq_id = $1;
        $gene_seq_status = $2;
        $gene_seq_CDS_number = $3;
        $gene_seq_number_of_CDSs = $4;
        $gene_seq_sequence_source = $5;
        $gene_seq_startpos = $6;
        $gene_seq_endpos = $7;
        $gene_seq_startopen = $8;
        $gene_seq_endopen = $9;
        $gene_seq_complement = $10;
    }
    else {
        print "Error. Someone has altered the reguar expression";
    }
    return $gene_seq_id;
    return $gene_seq_status;
    return $gene_seq_CDS_number;
    return $gene_seq_number_of_CDSs;
    return $gene_seq_sequence_source;
    return $gene_seq_startpoks;
    return $gene_seq_endpos;
    return $gene_seq_startopen;
    return $gene_seq_endopen;
    return $gene_seq_complement;
}

sub extractseq($$$$){ # Pass through the FASTA label, the location of sequence, the start pos and the end pos.
    $label = $_[0];
    $gene_seq_sequence_source = $_[1];
    $startpos = $_[2];
    $endpos = $_[3];
    print "$gene_seq_sequence_source\n";
    sleep 1;
    print "system ('extractseq $gene_seq_sequence_source extracted_seq_$label.txt -regions \"$startpos..$endpos\"')\;";
    system ("extractseq $gene_seq_sequence_source extracted_seq_$label.txt -regions \"$startpos..$endpos\"");
}

The error message I am getting is:

Use of uninitialized value in concatenation (.) or string at extract_seqs.pl line 110, <INPUT> line 41.
Use of uninitialized value in concatenation (.) or string at extract_seqs.pl line 110, <INPUT> line 41.
Use of uninitialized value in concatenation (.) or string at extract_seqs.pl line 110, <INPUT> line 41.
Use of uninitialized value in concatenation (.) or string at extract_seqs.pl line 110, <INPUT> line 41.
Use of uninitialized value in concatenation (.) or string at extract_seqs.pl line 111, <INPUT> line 41.
Use of uninitialized value in concatenation (.) or string at extract_seqs.pl line 111, <INPUT> line 41.
Use of uninitialized value in concatenation (.) or string at extract_seqs.pl line 111, <INPUT> line 41.
Use of uninitialized value in concatenation (.) or string at extract_seqs.pl line 111, <INPUT> line 41.
system ('extractseq extracted_seq_.txt -regions ".."');Extract regions from a sequence
Error: failed to open filename extracted_seq_.txt
Error: Unable to read sequence 'extracted_seq_.txt'

I know that this is a bit of an open-ended debugging job. Sorry about that.

Replies are listed 'Best First'.
Re: Variables not entering sub routine
by Zaxo (Archbishop) on Dec 16, 2002 at 13:14 UTC

    A few comments on style and readability

    • Pick an indentation style and use it. perlstyle gives good advice on one popular choice.
    • I don't know if your regexen are really complicated or just long. The /x modifier will permit whitespace, including newlines, and comments in the regex. Your code could use that.
    • You haven't shown what the data really looks like. There may be a better way of parsing it. Show some data and we may be able to point you to a better way.
    • I don't see what benefit you're getting from prototypes.
    • Uncomment use strict; and pitch in a use warnings;. They are a help in untangling errors.
    • Your subroutines &check_gene and &check_gene_seq have prototypes, but never look at the argument, only at whatever $_ is at the point of call.
    • The &extractseq sub is called with one more argument than the prototype calls for, and the extra is never used.
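
To make the /x and argument-handling points above concrete, here is a minimal sketch rather than a drop-in fix. The sub name parse_gene is hypothetical, and the label pattern [^"]{1,40} is an assumption about what the data allows; the original character class would only accept dots and pipes.

```perl
use strict;
use warnings;

# Hypothetical helper: reads the line from its argument list, not from $_.
sub parse_gene {
    my ($line) = @_;
    my ($id, $label) = $line =~ /
        <gene \s+
        id    \s* = \s* "(\d{1,6})"    \s+   # numeric gene id
        label \s* = \s* "([^"]{1,40})" \s*   # label, assumed to be any text without quotes
        >
    /x or return;    # empty list when the line does not match
    return ($id, $label);
}

my ($id, $label) = parse_gene('   <gene id = "242" label = "SPCC576.16">');
print "id=$id label=$label\n" if defined $id;
```

Because the sub returns an empty list on failure, the caller can tell a non-matching line apart from a matching one without consulting any globals.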

    Update: Just noticed - my 1000th node.

    After Compline,
    Zaxo

      An example of the two lines of interest in the input text is shown below:
      <gene id = "251" label = "gene_of_interest">
      <gene_seq id = "321" status = "Sanger source DNA code" CDS_number = "1" number_of_CDSs = "" sequence_source "/data/databases/flatfiles/sequences/species/genome/embl/ch1.embl" startpos = "2435591" endpos = "2436562" startopen = "1" endopen = "1" complement = "F"/>
      If I could ask an XML question in relation to this: does any XML convention suggest that the two ids here should have the same value?

        If your data is valid XML, run, don't walk, to the XML namespace. Something there will save you a bazillion headaches and make this job easy.

        Yes, you haven't yet reached a closing </gene> tag so the gene_seq data is part of gene's data. If there is no closing tag, you don't have valid XML.
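
As a rough illustration of the parser route, the sketch below uses XML::LibXML, which is one CPAN option among several and is assumed to be installed. The sample string is an assumption too: it is shaped like the two lines posted above, trimmed to a few attributes, with a closing </gene> added so that it is well-formed.

```perl
use strict;
use warnings;
use XML::LibXML;    # CPAN module, assumed to be installed

# Assumed sample: well-formed version of the posted lines, shortened.
my $xml = join '',
    '<gene id="251" label="gene_of_interest">',
    '<gene_seq id="321" CDS_number="1" startpos="2435591" endpos="2436562" complement="F"/>',
    '</gene>';

my $doc = XML::LibXML->load_xml(string => $xml);

# Walk each gene and the gene_seq elements nested inside it.
for my $gene ($doc->findnodes('//gene')) {
    for my $seq ($gene->findnodes('./gene_seq')) {
        printf "%s: %s..%s\n",
            $gene->getAttribute('label'),
            $seq->getAttribute('startpos'),
            $seq->getAttribute('endpos');
    }
}
```

The parser refuses input that is not well-formed, so a missing closing tag or a missing "=" shows up as an error at load time instead of as silently undefined variables.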

        After Compline,
        Zaxo

Re: Variables not entering sub routine
by derby (Abbot) on Dec 16, 2002 at 13:23 UTC
    to add to Zaxo's comments:

    you call extractseq for every line in your input file (that's probably why you're getting the uninit warnings). Shouldn't that call be tucked inside the second conditional or at least inside another conditional that ensures all the parameters have been set?
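
A minimal sketch of that guard, under simplifying assumptions: extract_region is a hypothetical stand-in for extractseq that only records its arguments, and the regex is cut down to the two positions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real sub; it only records what it was given.
my @calls;
sub extract_region {
    my ($start, $end) = @_;
    push @calls, "$start..$end";
}

my @lines = (
    '<gene id = "251" label = "gene_of_interest">',
    '<gene_seq id = "321" startpos = "2435591" endpos = "2436562"/>',
    'some unrelated line',
);

for my $line (@lines) {
    # Call the sub only when this line actually supplied both values.
    if (my ($start, $end) = $line =~ /startpos\s*=\s*"(\d+)"\s+endpos\s*=\s*"(\d+)"/) {
        extract_region($start, $end);
    }
}

print "$_\n" for @calls;
```

Of the three input lines, only the gene_seq line triggers a call, so no call ever sees undefined arguments.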

    -derby

Re: Variables not entering sub routine
by chromatic (Archbishop) on Dec 16, 2002 at 18:41 UTC

    What makes you think the variables aren't being passed in to your function?

    You have several problems. One of the most serious is that you're parsing (what looks like) XML with regular expressions and you aren't checking to see if the regexes match. Another problem is that you're using global variables. Why are you bothering to return variables from your functions if you're not going to use them? (By the way, your returns are still broken. Why should I give you further advice if you ignored my previous advice? I don't know.)

    The problem you're seeing right here is that you're trying to call check_gene_seq() with variables that may or may not be defined, and that may or may not come from the line you're currently processing. Another comment recommends using lexicals and a hash, and I agree. You'll probably get a much better payoff if you use a real XML parsing module though.
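
On the broken returns: a sub stops at the first return it reaches, so a stack of consecutive return statements only ever delivers the first value. A small sketch with hypothetical sub names:

```perl
use strict;
use warnings;

# Only the first return statement ever runs, so this sub yields one value.
sub first_only {
    return 'id';
    return 'label';    # never reached
}

# Returning a list hands back every value in a single statement.
sub both {
    return ('id', 'label');
}

my @one = first_only();
my @two = both();
print scalar(@one), " value vs ", scalar(@two), " values\n";
```

The caller then collects the values with a list assignment such as my ($id, $label) = both(); and no globals are needed.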

Re: Variables not entering sub routine
by Jasper (Chaplain) on Dec 16, 2002 at 15:38 UTC
    $gene_seq_id = $1;
    $gene_seq_status = $2;
    $gene_seq_CDS_number = $3;
    ...
    $gene_seq_endopen = $9;
    $gene_seq_complement = $10;


    Is there a specific reason you're not using a hash here? Then you can access the data more easily, and pass it around more easily.
    my %gene_seq = (
        id         => $1,
        status     => $2,
        CDS_number => $3,
        ... # the yaddah yaddah yaddah operator
    );
    return \%gene_seq; # return a reference to the hash, to be dereffed at the other end
    Apologies for not helping with your specific problem more, but once the code is down to a reasonable size, I'm sure it'll be much easier to debug. :)
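
One way the receiving end could look, as a sketch only: build_command is a hypothetical sub that takes the hash reference and returns the command string without running anything. The command layout is copied from the original post, and the sample values are assumptions.

```perl
use strict;
use warnings;

# Hypothetical sub: builds the extractseq command line from a hash reference
# instead of global variables. It returns the string and runs nothing.
sub build_command {
    my ($seq) = @_;
    for my $key (qw(sequence_source startpos endpos)) {
        die "missing $key\n" unless defined $seq->{$key} && length $seq->{$key};
    }
    return sprintf 'extractseq %s extracted_seq_%s.txt -regions "%d..%d"',
        @{$seq}{qw(sequence_source id startpos endpos)};
}

# Assumed sample values, loosely based on the data shown in the thread.
my %gene_seq = (
    id              => 321,
    sequence_source => 'ch1.embl',
    startpos        => 2435591,
    endpos          => 2436562,
);
print build_command(\%gene_seq), "\n";
```

Checking the required keys up front turns the "uninitialized value" warnings into one clear error at the point where the data went missing.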

    Jasper

    ps do '='s need to be escaped in a regex? (rhetorical question) Also, using different regex delimiters (# instead of /) might make it more readable again when you are escaping a load of characters. Well, unless you're using a lot of comments in a //x regex! Then you could use a pipe, or whatever.
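
A quick check of both points in the postscript, using a shortened version of the gene pattern (the sample line is taken from the thread, the variable names are made up):

```perl
use strict;
use warnings;

my $line = '<gene id = "251" label = "gene_of_interest">';

# Original style: every punctuation character escaped, slash delimiters.
my $escaped = $line =~ /\<gene\sid\s\=\s\"(\d{1,6})\"/ ? $1 : undef;

# Same pattern with braces as delimiters and no unnecessary backslashes.
my $plain = $line =~ m{<gene\sid\s=\s"(\d{1,6})"} ? $1 : undef;

print "$escaped $plain\n";
```

Both patterns capture the same id, since "<", "=" and the double quote have no special meaning in a Perl regex and need no backslash.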