iangibson has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed colleagues: I have written a little program to extract relevant lines out of data files and format them nicely, as shown:

#!/usr/bin/perl
use warnings;
use strict;

# Formats a PennCNV output file to remove extraneous information. Also removes
# CNVs containing less than 5 consecutive SNPs. Invoke this program with the name
# of the file to be modified.

# Open input file and create output file, adding suffix '.truncated' to filename:
open( my $file, '<', $ARGV[0] ) or die "Cannot open for reading, $!";
open( my $out, '>', $ARGV[0] . '.truncated' ) or die "Cannot open for writing, $!";

# Print headings to output file:
printf $out "%-28s %-30s %-12s %-10s %-18s %-22s %-21s %s \n\n",
    "Sample I.D.", "Chromosome & coordinates", "Copy number", "No. SNPs",
    "CNV length (bp)", "First SNP", "Last SNP", "Overlapping Gene(s)";

# Loop that matches each valid line of input file, using capturing parentheses to
# isolate each separate field. We also use '/x' flag to allow arbitrary whitespace
# so we can break the regular expression up and add comments for readability:
while (<$file>) {
    if (
        /
        (chr\d+:\d+-\d+)     # $1  Chromosome & coordinates
        \s+
        (numsnp=\d+)         # $2  Number of SNPs
        \s+
        (length=\S+)         # $3  CNV length (bp)
        \s+
        (state\d+)           # $4  HMM state
        ,
        (cn=\d+)             # $5  Copy number
        \s+
        (\S+)                # $6  File directory
        \/
        (\S+)                # $7  Sample I.D.
        \s+
        (startsnp=rs\d+)     # $8  First SNP in CNV
        \s+
        (endsnp=rs\d+)       # $9  Last SNP in CNV
        \s+
        (\S+)                # $10 Gene(s) overlapping CNV
        \s+
        (\S+)                # $11 Distance of gene(s) from CNV
        /x
        and ( !/numsnp=[1-4]\s+/ )    # we also ignore CNVs with less than 5 SNPs
      )
    {
        # Print each line to output file using left-justified formatting:
        printf $out "%-28s %-30s %-12s %-10s %-18s %-22s %-21s %s \n",
            $7, $1, $5, $2, $3, $8, $9, $10;
    }
}

# Close the open filehandles:
close($file);
close($out);

My problem arises because I want to modify the program to write the rejected lines to a separate file, but when I try to use the capturing parentheses in an 'else' block inside the while loop right after the 'if', I end up with a blank file and perl gives me lots of errors relating to the $1, $2 etc. being out of scope.

So I'd like to know if there's an easy way of diverting any lines that don't match the 'if' patterns into a separate file.

Also, if anyone can spot other potential issues with my code, I'd appreciate constructive criticism. Thanks.

Replies are listed 'Best First'.
Re: Capturing parentheses out of scope
by ikegami (Patriarch) on Jan 20, 2011 at 21:11 UTC

    When the second match is successful, it's clobbering the captures from the first match. So save the captures of the first match before attempting the second.

    Furthermore, the structure you mention makes no sense. You'd be using $1, etc even when the first match op didn't match.
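The clobbering described above can be reproduced in a few lines. This is only a minimal sketch: the sample line and the variable names `@saved` and `$first_after_second` are made up for illustration, and the patterns are shortened versions of the ones in the question.

```perl
use strict;
use warnings;

# Hypothetical sample line, loosely modelled on the fields in the question.
my $line = "chr1:100-200 numsnp=3 length=101";

my @saved;
my $first_after_second;

if ( $line =~ /(chr\d+:\d+-\d+)\s+(numsnp=\d+)/ ) {
    # Copy the captures while they still belong to this match.
    @saved = ( $1, $2 );

    # A second successful match resets the capture variables. This pattern
    # has no capture groups, so $1 is undef afterwards.
    if ( $line =~ /numsnp=[1-4]\s+/ ) {
        $first_after_second = $1;
    }
}

print "saved: @saved\n";
print "\$1 after second match: ",
    defined $first_after_second ? $first_after_second : 'undef', "\n";
```

The copy in `@saved` survives the second match, while `$1` read after it does not.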

    Fix:

    while (<$file>) {
        if (my @captures = /
            (chr\d+:\d+-\d+)     # 0  Chromosome & coordinates
            \s+
            (numsnp=\d+)         # 1  Number of SNPs
            \s+
            (length=\S+)         # 2  CNV length (bp)
            \s+
            (state\d+)           # 3  HMM state
            ,
            (cn=\d+)             # 4  Copy number
            \s+
            (\S+)                # 5  File directory
            \/
            (\S+)                # 6  Sample I.D.
            \s+
            (startsnp=rs\d+)     # 7  First SNP in CNV
            \s+
            (endsnp=rs\d+)       # 8  Last SNP in CNV
            \s+
            (\S+)                # 9  Gene(s) overlapping CNV
            \s+
            (\S+)                # 10 Distance of gene(s) from CNV
        /x) {
            if ( /numsnp=[1-4]\s+/ ) {
                ...
            } else {
                printf $out "%-28s %-30s %-12s %-10s %-18s %-22s %-21s %s\n",
                    @captures[6,0,4,1,2,7,8,9];
            }
        }
    }

    But why are you using a regex match to do a numerical comparison on a value you already have? Fix:

    while (<$file>) {
        if (my @captures = /
            (chr\d+:\d+-\d+)     # 0  Chromosome & coordinates
            \s+
            numsnp=(\d+)         # 1  Number of SNPs (digits only, so it compares numerically)
            \s+
            (length=\S+)         # 2  CNV length (bp)
            \s+
            (state\d+)           # 3  HMM state
            ,
            (cn=\d+)             # 4  Copy number
            \s+
            (\S+)                # 5  File directory
            \/
            (\S+)                # 6  Sample I.D.
            \s+
            (startsnp=rs\d+)     # 7  First SNP in CNV
            \s+
            (endsnp=rs\d+)       # 8  Last SNP in CNV
            \s+
            (\S+)                # 9  Gene(s) overlapping CNV
            \s+
            (\S+)                # 10 Distance of gene(s) from CNV
        /x) {
            if ( $captures[1] >= 1 && $captures[1] <= 4 ) {
                ...
            } else {
                printf $out "%-28s %-30s %-12s %-10s %-18s %-22s %-21s %s\n",
                    @captures[6,0,4,1,2,7,8,9];
            }
        }
    }

    This last snippet avoids the original problem, so you could go back to using $1 and such, but that would go back to needlessly using global variables.

Re: Capturing parentheses out of scope
by toolic (Bishop) on Jan 20, 2011 at 21:03 UTC
    Why can't you just divert the whole line that doesn't match to your second file?
    } else { print $out2 $_; }

    Update:

    "Also, if anyone can spot other potential issues with my code, I'd appreciate constructive criticism."
    You could store your string format in a variable, then use it in both printf statements:
    my $fmt = "%-28s %-30s %-12s %-10s %-18s %-22s %-21s %s \n";
    ...
    printf $out $fmt, $7, $1 ...
    You could also consider using Text::Table instead.
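As a rough illustration of the shared-format idea, the sketch below routes both the heading row and a data row through one format string. The helper name `format_row`, the shortened four-column format, and the sample values are assumptions made for this example, not part of the original program.

```perl
use strict;
use warnings;

# One format string shared by the heading row and every data row
# (shortened to four columns for the example).
my $fmt = "%-28s %-30s %-12s %s\n";

# Hypothetical helper: format one row of fields with the shared format.
sub format_row {
    my @fields = @_;
    return sprintf $fmt, @fields;
}

print format_row( "Sample I.D.", "Chromosome & coordinates", "Copy number", "No. SNPs" );
print format_row( "sample1.txt", "chr1:100-200", "cn=1", "numsnp=12" );
```

Changing a column width then only needs one edit, and headings and data cannot drift apart.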
      I could do this, but I'd like these lines to also have my new formatting if possible.
Re: Capturing parentheses out of scope
by ELISHEVA (Prior) on Jan 20, 2011 at 21:27 UTC

    If you can't just print out the entire line as toolic suggested, then you will need to break up that monster regex.

    Regexes are an all-or-nothing thing. It isn't that the variables are going out of scope: a successful match sets all of its captures, and a failed match sets none of them. Since the match fails, $1, $2, etc. are not set by it, and you get those lovely undefs even for fields that you know would have matched one of the captures in the regex.

    Instead, you will need to split the line into fields and check for valid data in each field one by one. If each field passes your validation test, pretty format the lines and print that line to the "good lines" file. If not, print the fields of interest to the "bad lines" file.
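A minimal sketch of that field-by-field approach might look like the following. The `@checks` table, the `invalid_fields` helper, and the three-field sample lines are hypothetical; a real PennCNV line has more columns, and the checks would need to be extended to match.

```perl
use strict;
use warnings;

# Hypothetical per-field checks, in the column order used in the question.
my @checks = (
    [ 'coordinates' => qr/^chr\d+:\d+-\d+\z/ ],
    [ 'numsnp'      => qr/^numsnp=\d+\z/ ],
    [ 'length'      => qr/^length=\S+\z/ ],
);

# Returns the names of the fields that fail their check
# (an empty list if every field passes).
sub invalid_fields {
    my ($line) = @_;
    my @fields = split ' ', $line;
    my @bad;
    for my $i ( 0 .. $#checks ) {
        my ( $name, $re ) = @{ $checks[$i] };
        push @bad, $name
            unless defined $fields[$i] && $fields[$i] =~ $re;
    }
    return @bad;
}

for my $line ( "chr1:100-200 numsnp=12 length=101",
               "chr1:100-200 snps=12 length=101" ) {
    my @bad = invalid_fields($line);
    print @bad ? "bad ($line): @bad\n" : "good: $line\n";
}
```

Because each field is checked on its own, a line that fails one check still has all its other fields available for the "bad lines" file.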

Re: Capturing parentheses out of scope
by Anonyrnous Monk (Hermit) on Jan 20, 2011 at 21:04 UTC

    You could assign them (i.e. $1..$N) to an array declared in the outer scope — or in the if-scope: if (my @f = /...

    But what would you expect them to hold if the regex doesn't match? Or is it only the second test (and ( !/numsnp=[1-4]\s+/ )) that is expected to fail in these cases?

      To clarify: every line will match the first regex, so only the second test is important in distinguishing the two groups of lines (as you suggested). I'd like the rejected lines to go into a file in the same format as the successful lines so I can visually check that nothing was rejected that shouldn't have been.
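Given that clarification, one way to get both groups in the same layout is to save the captures in an array and then pick the output filehandle from the SNP count. The sketch below is an assumption-laden illustration: it uses a shortened three-field pattern and format, made-up sample lines, and in-memory filehandles (`$kept`, `$rejected`) in place of real files.

```perl
use strict;
use warnings;

# Shortened, hypothetical version of the pattern and format from the question.
my $fmt = "%-14s %-12s %s\n";
my $re  = qr/(chr\d+:\d+-\d+)\s+numsnp=(\d+)\s+(length=\S+)/;

# In-memory filehandles stand in for the '.truncated' file and a rejects file.
my ( $kept, $rejected ) = ( '', '' );
open( my $out,     '>', \$kept )     or die "Cannot open for writing, $!";
open( my $rejects, '>', \$rejected ) or die "Cannot open for writing, $!";

my @lines = (
    "chr1:100-200 numsnp=12 length=101",
    "chr2:300-350 numsnp=3 length=51",
);

for my $line (@lines) {
    # Save the captures in a lexical array straight away.
    my @f = $line =~ $re;
    next unless @f;

    # Fewer than 5 SNPs: same layout, different destination.
    my $fh = $f[1] < 5 ? $rejects : $out;
    printf {$fh} $fmt, $f[0], "numsnp=$f[1]", $f[2];
}

close($out);
close($rejects);

print "kept:\n$kept", "rejected:\n$rejected";
```

Only the choice of filehandle differs between the two groups, so the rejected lines can be compared visually against the accepted ones.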