If Statements and Regular Expressions

TempAcolyte has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to use a regular expression to search for values stored in a hash that begin with three letters: "Hox". This is driving me crazy, and I figure I must be overlooking something very simple. I've got my script working except for my if statement. Can you wise monks help me see what I can't?

#!/usr/bin/perl -w
# Open Tab-delimited file, extract and display HOX gene info
#FILE HANDLING
open INPUT,"<MOUSE_TF1.txt";    
open OUTPUT,">HOX_GENE_TF.txt";   
#
# MAIN
print OUTPUT "MOUSE TRANSCRIPTION FACTORS IN THE HOX GENE FAMILY\n\n\n";
print OUTPUT "ENSEMBL GENE ID \tSYMBOL \tCHR\tSTART\t\tEND\t\tSTRAND\n";
while (<INPUT>){
  $line = $_;
  chop($line);
# split line by tab character
  @Gene_Info = split("\t",$line);            
# Defines fields as scalars
  $Ensembl_Gene_ID = $Gene_Info[0];          
  $Chromosome_Name = $Gene_Info[1];
  $Gene_Start = $Gene_Info[2];
  $Gene_End = $Gene_Info[3];
  $Chr_Strand = $Gene_Info[4];
  $Gene_Symbol = $Gene_Info[5];
# Create hashes to store table values and associate with key
  $chrs{$Ensembl_Gene_ID} = $Chromosome_Name;
  $starts{$Ensembl_Gene_ID} = $Gene_Start;
  $ends{$Ensembl_Gene_ID} = $Gene_End;
  $symbols{$Ensembl_Gene_ID} = $Gene_Symbol;
  $strands{$Ensembl_Gene_ID} = $Chr_Strand;
  }
close INPUT;
# select Gene Symbols belonging to "Hox" family and print
foreach $key (keys %symbols) {
  if ($Gene_Symbol =~ /Hox/) {
     print OUTPUT $key,"\t";
        print OUTPUT $symbols{$key},"\t";
        print OUTPUT $chrs{$key},"\t";
        print OUTPUT $starts{$key},"\t";
        print OUTPUT $ends{$key},"\t";
        print OUTPUT $strands{$key},"\n";
   }
}
close INPUT;
close OUTPUT;
#end

Re: If Statements and Regular Expressions by Animator (Hermit) on Sep 30, 2008 at 22:42 UTC
What you are missing is `use strict;`. I strongly suggest you read: Coping with scoping. As far as your problem goes: I'm guessing you want: `if ($key =~ /Hox/) {` instead of `if ($Gene_Symbol =~ /Hox/) {`.
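As a rough sketch of why `use strict;` helps here (the one-entry hash and the ID `ENSMUSG0001` are made-up illustration data, and the string `eval` is only a device to capture the compile error rather than abort): once `$Gene_Symbol` is declared with `my` inside the reading loop, any later use of it outside that loop is rejected at compile time.

```perl
use strict;
use warnings;

# The snippet is compiled with a string eval so the strict failure can be
# captured in $@ instead of aborting this script.
my $snippet = <<'END_SNIPPET';
use strict;
my %symbols = (ENSMUSG0001 => 'Hoxa1');   # hypothetical data
for my $line ('ENSMUSG0001') {
    my $Gene_Symbol = $symbols{$line};    # lexical, visible only in this loop
}
foreach my $key (keys %symbols) {
    print "$key\n" if $Gene_Symbol =~ /Hox/;   # no longer in scope here
}
1;
END_SNIPPET

my $compiled = eval $snippet;
my $error    = $@;

print $compiled ? "compiled\n" : "strict refused it: $error";
```

Without strictures, the same code would compile and silently test an unrelated (or stale) package variable, which is the situation in the original script.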
Re: If Statements and Regular Expressions by GrandFather (Saint) on Sep 30, 2008 at 23:16 UTC
There is a pile of stuff that can be tidied up there. First off, always use strictures (`use strict; use warnings;`). Use chomp instead of chop. Use the three-parameter version of open and check the result. Use lexical file handles:

open my $inFile, '<', "MOUSE_TF1.txt" or die "Failed to open MOUSE_TF1.txt: $!";

Instead of using "parallel" data structures that have to be handled piecemeal, group common data using a hash:

use warnings;
use strict;

open my $inFile, '<', "MOUSE_TF1.txt" or die "Failed to open MOUSE_TF1.txt: $!";

my %families;
# Field names in the same order as the columns that follow the gene ID
my @fields = qw(chr start end strand symbol);

while (<$inFile>) {
    my $line = $_;
    chomp ($line);
    my @Gene_Info = split "\t", $line;
    my $id = shift @Gene_Info;
    @{$families{$id}}{@fields} = @Gene_Info;
}

close $inFile;

# select Gene Symbols belonging to "Hox" family and print
foreach my $key (keys %families) {
    if ($families{$key}{symbol} =~ /Hox/) {
        print join ("\t", $key, @{$families{$key}}{@fields}), "\n";
    }
}

untested

Perl reduces RSI - it saves typing
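The hash-slice assignment in that reply may be unfamiliar, so here is a small sketch of it in isolation. The input line is invented for illustration, and the field order is an assumption taken from the column layout in the question (chromosome, start, end, strand, symbol after the ID).

```perl
use strict;
use warnings;

# Field names in the column order used by the question's file (after the ID).
my @fields = qw(chr start end strand symbol);

# A made-up input line; the values are illustrative, not real annotations.
my $line = join "\t", 'ENSMUSG0001', '6', '100', '200', '-1', 'Hoxa1';

my %families;
my ($id, @values) = split /\t/, $line;

# Hash slice: assign all five fields of the inner hash in one statement.
# $families{$id} is autovivified as a hash reference here.
@{ $families{$id} }{@fields} = @values;

# The same slice syntax reads the fields back in a fixed order.
print join("\t", $id, @{ $families{$id} }{@fields}), "\n";
```

Because the slice pairs names with values purely by position, the order of `@fields` has to match the order of the columns in the file.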
Re^2: If Statements and Regular Expressions by JavaFan (Canon) on Oct 01, 2008 at 00:12 UTC
> Instead of using "parallel" data structures that have to be handled piecemeal, group common data using a hash

But that uses more memory, which can be significant if you have a lot of data.

#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use Devel::Size qw [total_size];

my $size = 250_000;
my @fields = qw [chr start end symbol strand];
my %families;
my @structs = \my (%chr, %start, %end, %symbol, %strand);

foreach my $key (1 .. $size) {
    $families{$key}{$_} = undef for @fields;
    $$_{$key} = undef for @structs;
}

my $s1 = total_size \%families;
my $s2 = 0;
$s2 += total_size $_ for @structs;

printf "Big hash: %d Mb\n", $s1 / (1024 * 1024);
printf "More hashes: %d Mb\n", $s2 / (1024 * 1024);
printf "Savings: %.0f%%\n", 100 * ($s1 - $s2) / $s1;

__END__
Big hash: 71 Mb
More hashes: 61 Mb
Savings: 14%

Grouping the data in a hash causes you to have 250001 hashes (one outer hash plus one inner hash per key), instead of just 5. That carries a 10 Mb penalty.
Re^3: If Statements and Regular Expressions by GrandFather (Saint) on Oct 01, 2008 at 00:27 UTC
Interesting. Using Perl 5.8.8 rather than 5.10 makes the same saving for 'Big hash':

Big hash: 61 Mb
More hashes: 52 Mb
Savings: 16%

so should we all use 5.8.8 rather than 5.10? Actually, 14 (or 16) % is sufficiently small to be irrelevant for most purposes. Clarity of code is the more important factor (not that I'm claiming that my code is any clearer, mind you), especially for a first cut. Optimize later if you need to.

Perl reduces RSI - it saves typing
Re^4: If Statements and Regular Expressions by JavaFan (Canon) on Oct 01, 2008 at 00:56 UTC
Re: If Statements and Regular Expressions by pjotrik (Friar) on Sep 30, 2008 at 22:50 UTC
You're definitely not matching what you want to... In the line `if ($Gene_Symbol =~ /Hox/) {`, $Gene_Symbol still has the value it had in the last round of the while loop; it is in no way affected by the foreach loop. What you probably want is to match against the value that you stored in the %symbols hash for every key. That is, `if ($symbols{$key} =~ /Hox/) {`
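A minimal sketch of that point, using invented IDs and symbols rather than the real file: after the reading loop finishes, the scalar holds only the last symbol read, so testing it inside the second loop gives the same answer for every key, whereas testing `$symbols{$key}` looks at each stored value in turn.

```perl
use strict;
use warnings;

# Hypothetical sample rows: [ gene ID, gene symbol ].
my @rows = (
    [ 'ENSMUSG0001', 'Hoxa1' ],
    [ 'ENSMUSG0002', 'Pax6'  ],
    [ 'ENSMUSG0003', 'Sox2'  ],
);

my %symbols;
my $Gene_Symbol;    # declared outside the loop, like the global in the question
for my $row (@rows) {
    my $id;
    ($id, $Gene_Symbol) = @$row;
    $symbols{$id} = $Gene_Symbol;
}

# $Gene_Symbol is now 'Sox2' (the last row) and never changes again,
# so this test is false for every key.
my @stale = grep { $Gene_Symbol =~ /Hox/ } sort keys %symbols;

# Matching the stored value checks each gene individually.
my @correct = grep { $symbols{$_} =~ /Hox/ } sort keys %symbols;

print "stale test matched:   @stale\n";
print "per-key test matched: @correct\n";
```

Had the last row happened to be a Hox gene, the stale test would instead have matched every key, which is just as wrong.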
Re: If Statements and Regular Expressions by toolic (Bishop) on Sep 30, 2008 at 23:33 UTC
If I'm not mistaken, you are trying to print out lines in which the "symbol" starts with the string "Hox". If that is the case, I see no need for building up a hash at all; just process the lines as you read them in. If you really want to re-order the columns, then this might be a simpler way to do things (UNTESTED):

use strict;
use warnings;

open INPUT,"<MOUSE_TF1.txt" or die "can not open MOUSE_TF1.txt: $!";
open OUTPUT,">HOX_GENE_TF.txt" or die "can not open HOX_GENE_TF.txt: $!";

print OUTPUT "MOUSE TRANSCRIPTION FACTORS IN THE HOX GENE FAMILY\n\n\n";
print OUTPUT "ENSEMBL GENE ID \tSYMBOL \tCHR\tSTART\t\tEND\t\tSTRAND\n";

while (<INPUT>){
    chomp;
    my ($id, $cname, $start, $end, $strand, $sym) = split /\t/;
    # select Gene Symbols belonging to "Hox" family and print
    if ($sym =~ /^Hox/) {
        print OUTPUT join("\t", $id, $sym, $cname, $start, $end, $strand), "\n"
    }
}
close INPUT;
close OUTPUT;

If you are willing to preserve the column order from the input, this simplifies even further to:

use strict;
use warnings;

open INPUT,"<MOUSE_TF1.txt" or die "can not open MOUSE_TF1.txt: $!";
open OUTPUT,">HOX_GENE_TF.txt" or die "can not open HOX_GENE_TF.txt: $!";

print OUTPUT "MOUSE TRANSCRIPTION FACTORS IN THE HOX GENE FAMILY\n\n\n";
print OUTPUT "ENSEMBL GENE ID \tCHR\tSTART\t\tEND\t\tSTRAND\tSYMBOL\n";

while (<INPUT>){
    chomp;
    my @items = split /\t/;
    # select Gene Symbols belonging to "Hox" family and print
    if ($items[5] =~ /^Hox/) { print OUTPUT "$_\n" }
}
close INPUT;
close OUTPUT;
Re: If Statements and Regular Expressions by JadeNB (Chaplain) on Sep 30, 2008 at 23:12 UTC
In addition to the very useful replies above, note that you are matching the string `Hox` any time it occurs in the word. If you want it only at the beginning, then you need the regex `/^Hox/`.
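The difference between the two patterns can be sketched with a few sample symbols. The list is an assumption for illustration only; in particular `XHox1` is an invented name that contains "Hox" without starting with it.

```perl
use strict;
use warnings;

# Hypothetical symbols; 'XHox1' is invented to show a mid-string match.
my @symbols = qw(Hoxa1 Hoxb13 XHox1 Sox2);

my @anywhere = grep { /Hox/  } @symbols;   # 'Hox' at any position
my @at_start = grep { /^Hox/ } @symbols;   # 'Hox' only at the start

print "unanchored: @anywhere\n";
print "anchored:   @at_start\n";
```

Both patterns are case-sensitive, so a symbol written as `HOXA1` or `hoxa1` would match neither unless the `/i` modifier is added.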
Re: If Statements and Regular Expressions by TempAcolyte (Initiate) on Oct 01, 2008 at 01:59 UTC
Thank you all for your help and insights. Reading through your comments and code critiques (and the article link) was just what I needed tonight. I tightened up my code per your suggestions, changed `if ($Gene_Symbol =~ /Hox/) {` to `if ($symbols{$key} =~ /^Hox/) {`, and it worked like a charm. Now I not only have a script that works, but I also learned a few new approaches to try as well. (I am working with very large datasets, so I think I'll stick with five hashes.) Thanks again wise monks!! -- TempAcolyte
Re: If Statements and Regular Expressions by JavaFan (Canon) on Sep 30, 2008 at 23:48 UTC
Considering that what you basically do is find lines that start with 'Hox', why don't you just do: `$ grep ^Hox MOUSE_TF1.txt > HOX_GENE_TF.txt` If the reordering of the columns is essential, you can always pipe the output of grep through col.