TempAcolyte has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to use a regular expression to search for values stored in a hash that begin with three letters: "Hox." This is driving me crazy and I figure I must be overlooking something very simple. I've got my script working so that everything works except my if statement. Can you wise monks help me see what I can't?
    #!/usr/bin/perl -w
    # Open Tab-delimited file, extract and display HOX gene info

    #FILE HANDLING
    open INPUT,"<MOUSE_TF1.txt";
    open OUTPUT,">HOX_GENE_TF.txt";
    #
    # MAIN
    print OUTPUT "MOUSE TRANSCRIPTION FACTORS IN THE HOX GENE FAMILY\n\n\n";
    print OUTPUT "ENSEMBL GENE ID \tSYMBOL \tCHR\tSTART\t\tEND\t\tSTRAND\n";
    while (<INPUT>){
        $line = $_;
        chop($line);
        # split line by tab character
        @Gene_Info = split("\t",$line);
        # Defines fields as scalars
        $Ensembl_Gene_ID = $Gene_Info[0];
        $Chromosome_Name = $Gene_Info[1];
        $Gene_Start = $Gene_Info[2];
        $Gene_End = $Gene_Info[3];
        $Chr_Strand = $Gene_Info[4];
        $Gene_Symbol = $Gene_Info[5];
        # Create hashes to store table values and associate with key
        $chrs{$Ensembl_Gene_ID} = $Chromosome_Name;
        $starts{$Ensembl_Gene_ID} = $Gene_Start;
        $ends{$Ensembl_Gene_ID} = $Gene_End;
        $symbols{$Ensembl_Gene_ID} = $Gene_Symbol;
        $strands{$Ensembl_Gene_ID} = $Chr_Strand;
    }
    close INPUT;
    # select Gene Symbols belonging to "Hox" family and print
    foreach $key (keys %symbols) {
        if ($Gene_Symbol =~ /Hox/) {
            print OUTPUT $key,"\t";
            print OUTPUT $symbols{$key},"\t";
            print OUTPUT $chrs{$key},"\t";
            print OUTPUT $starts{$key},"\t";
            print OUTPUT $ends{$key},"\t";
            print OUTPUT $strands{$key},"\n";
        }
    }
    close INPUT;
    close OUTPUT;
    #end

Replies are listed 'Best First'.
Re: If Statements and Regular Expressions
by Animator (Hermit) on Sep 30, 2008 at 22:42 UTC

    What you are missing is use strict;.

    I strongly suggest you read: Coping with scoping.

    As far as your problem goes: I'm guessing you want: if ($key =~ /Hox/) { instead of if ($Gene_Symbol =~ /Hox/) {.

Re: If Statements and Regular Expressions
by GrandFather (Saint) on Sep 30, 2008 at 23:16 UTC

    There is a pile of stuff that can be tidied up there. First off, always use strictures (use strict; use warnings;).

    Use chomp instead of chop.
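    The difference matters most on a final line that has no trailing newline. A minimal sketch, assuming a made-up sample record (the two helper subs are hypothetical, written only for this illustration):

```perl
use strict;
use warnings;

# chop always removes the last character, whatever it is.
# chomp removes only a trailing newline (strictly, the value of $/).
sub strip_with_chop  { my ($s) = @_; chop $s;  return $s }
sub strip_with_chomp { my ($s) = @_; chomp $s; return $s }

# Made-up record standing in for a last line without a newline.
my $last_line = "ID_A\tHoxa1";

print strip_with_chop($last_line),  "\n";   # the final "1" is lost
print strip_with_chomp($last_line), "\n";   # the record is unchanged
```

    With chop, the symbol on such a line would silently lose its last character before it ever reaches the hash.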

    Use the three parameter version of open and check the result. Use lexical file handles:

    open my $inFile, '<', "MOUSE_TF1.txt" or die "Failed to open MOUSE_TF1.txt: $!";

    Instead of using "parallel" data structures that have to be handled piecemeal, group common data using a hash:

    use warnings;
    use strict;

    open my $inFile, '<', "MOUSE_TF1.txt" or die "Failed to open MOUSE_TF1.txt: $!";

    my %families;
    # Field names in the input's column order (after the leading gene ID)
    my @fields = qw(chr start end strand symbol);

    while (<$inFile>) {
        my $line = $_;
        chomp ($line);
        my @Gene_Info = split "\t", $line;
        my $id = shift @Gene_Info;
        # Hash slice: store all fields for this ID in one nested hash
        @{$families{$id}}{@fields} = @Gene_Info;
    }

    close $inFile;

    # select Gene Symbols belonging to "Hox" family and print
    foreach my $key (keys %families) {
        # Match the stored symbol, not the gene ID used as the key
        if ($families{$key}{symbol} =~ /Hox/) {
            print join ("\t", $key, @{$families{$key}}{@fields}), "\n";
        }
    }

    untested


    Perl reduces RSI - it saves typing
      Instead of using "parallel" data structures that have to be handled piecemeal, group common data using a hash

      But that uses more memory. Which can be significant if you have a lot of data.

      #!/usr/bin/perl
      use 5.010;
      use strict;
      use warnings;
      use Devel::Size qw [total_size];

      my $size = 250_000;
      my @fields = qw [chr start end symbol strand];
      my %families;
      my @structs = \my (%chr, %start, %end, %symbol, %strand);

      foreach my $key (1 .. $size) {
          $families{$key}{$_} = undef for @fields;
          $$_{$key} = undef for @structs;
      }

      my $s1 = total_size \%families;
      my $s2 = 0;
      $s2 += total_size $_ for @structs;

      printf "Big hash: %d Mb\n", $s1 / (1024 * 1024);
      printf "More hashes: %d Mb\n", $s2 / (1024 * 1024);
      printf "Savings: %.0f%%\n", 100 * ($s1 - $s2) / $s1;
      __END__
      Big hash: 71 Mb
      More hashes: 61 Mb
      Savings: 14%
      Grouping the data in a hash causes you to have 250001 hashes (one outer hash plus one per key), instead of just 5. Which carries a 10 Mb penalty.

        Interesting. Using Perl 5.8.8 rather than 5.10 makes the same saving for 'Big hash':

        Big hash: 61 Mb
        More hashes: 52 Mb
        Savings: 16%

        so should we all use 5.8.8 rather than 5.10? Actually 14 (or 16) % is sufficiently small to be irrelevant for most purposes. Clarity of code is the more important factor (not that I'm claiming that my code is any clearer mind you), especially for a first cut - optimize later if you need to.


        Perl reduces RSI - it saves typing
Re: If Statements and Regular Expressions
by pjotrik (Friar) on Sep 30, 2008 at 22:50 UTC
    You're definitely not matching what you want to...

    In the line if ($Gene_Symbol =~ /Hox/) {, $Gene_Symbol still has the value it had in the last round of the while loop; it is in no way affected by the foreach loop. What you probably want is to match against the value that you stored in the %symbols hash for every key. That is,

    if ($symbols{$key} =~ /Hox/) {
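    To make the point concrete, here is a minimal sketch of the corrected loop logic. The IDs and the hox_ids helper are made up for illustration; only the %symbols name comes from the original script.

```perl
use strict;
use warnings;

# Hypothetical sample data: gene IDs mapped to gene symbols.
my %symbols = (
    ID_A => 'Hoxa1',
    ID_B => 'Pax6',
    ID_C => 'Hoxb4',
);

# Return the sorted IDs whose *stored* symbol matches /Hox/.
# The test looks up the value for each key instead of reusing a
# scalar left over from an earlier loop.
sub hox_ids {
    my ($symbols) = @_;
    my @ids = grep { $symbols->{$_} =~ /Hox/ } keys %$symbols;
    return sort @ids;
}

print join(', ', hox_ids(\%symbols)), "\n";   # prints "ID_A, ID_C"
```

    In the original script the condition tested the same leftover $Gene_Symbol on every pass, so the loop printed either every record or none, depending on the last line of the input.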
Re: If Statements and Regular Expressions
by toolic (Bishop) on Sep 30, 2008 at 23:33 UTC
    If I'm not mistaken, you are trying to print out lines in which the "symbol" starts with the string "Hox". If that is the case, I see no need for building up a hash at all; just process the lines as you read them in. If you really want to re-order the columns, then this might be a simpler way to do things (UNTESTED):
    use strict;
    use warnings;

    open INPUT,"<MOUSE_TF1.txt" or die "can not open MOUSE_TF1.txt: $!";
    open OUTPUT,">HOX_GENE_TF.txt" or die "can not open HOX_GENE_TF.txt: $!";

    print OUTPUT "MOUSE TRANSCRIPTION FACTORS IN THE HOX GENE FAMILY\n\n\n";
    print OUTPUT "ENSEMBL GENE ID \tSYMBOL \tCHR\tSTART\t\tEND\t\tSTRAND\n";

    while (<INPUT>){
        chomp;
        my ($id, $cname, $start, $end, $strand, $sym) = split /\t/;
        # select Gene Symbols belonging to "Hox" family and print
        if ($sym =~ /^Hox/) {
            # write to the output file, with the symbol moved to column 2
            print OUTPUT join("\t", $id, $sym, $cname, $start, $end, $strand), "\n"
        }
    }
    close INPUT;
    close OUTPUT;

    If you are willing to preserve the column order from the input, this simplifies even further to:

    use strict;
    use warnings;

    open INPUT,"<MOUSE_TF1.txt" or die "can not open MOUSE_TF1.txt: $!";
    open OUTPUT,">HOX_GENE_TF.txt" or die "can not open HOX_GENE_TF.txt: $!";

    print OUTPUT "MOUSE TRANSCRIPTION FACTORS IN THE HOX GENE FAMILY\n\n\n";
    print OUTPUT "ENSEMBL GENE ID \tCHR\tSTART\t\tEND\t\tSTRAND\tSYMBOL\n";

    while (<INPUT>){
        chomp;
        my @items = split /\t/;
        # select Gene Symbols belonging to "Hox" family and print
        if ($items[5] =~ /^Hox/) {
            # $_ still holds the whole (chomped) input line
            print OUTPUT "$_\n"
        }
    }
    close INPUT;
    close OUTPUT;
Re: If Statements and Regular Expressions
by JadeNB (Chaplain) on Sep 30, 2008 at 23:12 UTC
    In addition to the very useful replies above, note that you are matching the string Hox any time it occurs in the word. If you want it only at the beginning, then you need the regex /^Hox/.
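    A small sketch of the difference, assuming a few illustrative symbols ("PreHox1" is invented for the example, and the two helper subs are hypothetical):

```perl
use strict;
use warnings;

# Unanchored: true if "Hox" appears anywhere in the string.
sub contains_hox    { my ($s) = @_; return $s =~ /Hox/  ? 1 : 0 }
# Anchored: true only if the string begins with "Hox".
sub starts_with_hox { my ($s) = @_; return $s =~ /^Hox/ ? 1 : 0 }

for my $symbol ('Hoxa1', 'PreHox1', 'Shox2') {
    printf "%-8s contains=%d starts=%d\n",
        $symbol, contains_hox($symbol), starts_with_hox($symbol);
}
```

    Both patterns are case-sensitive, so a symbol such as "Shox2" matches neither; adding the /i modifier would change that.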
Re: If Statements and Regular Expressions
by TempAcolyte (Initiate) on Oct 01, 2008 at 01:59 UTC
    Thank you *all* for your help and insights. Reading through your comments and code critiques (and the article link) was just what I needed tonight. I tightened up my code per your suggestions, changed:
    if ($Gene_Symbol =~ /Hox/) {
    to
    if ($symbols{$key} =~ /^Hox/) {
    and it worked like a charm. Now I not only have a script that works, but I also learned a few new approaches to try as well. (I am working with very large datasets, so I think I'll stick with five hashes.) Thanks again wise monks!! -- TempAcolyte
Re: If Statements and Regular Expressions
by JavaFan (Canon) on Sep 30, 2008 at 23:48 UTC
    Considering that what you basically do is finding lines that start with 'Hox', why don't you just do:
    $ grep ^Hox MOUSE_TF1.txt > HOX_GENE_TF.txt
    If the reordering of the columns is essential, you can always pipe the output of grep through col.
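    One caveat: a pattern anchored at the start of the line only works if the symbol is the first column. In the original script the symbol is read from the sixth tab-separated field ($Gene_Info[5]), so the anchor would test the gene ID instead. A rough sketch of a field-based filter that also does the reordering, assuming that column layout (the hox_rows name and the sample rows are made up):

```perl
use strict;
use warnings;

# Keep rows whose sixth tab-separated field (the gene symbol) starts
# with "Hox", and return them with the symbol moved to column 2.
sub hox_rows {
    my @lines = @_;
    my @out;
    for my $line (@lines) {
        chomp $line;
        my ($id, $chr, $start, $end, $strand, $sym) = split /\t/, $line;
        # Skip short or malformed rows rather than warn on undef.
        next unless defined $sym && $sym =~ /^Hox/;
        push @out, join "\t", $id, $sym, $chr, $start, $end, $strand;
    }
    return @out;
}

# Made-up sample rows in the assumed input column order.
my @sample = (
    "ID_A\t6\t100\t200\t1\tHoxa1\n",
    "ID_B\t2\t300\t400\t-1\tPax6\n",
);
print "$_\n" for hox_rows(@sample);
```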