in reply to If Statements and Regular Expressions

There is a pile of stuff that can be tidied up there. First off, always use strictures (use strict; use warnings;).
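
For instance, a minimal sketch of what the two pragmas buy you (hypothetical variable names):

use strict;
use warnings;

my $count = 10;
# print $cuont;     # typo: under strict this is a compile-time error
                    # ("Global symbol "$cuont" requires explicit package name")
my $total;
print $total + 1;   # warnings flags this at run time: "Use of uninitialized value"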

Use chomp instead of chop.
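
A quick illustration of the difference (illustrative data): chop blindly removes the last character, while chomp only removes a trailing newline:

my $line = "Hoxa1\tchr6\n";
chomp $line;               # "Hoxa1\tchr6" - newline gone, data intact

my $last = "Hoxa1\tchr6";  # e.g. the last line of a file with no trailing newline
chop $last;                # "Hoxa1\tchr" - a real character has been lost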

Use the three-parameter version of open and check the result. Use lexical file handles:

open my $inFile, '<', "MOUSE_TF1.txt" or die "Failed to open MOUSE_TF1.txt: $!";

Instead of using "parallel" data structures that have to be handled piecemeal, group common data using a hash:

use warnings;
use strict;

open my $inFile, '<', "MOUSE_TF1.txt" or die "Failed to open MOUSE_TF1.txt: $!";

my %families;
my @fields = qw(chr start end symbol strand);

while (<$inFile>) {
    my $line = $_;
    chomp ($line);
    my @Gene_Info = split "\t", $line;
    my $id = shift @Gene_Info;
    @{$families{$id}}{@fields} = @Gene_Info;
}
close $inFile;

# select Gene Symbols belonging to "Hox" family and print
foreach my $key (keys %families) {
    if ($key =~ /Hox/) {
        print join ("\t", $key, @{$families{$key}}{@fields}), "\n";
    }
}

untested


Perl reduces RSI - it saves typing

Re^2: If Statements and Regular Expressions
by JavaFan (Canon) on Oct 01, 2008 at 00:12 UTC
    Instead of using "parallel" data structures that have to be handled piecemeal, group common data using a hash

    But that uses more memory, which can be significant if you have a lot of data.

    #!/usr/bin/perl
    use 5.010;
    use strict;
    use warnings;

    use Devel::Size qw [total_size];

    my $size = 250_000;
    my @fields = qw [chr start end symbol strand];
    my %families;
    my @structs = \my (%chr, %start, %end, %symbol, %strand);

    foreach my $key (1 .. $size) {
        $families{$key}{$_} = undef for @fields;
        $$_{$key} = undef for @structs;
    }

    my $s1 = total_size \%families;
    my $s2 = 0;
    $s2 += total_size $_ for @structs;

    printf "Big hash: %d Mb\n", $s1 / (1024 * 1024);
    printf "More hashes: %d Mb\n", $s2 / (1024 * 1024);
    printf "Savings: %.0f%%\n", 100 * ($s1 - $s2) / $s1;
    __END__
    Big hash: 71 Mb
    More hashes: 61 Mb
    Savings: 14%
    Grouping the data in a hash means you have 250,001 hashes (one per key, plus the top-level hash) instead of just 5, which carries a 10 Mb penalty.

      Interesting. Using Perl 5.8.8 rather than 5.10 makes the same saving for 'Big hash':

      Big hash: 61 Mb
      More hashes: 52 Mb
      Savings: 16%

      So should we all use 5.8.8 rather than 5.10? Actually, 14 (or 16)% is small enough to be irrelevant for most purposes. Clarity of code is the more important factor (not that I'm claiming my code is any clearer, mind you), especially for a first cut - optimize later if you need to.


      Perl reduces RSI - it saves typing
        Whether the code becomes clearer when using a big hash is subjective. In your example you got away with it because you didn't have to output the fields in a different order from the one you defined, but the original code doesn't have that luxury, as the output has the fields in a different order than the input.
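
        For example (the field names are from the code above, the output order is hypothetical), printing the fields in a different order from the input means spelling them out as a literal-key slice:

        # input column order: chr start end symbol strand
        # hypothetical output order: symbol first, then the coordinates
        print join ("\t", $key, @{$families{$key}}{qw(symbol chr start end strand)}), "\n";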

        Which means you either have to use string literals as key names or put the field names into separate variables. The former loses you some of the benefits of 'use strict', as misspellings in field names aren't found by the compiler (and not even at run time - 'use warnings' won't help you either). In the latter case, you're just shifting around what you were trying to avoid. There's little difference between:

        my %big_hash;
        my $field_name1 = "field_name1";
        my $field_name2 = "field_name2";
        ...
        $big_hash{$key}{$field_name1}
        $big_hash{$key}{$field_name2}
        ...
        or
        my %field_name1;
        my %field_name2;
        ...
        $field_name1{$key}
        $field_name2{$key}
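        To make the misspelling point concrete (a minimal sketch, with a hypothetical field name and key): a typo in a hash key silently autovivifies a new entry, while a typo in a hash variable name is caught at compile time under strict:

        use strict;
        use warnings;

        my %big_hash;
        my %symbol;                        # the parallel-hash alternative
        my $key = "Hoxa1";                 # hypothetical key

        $big_hash{$key}{symbl} = "chr6";   # typo in the field name: no error, no warning, just a stray entry
        # $symbl{$key} = "chr6";           # typo in the hash name: compile-time error under strict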
        I don't want to make a dogma of "always group in a big hash instead of using parallel data structures", nor of the opposite. There are pros and cons to both.