sandrider has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I need your wisdom again.

Question 1

open INFILE, "$ARGV[0]";
@aTryptic = <INFILE>;
close INFILE;
shift @aTryptic;
for ($i=0;$i<@aTryptic;$i++) {
    $accSeq = SplitFields($aTryptic[$i], 'split');
    $aTryptic[$i] =~ s/\r\n$/\n/g;
    chomp $aTryptic[$i];
    $hTryptic{$accSeq} = $aTryptic[$i];
    $aTryptic[$i] = $accSeq;
}

open INFILE, "$ARGV[1]";
@aSemiTryptic = <INFILE>;
close INFILE;
shift @aSemiTryptic;
for ($i=0;$i<@aSemiTryptic;$i++) {
    $accSeq = SplitFields($aSemiTryptic[$i], 'split');
    $aSemiTryptic[$i] =~ s/\r\n$/\n/g;
    chomp $aSemiTryptic[$i];
    $hSemiTryptic{$accSeq} = $aSemiTryptic[$i];
    $aSemiTryptic[$i] = $accSeq;
}

What it does is open a file, read it into an array, then replace each element with some part of the whole line and also create a hash.

The question is, how can I make it into a sub so that I don't duplicate the code?

Question 2
I have 2 files of data (strings of data separated by tabs, e.g. name\taddress\tetc). I need to compare the 2, so I create 2 arrays and 2 hashes. The arrays contain just the field that I want to compare, e.g. name, and the hashes are created with the name as key and the rest as the value.

I compare the 2 arrays and if there is a common name I get the value from one hash and print it into a file; if there is none, I get the value from the other hash and print it into the same file.

foreach $one (@arrayA) {
    chomp $one;
    $found = 0;
    foreach $two (@arrayB) {
        chomp $two;
        if ($one eq $two) {
            print "$hashB{$two}";
            $found = 1;
            last;
        }
    }
    if ($found == 0) {
        print "$hashA{$one}";
    }
}

Is this the best way to do it?

Thanks.

Desmond

Re: Can I make a sub for this?
by Zaxo (Archbishop) on Aug 22, 2005 at 11:22 UTC

    To your first question, you can avoid slurping the file into an array by defining a sub to process the file one line at a time. I don't understand your SplitFields sub or its arguments, so I'll just assume pipe delimiters for the sake of argument.

    sub parse_file {
        my $filename = shift;
        my %hash;
        open my $fh, '<', $filename or die $!;
        local $_;
        while (<$fh>) {
            (my $key, $_) = split /\|/, $_, 2;
            s/\r\n$/\n/;
            $hash{$key} = $_;
        }
        return \%hash;
    }

    my $hTryptic = parse_file $ARGV[0];
    my $hSemiTryptic = parse_file $ARGV[1];

    I had the sub return a reference to save the unnecessary copying of the hash's contents.
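
    A minimal sketch of working with a returned hash reference, assuming a hash shaped like the one parse_file builds. The keys and values below are made-up sample data, not from the original files.

```perl
use strict;
use warnings;

# Hypothetical stand-in for what parse_file returns: a reference to a
# hash mapping each key to the rest of its line.
my $hTryptic = {
    P12345 => "PEPTIDEK\t1\n",
    Q67890 => "SAMPLER\t2\n",
};

# Look up a single record through the reference.
my $record = $hTryptic->{P12345};

# Dereference the whole hash to walk over its keys.
my @keys = sort keys %$hTryptic;

print "$_ => $hTryptic->{$_}" for @keys;
```

    The arrow syntax reads one entry, while %$hTryptic gives access to the whole hash without copying it.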

    For question 2, I think you can do better with hashes. Since you have arrays of keys for each, you can delete the keys of B from a copy of hash A, and the keys of A from hash B. Each deletion will return the values from the keys. If both hash copies turn up empty after that, they had identical key sets. Having the same keys for each, you can check values with a single scan over the key list.
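
    The deletion idea can be sketched with hash slices. This is one possible reading of the suggestion, with invented sample data; the names %only_a and %only_b are illustrative only.

```perl
use strict;
use warnings;

# Made-up sample data: name => full record.
my %hash_a = (alice => "alice\tA-street", bob => "bob\tB-street");
my %hash_b = (bob   => "bob\tB-road",     carol => "carol\tC-road");

# Work on copies so the originals stay intact.
my %only_a = %hash_a;
my %only_b = %hash_b;

# delete on a hash slice returns the removed values, with undef for
# keys that were not present, so filter those out.
my @a_common = grep { defined } delete @only_a{ keys %hash_b };
my @b_common = grep { defined } delete @only_b{ keys %hash_a };

# If both copies end up empty, the two hashes had identical key sets.
my $same_keys = !%only_a && !%only_b;

print "common in A: @a_common\n";
print "only in A: @{[ sort keys %only_a ]}\n";
print "only in B: @{[ sort keys %only_b ]}\n";
```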

    After Compline,
    Zaxo

Re: Can I make a sub for this?
by tilly (Archbishop) on Aug 22, 2005 at 17:00 UTC
    Random style advice.

    As perlstyle says: always check the return codes of system calls. Good error messages should go to STDERR, include which program caused the problem, what the failed system call and arguments were, and (VERY IMPORTANT) should contain the standard system error message for what went wrong. So, for instance, where you write:

    open INFILE, "$ARGV[0]";
    you should write something like this:
    open INFILE, "<", $ARGV[0] or die "Can't read '$ARGV[0]': $!";
    And now if there is a problem opening that file, you'll get useful information.
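
    As a sketch of what such a message can look like in practice, the helper below (open_for_reading, a hypothetical name, not from the original script) also includes the program name via $0 and uses a lexical filehandle. The path is assumed not to exist so that the failure branch runs.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading, or die with a message
# naming the program ($0), the file, and the system error ($!).
sub open_for_reading {
    my ($filename) = @_;
    open my $fh, '<', $filename
        or die "$0: can't open '$filename' for reading: $!\n";
    return $fh;
}

# Trap the failure with eval so the message can be inspected.
# The path below is assumed not to exist.
my $fh    = eval { open_for_reading('/no/such/dir/no-such-file.txt') };
my $error = $@;
print STDERR $error unless $fh;
```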

    Secondly you should familiarize yourself with strict.pm and apply what it says. That will catch a lot of typos in your code.
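
    A small illustration of the kind of typo strict catches. The snippet is compiled through a string eval only so that the error can be shown without aborting the script; the misspelled $acSeq is deliberate.

```perl
use strict;
use warnings;

# Compile a snippet containing a misspelled variable name.
my $ok = eval q{
    use strict;
    my $accSeq = 'P12345';
    my $copy   = $acSeq;   # typo: $acSeq was never declared
    1;
};
my $error = $@;

# Under strict the snippet fails to compile instead of silently
# using an empty global variable.
print "strict caught: $error" unless $ok;
```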

    Thirdly you're using C-style for loops. Don't. Use Perlish for loops instead.

    # I'd use a better variable than $line, but I don't know
    # what your purpose is, and your Tryptic names are quite
    # cryptic for me.
    for my $line (@aTryptic) {
        my $accSeq = SplitFields($line, 'split');
        $line =~ s/\r?\n\z//g;
        $hTryptic{$accSeq} = $line;
        $line = $accSeq;
    }

    This eliminates the possibility of off-by-one errors, is more efficient, and reduces possible typos.
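
    The loop relies on the loop variable being an alias for the array element, so assigning to it changes the array in place. A self-contained sketch with made-up data follows; since SplitFields is not shown in the thread, the key is assumed here to be the first tab-separated field.

```perl
use strict;
use warnings;

# Made-up sample lines, one with a DOS line ending.
my @lines = ("P1\tfoo\r\n", "P2\tbar\n");
my %record_for;

for my $line (@lines) {
    $line =~ s/\r?\n\z//;            # strip either kind of line ending
    my ($key) = split /\t/, $line;   # assumption: key is the first field
    $record_for{$key} = $line;       # keep the full record
    $line = $key;                    # alias: this overwrites the array element
}

print "array now holds: @lines\n";
```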

    Fourth, in question 2, either make all data chomped or all data unchomped. Having to call chomp on data randomly before comparing them is a red flag.

    A more minor nit. I prefer using _ in variable names rather than camelCaseCapitalization.

    Oh, and about question 2, rather than repeatedly scan an array, arrange to use a hash lookup.

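    my %in_array_b;
    @in_array_b{@array_b} = ();
    foreach my $one (@array_a) {
        # the slice assignment stores undef values, so test with exists;
        # as in your loop, print B's record on a match, otherwise A's
        print exists $in_array_b{$one} ? $hash_b{$one} : $hash_a{$one};
    }
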
Re: Can I make a sub for this?
by eric256 (Parson) on Aug 22, 2005 at 20:08 UTC

    For Q2:

    As long as you already have two hashes, each keyed by what you would like to match for duplicates, let Perl do the work for you.

    my $finalHash = { %$hashA, %$hashB };

    That dumps $hashA into the finalHash, then dumps $hashB in. Anywhere the keys overlap, B will win; otherwise you get all entries from both hashes.
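
    A short sketch of the merge with invented sample data, assuming $hashA and $hashB are hash references as in the line above:

```perl
use strict;
use warnings;

# Made-up sample data; values are tagged with their source hash.
my $hashA = { alice => 'A:alice', bob => 'A:bob' };
my $hashB = { bob   => 'B:bob',   carol => 'B:carol' };

# Later pairs overwrite earlier ones, so B wins on shared keys.
my $finalHash = { %$hashA, %$hashB };

for my $key (sort keys %$finalHash) {
    print "$key => $finalHash->{$key}\n";
}
# prints:
# alice => A:alice
# bob => B:bob
# carol => B:carol
```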


    ___________
    Eric Hodges

      Hi Eric,

      Thanks for the tip. The script I have takes hashA and finds those that are in hashB. Let me write it in pseudocode:

      if key{hashA} eq key{hashB} {
          print FH value{hashA}
      }
      if not found then {
          print FH value{hashB}
      }

      which means I only want those that match from hashA and those that don't match from hashB, so some of hashA will not be in the printout. Can it be done by a method similar to your suggestion?

      Thanks.

      Desmond