Re: How can I improve my inefficient codes?

Some of my comments will be based on code improvements with no speed gain, others may address your speed concern.

open (FILE1, "Mock3_SNP.txt"); Use the three-arg form of open, use a lexical, check errors: open my $input, '<', 'Mock3_SNP.txt' or die "Can't read Mock3_SNP.txt: $!"; - even better is to have the name of the input in a variable that could be passed in on the commandline in the future, but I digress.
```
while (<FILE1>) {
    chomp;
    my $input_orig = $_;
[download]
```
You're reading the file, then copying it to a new variable? And then doing nothing with the new variable? Using a variable other than $_ is good, but no point in duplicating everything around:
```
while (defined (my $input_line = <$input>)) {
  chomp $input_line;
[download]
```
my @line = split /\s+/, $_; Generally speaking, if you're splitting on spaces, you really should use ' ' instead of /\s+/. It Does The Right Thing more often. It is not, however, exactly the same. See the documentation on split for details. (IIRC, this also means you can get rid of the chomp as the split will implicitly do so.)
```
my @line = split ' ', $input_line;
[download]
```
my $chr = "chr$line[1]"; That's an interesting trick. However, if chr values are never very large, i.e., over say 10,000, we could take moritz' suggestion and turn it on its head. Instead of putting this in a hash, put it in an array based on the chr number. Instead of my %input;, use my @input;. This sparse array will take up a fair bit of space, but should be faster for lookups than a hash. However, if the size gets too big then we get the problem of running out of RAM, starting a swap, and that will definitely be slow. However, if I read your node properly, the numbers will be 1-36, so definitely no problem whatsoever.

    $input{$name}[0] = $name;
    $input{$name}[1] = $chr;
    $input{$name}[2] = $pos;
[download]

This can be done much simpler:

$input{$name} = [ $name, $chr, $pos ];
[download]

(though why you need to duplicate $name, I don't know.) Of course, I just suggested turning it on its head and moritz pointed out collisions, so we get:

push @{$input[$chr]} => [ $name, $pos ];
[download]

Mind you, I generally prefer named values, so I'd go with:

push @{$input[$chr]} => { name => $name, pos => $pos };
[download]

```
open (OUT, "> test.txt"); ### Change if file name changes
[download]
```
Again, three-arg, lexical filehandle, check if open failed. However, the comment here really drives home the point I made with the previous open: put it in a variable at the top, which will make it easier for someone to change, but also make it easier to convert into a command line parameter later.
```
### At the top:
my $output_file = 'test.txt'; ### Change if the file name changes

### the open is then:
open my $out, '>', $output_file or die "Can't open $output_file: $!";
[download]
```
A bunch of repeat comments since you're handling things about the same as before. And then to the part moritz commented on:
```
    foreach my $name(keys %input) {
        if ($input{$name}[1] eq $chr) {
            if ($input{$name}[2] > $pos-1 && $input{$name}[2] < $pos+3
+6) {
[download]
```
This is now just a simple lookup. First, we have to convert $chr to just the number, and then see if there is one or more input entry that matches.
```
$chr =~ s/^chr//; # remove chr if it's there
### or, if chr must be there:
# substr $chr, 0, 3, ''; # removes first three characters
if (defined $input[$chr]) {
  for my $i (@{$input[$chr]}) {
    if ($pos <= $i->{pos} && $i->{pos} <= $pos+35) {
[download]
```
If you use the array of arrays instead of the array of hashes that I prefer, that'd be $i->[1] instead of $i->{pos}. (I think the AoA will be faster, but I'd suggest benchmarking it to see if it's significant. Readability is still nice.)

                if ($input{$name}[2] == $pos) {
                    my $fSNP = substr($seq, 0, 1);
                    print OUT "$fSNP\n";    
                }
                if ($input{$name}[2] == $pos+35) {
                    my $fSNP = substr($seq, 35,1);
                    print OUT "$fSNP\n";    
                }
                if ($input{$name}[2] != $pos+35 && $input{$name}[2] !=
+ $pos) {
                    my $SNP = substr($seq, 0, $input{$name}[2]-($pos-1
+));
                    my $fSNP = substr($SNP, $input{$name}[2]-$pos);   
+     
                    print OUT "$fSNP\n";    
                }
[download]

A few points. First, this should be if/elsif/else since all the if blocks are mutually exclusive. That would cut down on the comparisons. Second, if you're using perl 5.10+, using given/when might be better since it's all the same item you're checking against, $input{$name}[2] (or now $i->{pos} or $i->[1]). Third, it looks like the last one handles all the cases, but somewhat poorly.

my $fSNP = substr($seq, $i->{pos} - $pos, 1);
print $out "$fSNP\n";
[download]

Untested, and I may be misreading this, but if I have read it properly, this should speed things up since you only need to copy one character once rather than a bunch of characters and then one character.

Hope that helps

Comment on Re: How can I improve my inefficient codes? Select or Download Code