mav3067 has asked for the wisdom of the Perl Monks concerning the following question:

Hey everyone, I am trying to rename a very large dataset of approx. 260,000 files. The script below, which is possibly quite amateurish as I am no Perl expert, performs the required task but is slow. I was wondering if anyone had any suggestions on how to increase its performance.

The reason to rename one file like another is that they are matching FASTA and QUALITY files for genetic sequencing reactions, and it is preferable that they have the same name under different directories. The FASTA files were renamed using information contained inside the files themselves, but this is not possible with the quality files.

The usage of the script is:

./qual_renamer.pl fasta ./QUAL qual

Any ideas on how to increase the performance would be appreciated.

thanks, wayne
#! /usr/bin/perl -w
use strict;
use Data::Dumper;

my $fasta_file = $ARGV[0];
my $directory = $ARGV[1];
my $qual_file = $ARGV[2];

sub make_file_hash {
    my ($file) = @_;
    my %file_hash;
    open(FH, "< $file");
    while (my $line = <FH>) {
        chomp($line);
        my @parts = split(/\//,$line);
        my $name = $parts[3];
        if ($name =~ /(^.{3})([0-9]{4})(F|P|R)/) {
            my $key = $2 . "." . lc($3);
            $file_hash{$key} = $name;
        }
        elsif ($name =~ /(^.{2})([0-9]{8})(F|P|R)/) {
            my $key = $2 . "." . lc($3);
            $file_hash{$key} = $name;
        }
        else {
            print $name . " is crap\n";
        }
    }
    return %file_hash;
}

sub rename {
    my ($file, $hash_ref, $dir) = @_;
    my %hash = %$hash_ref;
    #print Dumper(%hash);
    if ($file =~ /([0-9]{4,8})(.f|r|p)(.*)/) {
        my $key = $1 . $2;
        if (exists $hash{$key}) {
            my $filename = $hash{$key};
            `mv $file $dir/$filename`;
        }
        else {
            print "no key $key\n";
        }
    }
}

my %tmp_hash = &make_file_hash($fasta_file);
open(FH2, "< $qual_file");
while( my $file = <FH2>) {
    chomp($file);
    &rename($file, \%tmp_hash, $directory)
}

./FASTA/LO-leaves_drought/LO45829460F
./FASTA/LO-leaves_drought/LO45815650F
./FASTA/LO-leaves_drought/LO45852136R
./FASTA/LO-leaves_drought/LO45987989F
./FASTA/LO-leaves_drought/LO45959830F
./FASTA/LO-leaves_drought/LO45982398F
./FASTA/LO-leaves_drought/LO45990585F
./FASTA/LO-leaves_drought/LO46021000F
./FASTA/LO-leaves_drought/LO45815528R
./FASTA/LO-leaves_drought/LO45994910F
./FASTA/LO-leaves_drought/LO45925816F
./FASTA/LO-leaves_drought/LO45938935F
./FASTA/LO-leaves_drought/LO45782339F
./FASTA/LO-leaves_drought/LO46006257F
./FASTA/LO-leaves_drought/LO46018733F
./FASTA/LO-leaves_drought/LO45815795F
./FASTA/LO-leaves_drought/LO46006601F
./FASTA/LO-leaves_drought/LO45893249F
./FASTA/LO-leaves_drought/LO45953120F
./FASTA/LO-leaves_drought/LO46053350F
./FASTA/LO-leaves_drought/LO45978413F
./FASTA/LO-leaves_drought/LO45866607F
./FASTA/LO-leaves_drought/LO46017397F

./QUAL/48743192.f_a09_1.fasta.qual
./QUAL/51455741.f_b20_1.fasta.qual
./QUAL/42595949.f_n02_1.fasta.qual
./QUAL/51293480.f_g04_2.fasta.qual
./QUAL/42188856.f_h02_2.fasta.qual
./QUAL/51477163.f_m12_2.fasta.qual
./QUAL/42219670.f_d02_1.fasta.qual
./QUAL/46911125.f_p06_1.fasta.qual
./QUAL/44656731.f_c24_1.fasta.qual
./QUAL/BNP3104.p.scf.qual
./QUAL/42063951.f_p22_1.fasta.qual
./QUAL/42939137.f_j20_1.fasta.qual
./QUAL/42824374.f_k16_2.fasta.qual
./QUAL/49426321.f_h08_2.fasta.qual
./QUAL/48869367.r_k07_2.fasta.qual
./QUAL/48637192.f_h15_1.fasta.qual
./QUAL/45574303.f_d17_1.fasta.qual
./QUAL/46189823.f_n05_2.fasta.qual
./QUAL/BNP0977.p.scf.qual
./QUAL/49408758.f_i10_1.fasta.qual
./QUAL/50040300.f_a20_1.fasta.qual
./QUAL/48690145.f_l05_2.fasta.qual
./QUAL/46242796.f_g23_1.fasta.qual

Re: performance problems with renaming multiple files using another file name
by chromatic (Archbishop) on Jul 10, 2006 at 22:04 UTC

    Offhand, I can think of a few small ideas. I am not sure if there's an algorithmic improvement, but there are definitely some possibilities for improvement:

    `mv $file $dir/$filename`;

    Launching a new process to move a file is silly, especially with backticks as you ignore the results. Use rename instead.

    my %hash = %$hash_ref;

    This could be expensive if there are a lot of entries in the hash. Just dereference the elements you want anyway: exists $hash_ref->{$key}, for example.

    /([0-9]{4,8})(.f|r|p)(.*)/

    You don't use the third capture (and it's a greedy, lots-of-backtracking operation), so you can remove it altogether.

    &make_file_hash($fasta_file);

    This won't have any effect on performance to my knowledge, but drop the leading &. It's unnecessary here.
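
Put together, the suggestions above might look something like the sketch below. The sub name move_qual_file is hypothetical (chosen so it does not shadow the built-in rename), and the pattern \.[frp] is only a guess at what (.f|r|p) was meant to match. The demonstration at the bottom runs in a temporary directory with one empty file named after a sample from the question.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical name: move_qual_file stands in for the poster's sub "rename".
sub move_qual_file {
    my ($file, $hash_ref, $dir) = @_;

    # Two captures only; the unused trailing (.*) is gone.
    # Assumption: a key is 4 to 8 digits, a literal dot, then f, r or p.
    return 0 unless $file =~ /([0-9]{4,8})(\.[frp])/;
    my $key = $1 . $2;

    # Look the key up through the reference instead of copying the hash.
    unless (exists $hash_ref->{$key}) {
        print "no key $key\n";
        return 0;
    }

    # Built-in rename: no shell and no child process per file.
    my $target = "$dir/$hash_ref->{$key}";
    rename $file, $target
        or die "Cannot rename '$file' to '$target': $!";
    return 1;
}

# Demonstration in a throwaway directory.
my $tmp = tempdir(CLEANUP => 1);
mkdir "$tmp/QUAL" or die "Cannot create '$tmp/QUAL': $!";
mkdir "$tmp/OUT"  or die "Cannot create '$tmp/OUT': $!";

my $qual = "$tmp/QUAL/48743192.f_a09_1.fasta.qual";
open my $fh, '>', $qual or die "Cannot create '$qual': $!";
close $fh or die "Cannot close '$qual': $!";

my %names = ('48743192.f' => 'LO48743192F');
my $moved = move_qual_file($qual, \%names, "$tmp/OUT");
print $moved ? "moved\n" : "not moved\n";
```

Returning 0 or 1 instead of printing only is a design choice of this sketch, so the caller can count how many files were actually moved.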

Re: performance problems with renaming multiple files using another file name
by GrandFather (Saint) on Jul 10, 2006 at 22:02 UTC

    For a start, change the name of your rename sub so it doesn't clash with Perl's rename function. Then you may be able to use rename $file, "$dir/$filename" rather than `mv $file $dir/$filename` (perhaps depending on the OS you are using) to "rename" the file to a different directory. It's probably important that the target directory exists and is on the same drive.

    There may be other minor speed-ups, but if the rename does the trick it will be the largest improvement you can make by a long way.


    DWIM is Perl's answer to Gödel
Re: performance problems with renaming multiple files using another file name
by jwkrahn (Abbot) on Jul 10, 2006 at 23:30 UTC
    Your code can be simplified to:
    my %file_hash;

    open my $fh, '<', $fasta_file or die "Cannot open '$fasta_file' $!";
    while ( <$fh> ) {
        chomp;    # keep the trailing newline out of the stored file name
        my ( $name ) = m!/([^/]+)$!;
        $name =~ m!((?<=^.{3})\d{4}|(?<=^.{2})\d{8})([FPR])$! or do {
            print "$name is crap\n";
            next;
        };
        $file_hash{ "$1.\L$2" } = $name;
    }
    close $fh;

    open my $qh, '<', $qual_file or die "Cannot open '$qual_file' $!";
    while ( my $file = <$qh> ) {
        chomp $file;
        next unless $file =~ m!/((?:\d{4}|\d{8})\.[frp])!;
        if ( exists $file_hash{ $1 } ) {
            # $directory is the target directory from the original script
            rename $file, "$directory/$file_hash{$1}" or die "Cannot rename '$file' $!";
        }
        else {
            print "no key $1\n";
        }
    }
    Hopefully :) that may be faster, depending on the size of %file_hash of course.
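
One thing to watch, going by the sample listings in the question: some quality files carry a letter prefix (BNP3104.p.scf.qual, for example), and a pattern that wants digits directly after the slash will skip those. Below is a small sketch of key extraction that covers both name shapes. The sub names fasta_key and qual_key are hypothetical, and the pairing of a three-character prefix with a four-digit number (and a two-character prefix with an eight-digit number) is inferred from the regexes in the original script, not stated by the poster.

```perl
use strict;
use warnings;

# Hypothetical helper: key for a FASTA name such as LO45829460F.
# Assumption: 3 characters + 4 digits, or 2 characters + 8 digits,
# followed by a single F, P or R at the end of the name.
sub fasta_key {
    my ($name) = @_;
    if (   $name =~ /^.{3}(\d{4})([FPR])$/
        or $name =~ /^.{2}(\d{8})([FPR])$/ )
    {
        return $1 . '.' . lc($2);
    }
    return;
}

# Hypothetical helper: key for a quality-file path. Only the last path
# component is examined, and an optional non-digit prefix is allowed.
sub qual_key {
    my ($path) = @_;
    return $1
        if $path =~ m{(?:^|/)[^/\d]*((?:\d{8}|\d{4})\.[frp])[^/]*$};
    return;
}

# Names taken from the listings in the question.
for my $name (qw(LO45829460F LO45852136R)) {
    my $key = fasta_key($name);
    print "$name => ", defined $key ? $key : '(no key)', "\n";
}
for my $path (qw(./QUAL/48743192.f_a09_1.fasta.qual ./QUAL/BNP3104.p.scf.qual)) {
    my $key = qual_key($path);
    print "$path => ", defined $key ? $key : '(no key)', "\n";
}
```

Keeping key extraction in two small subs makes it easy to run both file lists through them first and check for unmatched names before any file is renamed.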
Re: performance problems with renaming multiple files using another file name
by ambrus (Abbot) on Jul 10, 2006 at 22:05 UTC

    It seems that you are running the external program mv for every rename. I'm not sure if that's the performance hog in your case, but I think it would be better if you used perl's built-in rename function instead, or, if you're moving to a different filesystem, the move function of the File::Copy module which is in the core of current perls.
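
A minimal sketch of the File::Copy route, for the case where the target may sit on another filesystem. According to the module's documentation, move renames the file when it can and otherwise copies it and deletes the original. The directory name QUAL_renamed and the target name are made up for this demonstration, which runs entirely inside a temporary directory.

```perl
use strict;
use warnings;
use File::Copy qw(move);
use File::Temp qw(tempdir);

# Throwaway working area so the demonstration touches no real data.
my $tmp        = tempdir(CLEANUP => 1);
my $source     = "$tmp/48743192.f_a09_1.fasta.qual";
my $target_dir = "$tmp/QUAL_renamed";    # made-up directory name
my $target     = "$target_dir/LO48743192F";

mkdir $target_dir or die "Cannot create '$target_dir': $!";

open my $fh, '>', $source or die "Cannot create '$source': $!";
print {$fh} "placeholder\n";
close $fh or die "Cannot close '$source': $!";

# move returns true on success and false on failure, with $! set.
move($source, $target)
    or die "Cannot move '$source' to '$target': $!";

print "moved to $target\n" if -e $target;
```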

    Update: to try out easily if that's the problem, it may work to comment out the backticks line and see how fast the script runs that way. Of course, it won't move the files then.
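
A rough way to run that experiment without editing the script back and forth is to make the move step optional and time the loop. This is only a sketch: process_list is a hypothetical helper, the file names are generated rather than real, and the pattern \.[frp] is an assumption about what the original regex was meant to match.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: walk a list of quality-file paths, extract the key,
# and call $mover (a code ref) for each match. Pass undef for a dry run.
# Returns the number of matching names and the elapsed seconds.
sub process_list {
    my ($files, $mover) = @_;
    my $start   = time;
    my $matched = 0;
    for my $file (@$files) {
        next unless $file =~ /([0-9]{4,8})(\.[frp])/;
        my $key = $1 . $2;
        $matched++;
        $mover->($file, $key) if $mover;
    }
    return ($matched, time - $start);
}

# Generated paths, shaped like the ones in the question (not real data).
my @files = map { "./QUAL/$_.f_a01_1.fasta.qual" } 40000000 .. 40009999;

# Dry run: no files are touched, only the matching is timed.
my ($count, $seconds) = process_list(\@files, undef);
printf "dry run: %d names matched in %.3f s\n", $count, $seconds;
```

If the dry run finishes almost instantly while the real run takes a long time, the per-file move is the likely bottleneck; passing a code ref that calls rename would then show how much the built-in saves.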