Re: Use Schwartzian transform across multiple files

Hi Sonya777!

I showed some working Schwartzian Transform (ST) code at Re: Use Schwartzian transform across multiple files. Since you are a beginner, I would certainly consider a more straightforward approach. I recommend that you master the basics before trying to use advanced techniques. I show "another way" for you below.

The sort routine selects pairs of things to "judge". The user-supplied function's job is to decide: less than, equal, or greater than. In the code below, there is a lot of "extra work" because a regex has to be run twice every time a new pair of "things" is selected for comparison. The ST is faster because it runs the regex only once per line and saves the results in an intermediate array before the actual sort is run.

However, you should consider that often this extra efficiency doesn't matter at all in the overall scheme of things. In fact, for small numbers of lines, the ST can actually be slower due to the overhead of creating the intermediate array and transforming it back to the original representation.
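To make the "compute once" idea concrete, here is a minimal sketch of the general ST shape. It is not the code from my other post: the sample lines, the helper name version_of and the counter $regex_runs are made up for illustration, and treating a line without a version as 0 is an assumption. The plain-sort script itself follows further down.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample lines; the real files may look different.
my @lines = (
    "itemC VerNumber:(30) gamma\n",
    "itemA VerNumber:(4) alpha\n",
    "itemB VerNumber:(12) beta\n",
);

my $regex_runs = 0;   # counts how often the sort key is extracted

# Hypothetical helper: pull the number out of "VerNumber:(123...".
sub version_of
{
   my ($line) = @_;
   $regex_runs++;
   my ($ver) = $line =~ /VerNumber:\((\d+)/i;
   return $ver // 0;   # assumption: lines without a version sort first
}

# Read bottom to top: the first map builds [line, key] pairs, sort
# compares only the precomputed keys, and the last map throws the
# keys away again to get back the original lines.
my @sorted = map  { $_->[0] }
             sort { $a->[1] <=> $b->[1] }
             map  { [ $_, version_of($_) ] }
             @lines;

print @sorted;
print "regex ran $regex_runs times for ", scalar(@lines), " lines\n";
```

The counter shows the point of the technique: the regex runs once per line (3 times here), no matter how many comparisons sort decides to make.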

How fast is "fast enough" depends upon the application. If you are sorting an array of, say, 80,000 elements, there probably will be a user-noticeable difference between algorithms. With 100-200 lines, probably not.

Once you get your code working, I encourage you to benchmark the code below vs my ST version. Make the comparison as "fair as possible". Also be aware that the second time you run the program, it will run faster because the files will be in the in-memory disk cache and that speeds things up a lot. But even so, you probably will learn something from doing a simple benchmark exercise. I don't know what OS you are using, but also be aware that on some OSes, Windows in particular, console I/O is an extremely "expensive" operation and takes a lot of execution time. I/O to report benchmark progress can consume so much time that it skews the results.

#!/usr/bin/perl
use strict;
use warnings;

if (!-d "sorted")
{
    mkdir "sorted" or die "unable to create dir sorted $!";
}

my @files2sort = <file*.txt>;  #just use glob to get names
my $curfilenum =1;

foreach my $file (@files2sort)
{
   open my $fh_in, '<', $file  or die "$file failed to open $!";
   open(my $fh_out, '>', "./sorted/$file.sort") or die "cannot create out $file.sort $!";
   
   print "Processing ".$curfilenum++." of ".@files2sort." $file\n";
   sortfile2($fh_in, $fh_out);

   close ($fh_out);
   close ($fh_in);
   print "OK: Sorted $file \n";
}


sub sortfile2
{
   my ($fh_in, $fh_out) = @_;
      
   my @lines = <$fh_in>;
      
   @lines = sort by_version @lines;  #can do a sort "in place"
                                     #separate @sorted var is not needed.

   print $fh_out @lines;

}

sub by_version
{
   my ($verA) = $a =~ /VerNumber:\((\d+)/i;
   my ($verB) = $b =~ /VerNumber:\((\d+)/i;
   $verA <=> $verB    #returns -1,0,+1
}

__END__
Processing 1 of 2 file1.txt
OK: Sorted file1.txt 
Processing 2 of 2 file2.txt
OK: Sorted file2.txt
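
The benchmark exercise suggested above could be sketched roughly like this, using the core Benchmark module. This is only one possible setup, not a definitive measurement: the test data is generated in memory (so neither the disk cache nor console I/O gets timed), the line format is made up, $n is an assumption you should adjust to your real file sizes, and sort_st is a generic ST rather than the exact code from my other post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use List::Util qw(shuffle);

# Made-up test data with unique version numbers in random order.
my $n     = 5_000;   # assumption: adjust to match your real files
my @lines = map { "item VerNumber:($_) text\n" } shuffle(1 .. $n);

# Same comparison routine as in the script above.
sub by_version
{
   my ($verA) = $a =~ /VerNumber:\((\d+)/i;
   my ($verB) = $b =~ /VerNumber:\((\d+)/i;
   $verA <=> $verB
}

# Plain sort: the regex runs twice for every comparison.
sub sort_plain
{
   my @sorted = sort by_version @_;
   return @sorted;
}

# ST: the regex runs once per line, then only numbers are compared.
sub sort_st
{
   my @sorted = map  { $_->[0] }
                sort { $a->[1] <=> $b->[1] }
                map  { [ $_, /VerNumber:\((\d+)/i ] }
                @_;
   return @sorted;
}

# Sanity check first: timing two routines that disagree is meaningless.
my @plain = sort_plain(@lines);
my @st    = sort_st(@lines);
die "results differ" unless join('', @plain) eq join('', @st);

# Run each candidate for at least 1 CPU second, then print a
# comparison table. Nothing is printed while the timing is running.
cmpthese(-1, {
   plain => sub { my @s = sort_plain(@lines) },
   st    => sub { my @s = sort_st(@lines) },
});
```

Try it with a few different values of $n (say 100, 5_000 and 80_000) to see where the difference starts to matter for you.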

Replies are listed 'Best First'.
Re^2: Use Schwartzian transform across multiple files
by Sonya777 (Novice) on Sep 20, 2016 at 12:29 UTC

Hi Marshall! Thank you so much for your detailed response and guidance. I am a newbie in Perl and I got a bit lost in my attempt at using, as you say, advanced options, so I really appreciate your comments. I agree that using "glob" is easier and much clearer to me in this specific case.

I decided to go with the ST code since I have 16000 files with approx. 4000 arrays each. So based on your guidance this would be the faster way.

Most importantly, the ST script you have suggested works perfectly! I have just tried it on all the files and it was done in less than 5 minutes. Later on I will test and compare both scripts. With the ST script you have suggested I got the desired result and I cannot thank you enough.

It is always nice to see someone ready to share his knowledge with the world and it is a motivation for the rest of us to do the same! Cheers!
