Re^2: redirect output from a command to another command

Re^3: redirect output from a command to another command by ikegami (Patriarch) on Mar 02, 2011 at 17:19 UTC
I would think launching `diff` would be relatively expensive. The disk cache should eliminate all disk wait for small files. Have you considered Algorithm::Diff?
Re^4: redirect output from a command to another command by Allasso (Monk) on Mar 02, 2011 at 20:31 UTC
Yes, I thought about that, but I have not been inclined to use CPAN modules. For one, I lose portability. Also, whenever I looked into using one in the past (I can't remember which ones now), I was discouraged by the lack of simple documentation and of instructions on how to use it. All I could ever find was scraps of this and that. EDIT: Though I must say, the CPAN page for Algorithm::Diff is the best I've seen.
Re^5: redirect output from a command to another command by ikegami (Patriarch) on Mar 02, 2011 at 21:02 UTC
Re "I lose portability": You'd gain portability.
Re^6: redirect output from a command to another command by Allasso (Monk) on Mar 02, 2011 at 21:31 UTC
Re^7: redirect output from a command to another command by ikegami (Patriarch) on Mar 03, 2011 at 00:16 UTC
Re^4: redirect output from a command to another command by Allasso (Monk) on Mar 03, 2011 at 03:05 UTC
UPDATE: I will soon be posting results from more extensive testing to show what these figures reflect.

I did a speed comparison between using Algorithm::Diff and writing all the data to temp files and calling the system diff, with repeated execution on 105 HTML files.

using system diff: 1.9 s average
using Algorithm::Diff: 95 s average

Not too impressed with Algorithm::Diff...
Re^5: redirect output from a command to another command by Anonymous Monk on Mar 03, 2011 at 03:51 UTC
Which version of each? Care to share your benchmark?
Re^5: redirect output from a command to another command by Allasso (Monk) on Mar 03, 2011 at 19:13 UTC
Some further investigation revealed some interesting results.

First, the conditions of the tests, which reflect the specific needs I have for the algorithm. The main idea is that I want to compare words (strings of characters separated by whitespace) in two files. I am not concerned about changes in whitespace, so every run of consecutive whitespace is collapsed to a single \n character. A newline was the necessary separator for preprocessing for diffutils diff, and for consistency I left it the same when using the CPAN module.

For the CPAN module method, I used the example code from the CPAN Algorithm::Diff webpage to perform the actual comparison. The files were read into scalars, the substitutions were done, and the modified scalars were then split into arrays at the \n's. These arrays are what the example code then uses.

For the diffutils method, the files were read into scalars, the substitutions were made, and the modified scalars were written to temp files, whose names were passed as arguments to the diff command executed from the script.

Ultimately, I want to do a recursive comparison of file hierarchies, but to get clearer data on the two algorithms, I first ran tests comparing the same two files numerous times and compared the results for each algorithm. This test gives the closest comparison of strictly the algorithm itself (with one possibly disputable exception, which I will elaborate on below). While the results from this test still showed the diffutils method to be quite a bit faster, the difference was not the dramatic roughly 50-fold one that I observed yesterday (95 s versus 1.9 s); it was more on the order of 3.4 times.

However, I still needed to test what I would really be doing, which is a recursive comparison. It was when I ran these tests that they revealed a 55-fold increase in time using the CPAN module.
I do not understand the reason for such disproportionate results. I have carefully laid out my methods and code below.

-----------

Tested on an iMac G5, 1.8 GHz PPC, 1 GB RAM, OS X 10.4.11.
Algorithm::Diff version 1.1902
diffutils version 2.8.1

The first test compared the algorithms alone, running the same two files 1000 times. This test was performed 5 times for each algorithm, and the whole sequence was done twice, alternating between the two. The files used were HTML files of approx. 28 kB each. They were not identical. Results were:

```
CPAN Algorithm::Diff method:
time to compare same two files 1000 times:
40.28 sec
39.68 sec
39.97 sec
40.17 sec
39.70 sec
39.71 sec
39.61 sec
39.82 sec
39.71 sec
39.60 sec
avg. = 39.83 sec

diffutils method:
time to compare same two files 1000 times:
11.68 sec
11.72 sec
11.63 sec
11.69 sec
11.78 sec
11.62 sec
11.62 sec
11.78 sec
11.66 sec
11.62 sec
avg. = 11.68 sec
```

The second test was a recursive comparison of two directories, each containing 105 HTML files. About half of the files were not identical. The total was approx. 2.6 MB for each tree. The recursion is iterated 10 times. I ran this test 10 times using the diffutils method and 2 times using the Algorithm::Diff method. Not being comfortable with my CPU running at the rail for 15 minutes, I then ran the Algorithm::Diff method iterating over the recursion once, giving it a rest, and repeating. I repeated this 8 times. I alternated between the two algorithms. Results were:

```
CPAN Algorithm::Diff method:
time to compare 105 file pairs, 10 times:
926.7 sec
924.2 sec

time to compare 105 file pairs, 1 time:
91.02 sec
91.09 sec
91.09 sec
91.10 sec
90.93 sec
91.19 sec
93.42 sec
91.58 sec
avg time for a single comparison of 105 file pairs: 92.23 secs

diffutils method:
time to compare 105 file pairs, 10 times:
16.76 sec
16.65 sec
16.67 sec
16.68 sec
16.85 sec
16.71 sec
16.72 sec
16.80 sec
16.92 sec
16.84 sec
avg. time for single comparison of 105 file pairs: 1.676 secs
```

Summary of tests:

```
repeatedly compare same two files 1000 times:
average times:
Algorithm::Diff   39.83 sec
diffutils         11.68 sec

compare 105 different pairs of files 1 time:
average times:
Algorithm::Diff   92.23 sec
diffutils         1.676 sec
```

I will note that someone may dispute that, in the first set of tests, the Algorithm::Diff method splits the text string on every iteration inside the timing loop, so the test is not purely of the algorithm alone. While this may be true, I did it this way so that it would be a one-to-one comparison in the context of what I was trying to accomplish, i.e., I wanted to have the same framework code and just be able to interchange the two methods. However, for the sake of fairness to the algorithm, I moved the split out of the timing loop; over 5 runs of the test, the average time for 1000 iterations went to 26.92 sec. (I refrain from posting all the data on that.) It should be noted, though, that for the second test it was necessary to have the split in the loop, since we are comparing different files every time.

---------------

Here is the code I used in the tests.

Code used to run the same file 1000 times:

```
## this is the framework:
## One of the two code snippets below is
## substituted for ### DIFF ALGORITHM HERE..

#!/usr/bin/perl
use strict;
use lib "/Users/allasso/AWS/utility/cpan/lib/perl5/site_perl";
require Algorithm::Diff;
use Time::HiRes qw( time );

my($source_path_1, $source_path_2) = @ARGV;

my $holdRS = $/;
local $/;
if (! open(FH, $source_path_1)) {
    print "unable to open source file 1: $source_path_1\n";
}
my $filestring_1 = <FH>;
$/ = $holdRS;
close(FH);

$holdRS = $/;
local $/;
if (! open(FH, $source_path_2)) {
    print "unable to open source file 2: $source_path_2\n";
}
my $filestring_2 = <FH>;
$/ = $holdRS;
close(FH);

$filestring_1 =~ s@\s+@\n@g;
$filestring_2 =~ s@\s+@\n@g;

my $time = time();

for my $count (0..999) {

### DIFF ALGORITHM HERE..

}

my $time_4sig = time() - $time + .005;
$time_4sig =~ s@^(.....).@$1@;

print STDERR "\n\net: ".$time_4sig."\n";

exit;
```

```
## this is the CPAN Algorithm::Diff code:

my @seq1 = split(/\n/, $filestring_1);
my @seq2 = split(/\n/, $filestring_2);

my $diff = Algorithm::Diff->new( \@seq1, \@seq2 );

$diff->Base( 1 ); # Return line numbers, not indices
while( $diff->Next() ) {
    next if $diff->Same();
    my $sep = '';
    if( ! $diff->Items(2) ) {
        printf "%d,%dd%d\n", $diff->Get(qw( Min1 Max1 Max2 ));
    } elsif( ! $diff->Items(1) ) {
        printf "%da%d,%d\n", $diff->Get(qw( Max1 Min2 Max2 ));
    } else {
        $sep = "\n---\n";
        printf "%d,%dc%d,%d\n", $diff->Get(qw( Min1 Max1 Min2 Max2 ));
    }
    print "< $_" for $diff->Items(1);
    print $sep;
    print "> $_\n" for $diff->Items(2);
}
```

```
## this is the diffutils code:

if (! open(FH, ">/tmp/diff_774885959483_1")) {
    print "unable to open temporary file\n";
}
print FH "$filestring_1";
close (FH);

if (! open(FH, ">/tmp/diff_774885959483_2")) {
    print "unable to open temporary file\n";
}
print FH "$filestring_2";
close (FH);

print "$source_path_1 ::: $source_path_2\n";
print `diff --suppress-common-lines -y /tmp/diff_774885959483_1 /tmp/diff_774885959483_2`;
```

This is the framework for the recursive comparison of 105 files, in which one of the two code snippets posted directly above was substituted for `### DIFF ALGORITHM HERE..`.
```
#!/usr/bin/perl
use strict;
use lib "/Users/allasso/AWS/utility/cpan/lib/perl5/site_perl";
require Algorithm::Diff;
use Time::HiRes qw( time );

my($source_path_1, $source_path_2) = @ARGV;

$source_path_1 =~ s@\x2f$@@;
$source_path_2 =~ s@\x2f$@@;

my @src_list_1 = `find $source_path_1 -name ".htm"`;
my @src_list_2 = `find $source_path_2 -name ".htm"`;

my $time = time();

for my $count (0..9) {
    my $list_cnt = 0;
    for my $file_src_1 (@src_list_1) {
        my $file_src_2 = $src_list_2[$list_cnt++];
        chomp $file_src_1;
        chomp $file_src_2;

        my $holdRS = $/;
        local $/;
        if (! open(FH, $file_src_1)) {
            print "unable to open source file 1: $file_src_1\n";
        }
        my $filestring_1 = <FH>;
        $/ = $holdRS;
        close(FH);

        $holdRS = $/;
        local $/;
        if (! open(FH, $file_src_2)) {
            print "unable to open source file 2: $file_src_2\n";
        }
        my $filestring_2 = <FH>;
        $/ = $holdRS;
        close(FH);

        $filestring_1 =~ s@\s+@\n@g;
        $filestring_2 =~ s@\s+@\n@g;

        ### DIFF ALGORITHM HERE..

    }
}

my $time_4sig = time() - $time + .005;
$time_4sig =~ s@^(.....).@$1@;

print STDERR "\n\net: ".$time_4sig."\n";

exit;
```

I am also posting the full script for each method used in the recursive comparison (which yielded the curiously slow result with the CPAN module), copied and pasted directly after performing the tests for each method.
I am doing this to eliminate any question about the posted code not reflecting the actual test:

```
## full recursive script using CPAN Algorithm::Diff :

#!/usr/bin/perl
use strict;
use lib "/Users/allasso/AWS/utility/cpan/lib/perl5/site_perl";
require Algorithm::Diff;
use Time::HiRes qw( time );

my($source_path_1, $source_path_2) = @ARGV;

$source_path_1 =~ s@\x2f$@@;
$source_path_2 =~ s@\x2f$@@;

my @src_list_1 = `find $source_path_1 -name ".htm"`;
my @src_list_2 = `find $source_path_2 -name ".htm"`;

my $time = time();

for my $count (0..9) {
    my $list_cnt = 0;
    for my $file_src_1 (@src_list_1) {
        my $file_src_2 = $src_list_2[$list_cnt++];
        chomp $file_src_1;
        chomp $file_src_2;

        my $holdRS = $/;
        local $/;
        if (! open(FH, $file_src_1)) {
            print "unable to open source file 1: $file_src_1\n";
        }
        my $filestring_1 = <FH>;
        $/ = $holdRS;
        close(FH);

        $holdRS = $/;
        local $/;
        if (! open(FH, $file_src_2)) {
            print "unable to open source file 2: $file_src_2\n";
        }
        my $filestring_2 = <FH>;
        $/ = $holdRS;
        close(FH);

        $filestring_1 =~ s@\s+@\n@g;
        $filestring_2 =~ s@\s+@\n@g;

        ## begin CPAN algorithm:
        my @seq1 = split(/\n/, $filestring_1);
        my @seq2 = split(/\n/, $filestring_2);

        my $diff = Algorithm::Diff->new( \@seq1, \@seq2 );

        $diff->Base( 1 ); # Return line numbers, not indices
        while( $diff->Next() ) {
            next if $diff->Same();
            my $sep = '';
            if( ! $diff->Items(2) ) {
                printf "%d,%dd%d\n", $diff->Get(qw( Min1 Max1 Max2 ));
            } elsif( ! $diff->Items(1) ) {
                printf "%da%d,%d\n", $diff->Get(qw( Max1 Min2 Max2 ));
            } else {
                $sep = "\n---\n";
                printf "%d,%dc%d,%d\n", $diff->Get(qw( Min1 Max1 Min2 Max2 ));
            }
            print "< $_" for $diff->Items(1);
            print $sep;
            print "> $_\n" for $diff->Items(2);
        }
        ## end CPAN algorithm
    }
}

my $time_4sig = time() - $time + .005;
$time_4sig =~ s@^(.....).@$1@;

print STDERR "\n\net: ".$time_4sig."\n";

exit;
```

```
## full recursive script using diffutils :

#!/usr/bin/perl
use strict;
use lib "/Users/allasso/AWS/utility/cpan/lib/perl5/site_perl";
require Algorithm::Diff;
use Time::HiRes qw( time );

my($source_path_1, $source_path_2) = @ARGV;

$source_path_1 =~ s@\x2f$@@;
$source_path_2 =~ s@\x2f$@@;

my @src_list_1 = `find $source_path_1 -name ".htm"`;
my @src_list_2 = `find $source_path_2 -name ".htm"`;

my $time = time();

for my $count (0..9) {
    my $list_cnt = 0;
    for my $file_src_1 (@src_list_1) {
        my $file_src_2 = $src_list_2[$list_cnt++];
        chomp $file_src_1;
        chomp $file_src_2;

        my $holdRS = $/;
        local $/;
        if (! open(FH, $file_src_1)) {
            print "unable to open source file 1: $file_src_1\n";
        }
        my $filestring_1 = <FH>;
        $/ = $holdRS;
        close(FH);

        $holdRS = $/;
        local $/;
        if (! open(FH, $file_src_2)) {
            print "unable to open source file 2: $file_src_2\n";
        }
        my $filestring_2 = <FH>;
        $/ = $holdRS;
        close(FH);

        $filestring_1 =~ s@\s+@\n@g;
        $filestring_2 =~ s@\s+@\n@g;

        ## begin diffutils algorithm:
        if (! open(FH, ">/tmp/diff_774885959483_1")) {
            print "unable to open temporary file\n";
        }
        print FH "$filestring_1";
        close (FH);

        if (! open(FH, ">/tmp/diff_774885959483_2")) {
            print "unable to open temporary file\n";
        }
        print FH "$filestring_2";
        close (FH);

        #print "$file_src_1 ::: $file_src_1\n";
        print `diff --suppress-common-lines -y /tmp/diff_774885959483_1 /tmp/diff_774885959483_2`;
        ## end diffutils algorithm
    }
}

my $time_4sig = time() - $time + .005;
$time_4sig =~ s@^(.....).@$1@;

print STDERR "\n\net: ".$time_4sig."\n";

exit;
```
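For anyone repeating these measurements, the timing part of the framework can be reduced to a small reusable helper. This is only a sketch and not the code used in the tests above: `time_runs` and `mean` are hypothetical names, the workload is a stand-in `split`, and `sprintf`-style formatting replaces the add-.005-and-truncate step for rounding. It uses only the core Time::HiRes module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( time );

# Run $code $count times and return the elapsed wall-clock seconds.
# (time_runs is a hypothetical helper name, not part of any module.)
sub time_runs {
    my ($count, $code) = @_;
    my $start = time();
    $code->() for 1 .. $count;
    return time() - $start;
}

# Arithmetic mean of a list of timings.
sub mean {
    my @values = @_;
    die "mean() needs at least one value\n" unless @values;
    my $sum = 0;
    $sum += $_ for @values;
    return $sum / @values;
}

# Example: time a stand-in workload five times and report the average.
# A real test would call the diff code inside the sub instead.
my $text = "some  sample\ttext\nto split " x 100;
my @timings;
for my $run (1 .. 5) {
    push @timings, time_runs(1000, sub { my @words = split ' ', $text });
}
printf STDERR "et: %.2f sec\n", $_ for @timings;
printf STDERR "avg. = %.2f sec\n", mean(@timings);
```

Passing the code under test as a sub keeps the framework identical for both methods, which is the one-to-one comparison the tests above were aiming for.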
Re^3: redirect output from a command to another command by roboticus (Chancellor) on Mar 07, 2011 at 14:09 UTC
Allasso: If you write to the file, and then use and delete it shortly thereafter, it may not even get written to the disk at all. It may simply reside in memory buffers. So don't be afraid of short-term temporary files. They can even be handy debugging tools: just comment out the delete, so you can see what the intermediate results were in an operation.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
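The "comment out the delete" trick can also be made a switch. The following is a minimal sketch using the core File::Temp module; `scratch_file` and the `KEEP_TEMP` environment variable are names invented for this illustration, not anything from the scripts in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Write $text to a uniquely named temporary file and return its path.
# When $keep is true the file is left on disk for inspection; otherwise
# File::Temp removes it when the program exits.
# (scratch_file is a hypothetical helper name.)
sub scratch_file {
    my ($text, $keep) = @_;
    my ($fh, $path) = tempfile(
        'scratch_XXXXXX',
        TMPDIR => 1,                 # create it in the system temp directory
        UNLINK => ($keep ? 0 : 1),   # 0 = keep the file, 1 = delete at exit
    );
    print {$fh} $text or die "cannot write $path: $!";
    close $fh or die "cannot close $path: $!";
    print STDERR "kept intermediate file: $path\n" if $keep;
    return $path;
}

# KEEP_TEMP is an assumed environment variable for this sketch,
# e.g. run as:  KEEP_TEMP=1 perl script.pl
my $keep = $ENV{KEEP_TEMP} ? 1 : 0;
my $path = scratch_file("one\ntwo\nthree\n", $keep);
print "wrote ", (-s $path), " bytes to $path\n";
```

Letting File::Temp pick the name also avoids collisions between two runs of the same script, which a fixed name under /tmp cannot guarantee.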
Re^4: redirect output from a command to another command by Allasso (Monk) on Mar 09, 2011 at 23:11 UTC
Yes, from the results I was seeing, I had a feeling that was what was happening, i.e., the files weren't even going to disk. I was caught in the paradigm that only variables reside in memory. Another paradigm that I wonder may be fallacious is that calling a system command is more expensive than using a module. Is there really a (significant) difference between forking a process and executing code loaded from a module? If there is a slight cost to forking, it doesn't seem like it would be that significant. Someone enlighten me. Anyway, looking at the results of my tests, it is hard to convince me that there is anything to be gained by using Algorithm::Diff as far as speed goes. Portability, perhaps, as some have pointed out.
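One way to put a number on the cost of launching a process is to time a child that does nothing. The sketch below is an assumption-laden illustration, not a measurement from this thread: `seconds_per_launch` and `seconds_per_call` are hypothetical names, and the child is the perl binary itself (`$^X`) because it is guaranteed to exist, although it is probably heavier to start than `diff`. For scale, the diffutils figure reported above (11.68 sec for 1000 comparisons) works out to under 12 ms per comparison, temp-file writes and process launch included.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( time );

# Average wall-clock seconds to start a child process and wait for it.
# $^X is the perl binary running this script; "-e 1" makes the child
# exit at once, so most of the time measured is launch overhead.
# (seconds_per_launch is a hypothetical helper name.)
sub seconds_per_launch {
    my ($count) = @_;
    my $start = time();
    for (1 .. $count) {
        system($^X, '-e', '1') == 0
            or die "child process failed (status $?)\n";
    }
    return (time() - $start) / $count;
}

# Average wall-clock seconds for one call of an in-process sub.
sub seconds_per_call {
    my ($count, $code) = @_;
    my $start = time();
    $code->() for 1 .. $count;
    return (time() - $start) / $count;
}

my $launch = seconds_per_launch(50);
my $call   = seconds_per_call(50, sub { my @w = split ' ', "a b c " x 1000 });
printf "per launch: %.2f ms\n", $launch * 1000;
printf "per call:   %.4f ms\n", $call * 1000;
```

The launch cost is a fixed price per call, so it matters when the work done per call is tiny and fades when each call does a lot of work.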