script optmization

shoura has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: script optmization by 1nickt (Canon) on May 14, 2017 at 16:37 UTC
Hi, in two threads now you are asking for advice on how to mark matching text substrings with some special marker, by hand, in a very naive way. I suppose you are then going through and looking for the marked substrings to do something else with them. If you described at a higher level what it is you want to accomplish, the monks could probably provide advice that would get you to your goal in a much simpler way, and coincidentally you'll probably get the optimization you seek. Perl is the most useful tool set for processing text. It's almost certain that there are modules on CPAN to do what you want. Part of learning to be a programmer is learning which tools already exist and which are best for the job, not always to make a new tool. Your code as presented could be improved in quite a few ways, but I for one am not inclined to spend time on it since it so obviously is the wrong approach to something. So, care to describe your real task and goal? ( See also XY Problem; Premature Optimization ) Hope this helps! The way forward always starts with a minimal test.	[reply]
Re: script optmization by tybalt89 (Monsignor) on May 14, 2017 at 21:13 UTC
Here's one that does case-insensitive matching and preserves any capitalization. It just needs slight modification to read and write files. :) #!/usr/bin/perl # http://perlmonks.org/?node_id=1190261 use strict; use warnings; my $seq_txt_file = <<END; scooped up by social travesty without proper sanitation END my @seqs = sort { length $b <=> length $a } map s/^\s+\|\s+$//gr, split /\n/, $seq_txt_file; my $regex = do { local $" = '\|'; qr/(@seqs)/i }; print s/$regex/ $1 =~ s!\h+!bbb!gr /ger while <DATA>; __DATA__ Many of them are scooped up by chambermaids, thrown into bin bags and +sent off to landfill sites, which is a disaster for the environment a +nd a social travesty given that many people around the world are goin +g without proper sanitation. Another test with capitals, a Social Travesty if I've ever seen one wi +thout Proper Sanitation. [download]	[reply] [d/l]
Re: script optmization by BillKSmith (Monsignor) on May 14, 2017 at 14:56 UTC
None of the solutions given so far would find a word that is divided at the end of a line. Is this as issue? What about capitalization? Embedded in longer words? Plurals, etc? I doubt that it would help your speed problem, but your 'bylen' is good candidate for a module. In fact, it is an example in the documentation for List::UtilsBy. Bill	[reply]
Re: script optmization by kcott (Archbishop) on May 15, 2017 at 03:53 UTC
G'day shoura, Here's some techniques for removing redundant and duplicated processing. There's a single pass through the `SEQ` file. Trimming leading and trailing whitespace, as well as the line terminator, uses a single regex capture. The replacement text, for later use, is calculated once here. The regex for use when processing the `TXT` file is created once only. There's a single pass through the `TXT` file; and there's no nesting. The substitution, using values already calculated, is the only processing done here. I've used Inline::Files as the actual opening and closing of your files doesn't appear to need optimising. Having said that, I would urge you to use lexical filehandles throughout, and the 3-argument form of open. #!/usr/bin/env perl use strict; use warnings; use Inline::Files; my %seq; while (<SEQ>) { my ($trim) = /^\s(.?)\s*$/; ($seq{$trim} = $trim) =~ s/\h+/bbb/g; } my $re = qr{(@{[join '\|', sort { length $b <=> length $a } keys %seq]} +)}; while (<TXT>) { s/$re/$seq{$1}/g; print; } __SEQ__ scooped up by social travesty without proper sanitation __TXT__ Many of them are scooped up by chambermaids, thrown into bin bags and +sent off to landfill sites, which is a disaster for the environment a +nd a social travesty given that many people around the world are goin +g without proper sanitation. [download] Output: `Many of them are scoopedbbbupbbbby chambermaids, thrown into bin bags +and sent off to landfill sites, which is a disaster for the environme +nt and a socialbbbtravesty given that many people around the world ar +e going withoutbbbproperbbbsanitation.` [download] — Ken	[reply] [d/l] [select]
Re^2: script optmization by Marshall (Canon) on May 15, 2017 at 23:21 UTC
Ken, I really like your post++. A couple of very,very minor nits which I show in code below: I think the fastest way to remove leading and trailing white space is like the code below, using 2 Perl statements instead of `$string =~ s/^\s+\|\s+$//g` or your `my ($trim) = /^\s(.?)\s*$/;`. The Perl documentation talks about this somewhere in the regex docs. But a quick search didn't find this quickly otherwise I would post a link. Anyway, the explanation goes that regex engine works best with fixed anchors and that 2 very easy regex statements run faster than a single more complex one. I split your $re statement into two parts to simplify the syntax. Creating an intermediate variable is very "cheap". I didn't benchmark, but your code creates an anon array which is then de-referenced. My code only creates a scalar, which in general will be faster. I see no need at all to sort the search terms, so I didn't do that. The regex is going to match any of the 3 or'd "search phrases" no matter what the order in the regex is. Changing the order in the regex will not necessarily result in any performance change at all. The OP's requirement "for a sorted order" makes no sense to me at all. I see some suggestion to use threads or other parallel processing strategies. It appears to me that this will be an I/O bound application and such complex things won't matter at all. Having said the above. Neither point makes a darn bit of difference in this case. I made this post because point (1) has relevance beyond this Op's question. For performance: The "setup" won't matter much because this is done once. Then: Read Line, Run Regex, Print Line is about as fast as this usually gets without complicated heroics. Another Monk queried about the OP's purpose? Sometimes a post is just an academic question. Sounds like there is some real application here that we don't understand. The only reason to put these "markers" into the text is for later processing. Maybe that processing, whatever it is, can be combined into a single step? That could lead to a big speed increase. I mean that second step of processing will have to search the entire text to find the bbb markers yet again. #!/usr/bin/env perl use strict; use warnings; use Inline::Files; my %seq; # example: 'scooped up again' => 'scoopedbbbupbbbagain', while (my $line = <SEQ>) { $line =~ s/^\s+//; $line =~ s/\s+$//; ($seq{$line} = $line) =~ s/\h+/bbb/g; } my $search_phrases = join '\|', keys %seq; my $re = qr{($search_phrases)}; while (<TXT>) { s/$re/$seq{$1}/g; print; } __SEQ__ scooped up by social travesty without proper sanitation __TXT__ Many of them are scooped up by chambermaids, thrown into bin bags and +sent off to landfill sites, which is a disaster for the environment a +nd a social travesty given that many people around the world are goin +g without proper sanitation. [download]	[reply] [d/l] [select]
Re^3: script optmization by kcott (Archbishop) on May 16, 2017 at 07:08 UTC
G'day Marshall, Thanks for the positive feedback. I have some comments on your first three points. Re "... fastest way to remove leading and trailing white space ...". I've also seen the documentation about anchors; I can't remember where; I have an inkling it may have been in a book: the regex I used was anchored at both ends (`/^\s(.?)\s$/`). In terms of two easy vs. one complex regex, that's going to depend on relative complexity and the string operated on. I wrote this benchmark: #!/usr/bin/env perl -l use strict; use warnings; use constant STRING => " \t aaa bbb ccc \t \n"; use Benchmark 'cmpthese'; print 'Sanity Tests:'; print 'shoura: >', shoura_code(), '<'; print 'kcott: >', kcott_code(), '<'; print 'marshall: >', marshall_code(), '<'; cmpthese 0 => { S => \&shoura_code, K => \&kcott_code, M => \&marshall_code, }; sub shoura_code { local $_ = STRING; chomp; s/^\s+\|\s+$//g; return $_; } sub kcott_code { local $_ = STRING; ($_) = /^\s(.?)\s$/; return $_; } sub marshall_code { local $_ = STRING; s/^\s+//; s/\s+$//; return $_; } [download] I ran it five times — that's usual for me — here's the result that was closest to an average: `Sanity Tests: shoura: >aaa bbb ccc< kcott: >aaa bbb ccc< marshall: >aaa bbb ccc< Rate S M K S 292306/s -- -32% -37% M 432626/s 48% -- -7% K 464863/s 59% 7% --` [download] There was quite a lot of variance; although 'K' was always faster than 'M'. The five K-M percentages were: 9, 7, 2, 14, 7. Both 'K' and 'M' were always substantially faster than 'S'. Re "... split your $re statement into two parts ...". I often use the '`@{[...]}`' construct when interpolating the results of some processing into a string. My main intent was to create the regex once, instead of the (presumably) millions of times in the inner loop of the OP's code. I also benchmarked this (see the spoiler): it looks like your total saving would be measured in nanoseconds. <Reveal this spoiler or all in this thread> Re "I see no need at all to sort the search terms, ... The OP's requirement "for a sorted order" makes no sense to me at all.". I can understand that from the minimal test data supplied by the OP; however, the reason is probably to handle sequences with common sections. Consider the test data I used in the second benchmark: `my %seq = ( 'W X Y' => 'WbbbXbbbY', 'X Y' => 'XbbbY', 'X Y Z' => 'XbbbYbbbZ', );` [download] If the target string was `"W X Y Z"`, the results could one of these three: `W XbbbY Z WbbbXbbbY Z W XbbbYbbbZ` [download] Sorting by length would reduce that to two results. There may well be a requirement to also sort lexically. Perhaps like this: `sort { length $b <=> length $a \|\| $a cmp $b }` [download] But the OP has not given sufficient information. In fact, as I write this, it's been almost two days since the original posting and all requests for additional information have been ignored. — Ken	[reply] [d/l] [select]
Re: script optmization by Anonymous Monk on May 14, 2017 at 13:40 UTC
I don't understand the purpose of what you're doing, but you can move some of the work outside the main loop like this: `my @pat; foreach my $r (@seq) { my $t = $r; $t =~ s/\h+/bbb/g; push @pat, [ $r, $t ]; } while (<$fh>) { foreach my $p (@pat) { s/$p->[0]/$p->[1]/g; } print Newfile; }` [download] And if you want to get fancy, maybe even this: `my $seq = join '\|', @seq; my %seq; foreach my $r (@seq) { (my $t = $r) =~ s/\h+/bbb/g; $seq{$r} = $t; } while (<$fh>) { s/($seq)/$seq{$1}/go; print Newfile; }` [download]	[reply] [d/l] [select]
Re^2: script optmization by shoura (Novice) on May 14, 2017 at 14:41 UTC
sorry but this don't work	[reply]
Re^3: script optmization by Anonymous Monk on May 14, 2017 at 17:13 UTC
What does "don't work" mean? If you want help, you're going to have to give us more information.	[reply]
Re: script optmization by Anonymous Monk on May 14, 2017 at 21:33 UTC
Loading 200 MB into available memory is likely possible on today's hardward. If so, the following runs ~ 7 times faster. The idea here is iteratating over seq one time. For larger data files, one can chunk 300 MB at a time time and not forgetting to read till the end of line to have a complete chunk. Then process the chunk similarly. use strict; use warnings; use autodie; open Newfile, ">", "./Newfile.txt" or die "Cannot create Newfile.txt"; my ($f1, $f2, @seq) = ('seq.txt', 'mytext.txt'); open(my $fh, $f1); foreach (<$fh>) { chomp; s/^\s+\|\s+$//g; push @seq, $_; } close $fh; @seq = sort bylen @seq; # need to sort @seq by length. my $data; { open($fh, $f2); local $/; $data = <$fh>; } foreach my $r (@seq) { my $t = $r; $t =~ s/\h+/bbb/g; $data =~ s/$r/$t/g; } print Newfile $data; close Newfile ; exit 0; sub bylen { length($b) <=> length($a); } [download]	[reply] [d/l]
Re^2: script optmization by Anonymous Monk on May 14, 2017 at 22:04 UTC
For larger data files and not wanting to deal with chunking manually, then there is the parallel MCE module. This is what one might construct using the MCE::Flow module. We're running 4 workers. Therefore chunking at 24 MB is plenty. Perl and CPAN are amazing allowing this. use strict; use warnings; use autodie; use MCE::Flow; open Newfile, ">", "./Newfile.txt" or die "Cannot create Newfile.txt"; Newfile->autoflush(1); # important, enable autoflush my ($f1, $f2, @seq) = ('seq.txt', 'mytext.txt'); open(my $fh, $f1); foreach (<$fh>) { chomp; s/^\s+\|\s+$//g; push @seq, $_; } close $fh; @seq = sort bylen @seq; # need to sort @seq by length. MCE::Flow::init { max_workers => 4, chunk_size => '24m', init_relay => 1, use_slurpio => 1, }; # For best performance, provide MCE the path, e.g. $f2 # versus a file handle. Workers communicate among themselves # the next offset without involving the manager process. mce_flow_f sub { my ($mce, $slurp_ref, $chunk_id) = @_; foreach my $r (@seq) { my $t = $r; $t =~ s/\h+/bbb/g; $$slurp_ref =~ s/$r/$t/g; } # Relay capability is useful for running something orderly. # For this use case, we've enabled autoflush on the file above. # Only one worker is allowed to run when entering the block. MCE::relay sub { print Newfile $$slurp_ref }; }, $f2; MCE::Flow::finish(); close Newfile; exit 0; sub bylen { length($b) <=> length($a); } [download]	[reply] [d/l]
Re^3: script optmization by Anonymous Monk on May 14, 2017 at 23:00 UTC
Time to run against a 200 MB file. That is the OP's input file and appending the same 884,873 times to make a 200 MB file. serial: 12.557 seconds slurped: 1.644 seconds 7.6 x parallel: 0.531 seconds 23.6 x	[reply]
Re^4: script optmization by Anonymous Monk on May 14, 2017 at 23:27 UTC
Re: script optmization by zentara (Cardinal) on May 14, 2017 at 13:04 UTC
Hi, wouldn't the place to start be using Text::Diff or Algorithm::Diff? I'm not really a human, but I play one on earth. ..... an animated JAPH	[reply]
Re^2: script optmization by shoura (Novice) on May 14, 2017 at 13:23 UTC
but I don't want to show difference , the script attach seq of word in a text file if it exist in sequence file	[reply]
Re^2: script optmization by Anonymous Monk on May 14, 2017 at 13:43 UTC
Dude, did you even read the question?	[reply]