Re: script optmization
by 1nickt (Canon) on May 14, 2017 at 16:37 UTC
|
Hi, in two threads now you are asking for advice on how to mark matching text substrings with some special marker, by hand, in a very naive way. I suppose you are then going through and looking for the marked substrings to do something else with them.
If you described at a higher level what it is you want to accomplish, the monks could probably provide advice that would get you to your goal in a much simpler way, and coincidentally you'll probably get the optimization you seek.
Perl is the most useful tool set for processing text. It's almost certain that there are modules on CPAN to do what you want. Part of learning to be a programmer is learning which tools already exist and which are best for the job, not always to make a new tool.
Your code as presented could be improved in quite a few ways, but I for one am not inclined to spend time on it since it so obviously is the wrong approach to something. So, care to describe your real task and goal?
( See also XY Problem; Premature Optimization )
Hope this helps!
The way forward always starts with a minimal test.
| [reply] |
Re: script optmization
by tybalt89 (Monsignor) on May 14, 2017 at 21:13 UTC
|
Here's one that does case-insensitive matching and preserves any capitalization.
It just needs slight modification to read and write files. :)
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1190261
use strict;
use warnings;
my $seq_txt_file = <<END;
scooped up by
social travesty
without proper sanitation
END
my @seqs = sort { length $b <=> length $a }
map s/^\s+|\s+$//gr,
split /\n/, $seq_txt_file;
my $regex = do { local $" = '|'; qr/(@seqs)/i };
print s/$regex/ $1 =~ s!\h+!bbb!gr /ger while <DATA>;
__DATA__
Many of them are scooped up by chambermaids, thrown into bin bags and
+sent off to landfill sites, which is a disaster for the environment a
+nd a social travesty given that many people around the world are goin
+g without proper sanitation.
Another test with capitals, a Social Travesty if I've ever seen one wi
+thout Proper Sanitation.
| [reply] [d/l] |
Re: script optmization
by BillKSmith (Monsignor) on May 14, 2017 at 14:56 UTC
|
None of the solutions given so far would find a word that is divided at the end of a line. Is this as issue? What about capitalization? Embedded in longer words? Plurals, etc?
I doubt that it would help your speed problem, but your 'bylen' is good candidate for a module. In fact, it is an example in the documentation for List::UtilsBy.
| [reply] |
Re: script optmization
by kcott (Archbishop) on May 15, 2017 at 03:53 UTC
|
#!/usr/bin/env perl
use strict;
use warnings;
use Inline::Files;
my %seq;
while (<SEQ>) {
my ($trim) = /^\s*(.*?)\s*$/;
($seq{$trim} = $trim) =~ s/\h+/bbb/g;
}
my $re = qr{(@{[join '|', sort { length $b <=> length $a } keys %seq]}
+)};
while (<TXT>) {
s/$re/$seq{$1}/g;
print;
}
__SEQ__
scooped up by
social travesty
without proper sanitation
__TXT__
Many of them are scooped up by chambermaids, thrown into bin bags and
+sent off to landfill sites, which is a disaster for the environment a
+nd a social travesty given that many people around the world are goin
+g without proper sanitation.
Output:
Many of them are scoopedbbbupbbbby chambermaids, thrown into bin bags
+and sent off to landfill sites, which is a disaster for the environme
+nt and a socialbbbtravesty given that many people around the world ar
+e going withoutbbbproperbbbsanitation.
| [reply] [d/l] [select] |
|
|
Ken, I really like your post++.
A couple of very,very minor nits which I show in code below:
- I think the fastest way to remove leading and trailing white space is like the code below, using 2 Perl statements instead of $string =~ s/^\s+|\s+$//g or your my ($trim) = /^\s*(.*?)\s*$/;. The Perl documentation talks about this somewhere in the regex docs. But a quick search didn't find this quickly otherwise I would post a link. Anyway, the explanation goes that regex engine works best with fixed anchors and that 2 very easy regex statements run faster than a single more complex one.
- I split your $re statement into two parts to simplify the syntax. Creating an intermediate variable is very "cheap". I didn't benchmark, but your code creates an anon array which is then de-referenced. My code only creates a scalar, which in general will be faster.
- I see no need at all to sort the search terms, so I didn't do that. The regex is going to match any of the 3 or'd "search phrases" no matter what the order in the regex is. Changing the order in the regex will not necessarily result in any performance change at all. The OP's requirement "for a sorted order" makes no sense to me at all.
- I see some suggestion to use threads or other parallel processing strategies. It appears to me that this will be an I/O bound application and such complex things won't matter at all.
Having said the above. Neither point makes a darn bit of difference in this case. I made this post because point (1) has relevance beyond this Op's question. For performance: The "setup" won't matter much because this is done once. Then: Read Line, Run Regex, Print Line is about as fast as this usually gets without complicated heroics.
Another Monk queried about the OP's purpose? Sometimes a post is just an academic question. Sounds like there is some real application here that we don't understand. The only reason to put these "markers" into the text is for later processing. Maybe that processing, whatever it is, can be combined into a single step? That could lead to a big speed increase. I mean that second step of processing will have to search the entire text to find the bbb markers yet again.
#!/usr/bin/env perl
use strict;
use warnings;
use Inline::Files;
my %seq; # example: 'scooped up again' => 'scoopedbbbupbbbagain',
while (my $line = <SEQ>)
{
$line =~ s/^\s+//;
$line =~ s/\s+$//;
($seq{$line} = $line) =~ s/\h+/bbb/g;
}
my $search_phrases = join '|', keys %seq;
my $re = qr{($search_phrases)};
while (<TXT>) {
s/$re/$seq{$1}/g;
print;
}
__SEQ__
scooped up by
social travesty
without proper sanitation
__TXT__
Many of them are scooped up by chambermaids, thrown into bin bags and
+sent off to landfill sites, which is a disaster for the environment a
+nd a social travesty given that many people around the world are goin
+g without proper sanitation.
| [reply] [d/l] [select] |
|
|
G'day Marshall,
Thanks for the positive feedback.
I have some comments on your first three points.
Re "... fastest way to remove leading and trailing white space ...".
I've also seen the documentation about anchors; I can't remember where; I have an inkling it may have been in a book:
the regex I used was anchored at both ends (/^\s*(.*?)\s*$/).
In terms of two easy vs. one complex regex, that's going to depend on relative complexity and the string operated on.
I wrote this benchmark:
#!/usr/bin/env perl -l
use strict;
use warnings;
use constant STRING => " \t aaa bbb ccc \t \n";
use Benchmark 'cmpthese';
print 'Sanity Tests:';
print 'shoura: >', shoura_code(), '<';
print 'kcott: >', kcott_code(), '<';
print 'marshall: >', marshall_code(), '<';
cmpthese 0 => {
S => \&shoura_code,
K => \&kcott_code,
M => \&marshall_code,
};
sub shoura_code {
local $_ = STRING;
chomp;
s/^\s+|\s+$//g;
return $_;
}
sub kcott_code {
local $_ = STRING;
($_) = /^\s*(.*?)\s*$/;
return $_;
}
sub marshall_code {
local $_ = STRING;
s/^\s+//;
s/\s+$//;
return $_;
}
I ran it five times — that's usual for me — here's the result that was closest to an average:
Sanity Tests:
shoura: >aaa bbb ccc<
kcott: >aaa bbb ccc<
marshall: >aaa bbb ccc<
Rate S M K
S 292306/s -- -32% -37%
M 432626/s 48% -- -7%
K 464863/s 59% 7% --
There was quite a lot of variance; although 'K' was always faster than 'M'.
The five K-M percentages were: 9, 7, 2, 14, 7.
Both 'K' and 'M' were always substantially faster than 'S'.
Re "... split your $re statement into two parts ...".
I often use the '@{[...]}' construct when interpolating the results of some processing into a string.
My main intent was to create the regex once,
instead of the (presumably) millions of times in the inner loop of the OP's code.
I also benchmarked this (see the spoiler):
it looks like your total saving would be measured in nanoseconds.
Re "I see no need at all to sort the search terms, ... The OP's requirement "for a sorted order" makes no sense to me at all.".
I can understand that from the minimal test data supplied by the OP;
however, the reason is probably to handle sequences with common sections.
Consider the test data I used in the second benchmark:
my %seq = (
'W X Y' => 'WbbbXbbbY',
'X Y' => 'XbbbY',
'X Y Z' => 'XbbbYbbbZ',
);
If the target string was "W X Y Z", the results could one of these three:
W XbbbY Z
WbbbXbbbY Z
W XbbbYbbbZ
Sorting by length would reduce that to two results.
There may well be a requirement to also sort lexically. Perhaps like this:
sort { length $b <=> length $a || $a cmp $b }
But the OP has not given sufficient information.
In fact, as I write this, it's been almost two days since the original posting
and all requests for additional information have been ignored.
| [reply] [d/l] [select] |
Re: script optmization
by Anonymous Monk on May 14, 2017 at 13:40 UTC
|
I don't understand the purpose of what you're doing, but you can move some of the work outside the main loop like this:
my @pat;
foreach my $r (@seq) {
my $t = $r;
$t =~ s/\h+/bbb/g;
push @pat, [ $r, $t ];
}
while (<$fh>) {
foreach my $p (@pat) {
s/$p->[0]/$p->[1]/g;
}
print Newfile;
}
And if you want to get fancy, maybe even this:
my $seq = join '|', @seq;
my %seq;
foreach my $r (@seq) {
(my $t = $r) =~ s/\h+/bbb/g;
$seq{$r} = $t;
}
while (<$fh>) {
s/($seq)/$seq{$1}/go;
print Newfile;
}
| [reply] [d/l] [select] |
|
|
sorry but this don't work
| [reply] |
|
|
What does "don't work" mean? If you want help, you're going to have to give us more information.
| [reply] |
Re: script optmization
by Anonymous Monk on May 14, 2017 at 21:33 UTC
|
Loading 200 MB into available memory is likely possible on today's hardward. If so, the following runs ~ 7 times faster. The idea here is iteratating over seq one time. For larger data files, one can chunk 300 MB at a time time and not forgetting to read till the end of line to have a complete chunk. Then process the chunk similarly.
use strict;
use warnings;
use autodie;
open Newfile, ">", "./Newfile.txt" or die "Cannot create Newfile.txt";
my ($f1, $f2, @seq) = ('seq.txt', 'mytext.txt');
open(my $fh, $f1);
foreach (<$fh>) {
chomp; s/^\s+|\s+$//g;
push @seq, $_;
}
close $fh;
@seq = sort bylen @seq; # need to sort @seq by length.
my $data; { open($fh, $f2); local $/; $data = <$fh>; }
foreach my $r (@seq) {
my $t = $r; $t =~ s/\h+/bbb/g;
$data =~ s/$r/$t/g;
}
print Newfile $data;
close Newfile ;
exit 0;
sub bylen {
length($b) <=> length($a);
}
| [reply] [d/l] |
|
|
use strict;
use warnings;
use autodie;
use MCE::Flow;
open Newfile, ">", "./Newfile.txt" or die "Cannot create Newfile.txt";
Newfile->autoflush(1); # important, enable autoflush
my ($f1, $f2, @seq) = ('seq.txt', 'mytext.txt');
open(my $fh, $f1);
foreach (<$fh>) {
chomp; s/^\s+|\s+$//g;
push @seq, $_;
}
close $fh;
@seq = sort bylen @seq; # need to sort @seq by length.
MCE::Flow::init {
max_workers => 4, chunk_size => '24m',
init_relay => 1, use_slurpio => 1,
};
# For best performance, provide MCE the path, e.g. $f2
# versus a file handle. Workers communicate among themselves
# the next offset without involving the manager process.
mce_flow_f sub {
my ($mce, $slurp_ref, $chunk_id) = @_;
foreach my $r (@seq) {
my $t = $r; $t =~ s/\h+/bbb/g;
$$slurp_ref =~ s/$r/$t/g;
}
# Relay capability is useful for running something orderly.
# For this use case, we've enabled autoflush on the file above.
# Only one worker is allowed to run when entering the block.
MCE::relay sub { print Newfile $$slurp_ref };
},
$f2;
MCE::Flow::finish();
close Newfile;
exit 0;
sub bylen {
length($b) <=> length($a);
}
| [reply] [d/l] |
|
|
Time to run against a 200 MB file. That is the OP's input file and appending the same 884,873 times to make a 200 MB file.
serial: 12.557 seconds
slurped: 1.644 seconds 7.6 x
parallel: 0.531 seconds 23.6 x
| [reply] |
|
|
Re: script optmization
by zentara (Cardinal) on May 14, 2017 at 13:04 UTC
|
| [reply] |
|
|
but I don't want to show difference , the script attach seq of word in a text file if it exist in sequence file
| [reply] |
|
|
Dude, did you even read the question?
| [reply] |