Re: multiple substitution

Re^2: multiple substitution by aaron_baugher (Curate) on Aug 25, 2012 at 16:39 UTC
I answered a similar question recently with a loop: `$s =~ s/$_/$h{$_}/g for keys %h;` So I wondered how that would compare to your solution of combining the searches into a single regex. I thought your way might win for a few words, but surely with a lot of words the complexity of the regex would slow it down, right?

Well, so much for that theory. The Perl regex engine continues to amaze me. I gave it a pattern combining 676 strings (all two-letter combinations) with pipes like yours, and it blew the for-loop method away (92 times faster). It also beat a regex solution using Regexp::Assemble, but I was using very simple and known search strings, so the hand-made pipe method was safe and simple. With unknown or more complex strings, where it is harder to hand-make a safe and efficient search pattern, I think Regexp::Assemble would probably come out on top eventually. Anyway, my test and results:

    abaugher@bannor> cat 989705.pl
    #!/usr/bin/env perl
    use Modern::Perl;
    use Benchmark qw(:all);
    use Regexp::Assemble;

    my %h = map { $_ => uc } ( 'aa' .. 'zz' );
    my $s = `cat bigfile`; # 8MB file

    say "Testing with @{[-s 'bigfile']} byte file and @{[ scalar keys %h ]} patterns";
    cmpthese( 10, {
        'forloop' => \&forloop,
        'pipes'   => \&pipes,
        'regexpa' => \&regexpa,
    });

    sub forloop {
        $s =~ s/$_/$h{$_}/g for keys %h;
    }
    sub pipes {
        my $p = join '|', keys %h;
        $s =~ s/($p)/$h{$1}/g;
    }
    sub regexpa {
        my $p = Regexp::Assemble->new->add(keys %h)->re;
        $s =~ s/($p)/$h{$1}/g;
    }

    abaugher@bannor> perl 989705.pl
    Testing with 8560854 byte file and 676 patterns
                  Rate forloop regexpa   pipes
    forloop 9.75e-02/s      --    -96%    -99%
    regexpa     2.40/s   2364%      --    -74%
    pipes       9.08/s   9213%    278%      --

Aaron B.
Available for small or large Perl jobs; see my home node.
Re^3: multiple substitution by AnomalousMonk (Archbishop) on Aug 25, 2012 at 18:11 UTC
The `pipes()` and `regexpa()` functions used in the timing loops above both include generation of the matching regexes in each loop execution. I doubt it adds greatly to the overall execution time, but is it proper to include regex generation in the timing of a substitution operation? On a more critical note, a substitution is done on the `$s` string in each repetition of each timing loop, but will there be anything to be found for substitution after the first pass of whatever timing function happens to be executed first? Are not all subsequent passes in all functions just comparing the time it takes for a regex to find no match in a string? (Maybe take the 8MB file content and `x` it into three identical 200 - 500MB strings and do just one comparison pass of substitutions on each string.)
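One way to address both points, as a rough sketch rather than a rerun of the original test: build the pattern once outside the timed subs, and have every timed sub work on a fresh copy of the input so each pass has something to replace. The sample text, the iteration count, and the `$source`/`$re` names are made up for illustration; the copy adds the same cost to every variant.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Illustrative data only: the same 676 two-letter keys as in the post above,
# but a small in-memory string instead of the 8MB file.
my %h      = map { $_ => uc $_ } ( 'aa' .. 'zz' );
my $source = join ' ', ('the quick brown fox jumps over the lazy dog') x 200;

# Build and compile the alternation once, outside the timed code.
my $alt = join '|', map { quotemeta } keys %h;
my $re  = qr/($alt)/;

# Each sub copies $source first, so every pass starts from unmodified text.
sub forloop {
    my $s = $source;
    $s =~ s/$_/$h{$_}/g for keys %h;
    return $s;
}

sub pipes {
    my $s = $source;
    $s =~ s/$re/$h{$1}/g;
    return $s;
}

cmpthese( 20, { forloop => \&forloop, pipes => \&pipes } );
```

The two subs need not return identical strings: the loop applies the keys one at a time in hash order, while the alternation makes a single left-to-right pass over the text.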
Re^3: multiple substitution by Corion (Patriarch) on Aug 25, 2012 at 16:48 UTC
I only (re)used what the OP had as a regular expression already. But your results mesh well with "When Perl Isn't Quite Fast Enough": the fewer ops you need, and the more you can do within the RE engine, the faster your Perl code is.
Re^2: multiple substitution by naturalsciences (Beadle) on Aug 25, 2012 at 10:08 UTC
Could you explain the code for a sec. Should those ! be /? I can understand that `$string =~ s/(apples|oranges|bananas)/$replace{$1}/e` would take the first match from the string ($1). Then, because of the /e flag, the second part of the substitution would be the value corresponding to the key ($1). What is the deal with the || (or?) statement? (I guess I'm mistaken about the ! elements.)

Would this code of mine work?

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my @keys = qw(F29-2 F29-3 F29-4 F44-2 F53-2 F38-3 F12-2);
    my @vals = qw(F29B2 F29B3 F29B4 F44B2 F53B2 F38B3 F12B2);
    my %replace;
    @replace{@keys} = @vals;

    while (my $line = <>) {
        if ($line =~ m/^\>/) {
            my $name = $line;
            $name =~ s/(F29-2,F29-3,F29-4,F44-2,F53-2,F38-3,F12-2)/$replace{$1}/;
            print $name;
        }
        elsif ($line !~ m/^\>/) {
            print $line;
        }
    }

Did not want to use some convoluted regexp patterns because they might be usable this time but not always. Want to learn the technique to do such list/hash substitutions as in original question.
Re^3: multiple substitution by AnomalousMonk (Archbishop) on Aug 25, 2012 at 17:28 UTC
"Did not want to use some convoluted regexp patterns ... Want to learn the technique to do such list/hash substitutions ..."

A common approach to handling long search/replace string lists is to generate the search regex automatically from the keys of the search/replace hash. (Then you just have to worry about getting the hash right!)

    >perl -wMstrict -le
    "my @keys = qw(F29-2 F29-3 F29-4 F44-2 F53-2 F38-3 F12-2);
     my @vals = qw(F29B2 F29B3 F29B4 F44B2 F53B2 F38B3 F12B2);
     my %replace;
     @replace{@keys} = @vals;
     ;;
     my $rx_search = join q{ | }, map quotemeta, keys %replace;
     $rx_search = qr{ $rx_search }xms;
     print $rx_search;
     ;;
     my $s = 'F99-9 FF29-22 -F29-2- F29-2 F44-2 F12-2';
     print qq{'$s'};
     my $t = $s;
     $t =~ s{ ($rx_search) }{$replace{$1}}xmsg;
     print qq{'$t'};
     ;;
     $t = $s;
     $t =~ s{ \b ($rx_search) \b }{$replace{$1}}xmsg;
     print qq{'$t'};
     ;;
     $t = $s;
     $t =~ s{ (?<! \S) ($rx_search) (?! \S) }{$replace{$1}}xmsg;
     print qq{'$t'};
    "
    (?^msx: F29\-4 | F53\-2 | F44\-2 | F29\-2 | F38\-3 | F29\-3 | F12\-2 )
    'F99-9 FF29-22 -F29-2- F29-2 F44-2 F12-2'
    'F99-9 FF29B22 -F29B2- F29B2 F44B2 F12B2'
    'F99-9 FF29-22 -F29B2- F29B2 F44B2 F12B2'
    'F99-9 FF29-22 -F29-2- F29B2 F44B2 F12B2'

Note that none of the conversion examples uses the `/e` switch; doing without it makes conversion slightly faster.

In all the conversion examples, F99-9 is never converted: it just doesn't appear in the conversion `@keys` array.

In the first conversion example, the F29-2 substring in FF29-22 and -F29-2- is converted even though it is embedded in another string: it appears in the conversion list. This is fixed for FF29-22 in the second example by using `\b` boundary assertions to allow conversion only if a search string is neither preceded nor followed by a 'word' character (`[A-Za-z0-9_]`), but this still allows the substring in -F29-2- to be replaced because '-' is not a word character. This problem (if problem it is) is fixed in the third example by using different boundary assertions: `(?<! \S)` and `(?! \S)` allow a match (and replacement) only if the potential match substring is neither preceded nor followed by a non-whitespace character.

    $name =~ s/(F29-2,F29-3,F29-4,F44-2,F53-2,F38-3,F12-2)/$replace{$1}/;

Note that | (pipe) and not , (comma) is the alternation metacharacter.

Update: aaron_baugher, in a reply already posted, gave an example of the automatic regex generation technique discussed above, but the examples of using boundary conditions to refine a match may still be useful.
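Put together with the header-line loop from the question, the generated-regex approach might look like the sketch below. The `convert_line` helper and the sample lines are hypothetical, and the boundary assertions `(?<![\w-])` and `(?![\w-])` are a variation chosen on the assumption that an ID may directly follow the leading `>`, where a whitespace-only lookbehind would refuse to match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @keys = qw(F29-2 F29-3 F29-4 F44-2 F53-2 F38-3 F12-2);
my @vals = qw(F29B2 F29B3 F29B4 F44B2 F53B2 F38B3 F12B2);
my %replace;
@replace{@keys} = @vals;

# Generate the alternation from the hash keys; quotemeta escapes the '-'.
my $alt = join '|', map { quotemeta } keys %replace;

# Assumed boundary rule: an ID must not touch a word character or a hyphen.
my $rx = qr/(?<![\w-])($alt)(?![\w-])/;

# Hypothetical helper: rewrite IDs on header lines, pass other lines through.
sub convert_line {
    my ($line) = @_;
    $line =~ s/$rx/$replace{$1}/g if $line =~ /^>/;
    return $line;
}

# Made-up sample input standing in for the real file.
my @sample = (
    ">F29-2 first sequence\n",
    "ACGTACGT\n",
    ">FF29-22 embedded ID, left alone\n",
);
print convert_line($_) for @sample;
```

For real input, the last two statements could be replaced by `print convert_line($_) while <>;`.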
Re^3: multiple substitution by cheekuperl (Monk) on Aug 25, 2012 at 13:07 UTC
"Should those ! be /"

That ! is alright. Perl allows that.

"Could you explain the code for a sec."

`$replace{$1} || $1`

This part replaces the matched string with itself in case %replace does not have a corresponding key. For example,

"Did not want to use some convoluted regexp"

Trust me, this is a simple regex. It can get a lot worse if you delve deeper :)

"Want to learn the technique to do such list/hash substitutions as in original question"

As far as searching and replacing in strings is concerned, I guess regexes would be most helpful.
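A small illustration of that fallback, with an invented hash and a deliberately broad `\w+` pattern so that some matches have no entry in `%replace` (the `swap_words` name is hypothetical too). It also uses `!` as the delimiter, which behaves the same as `/`.

```perl
use strict;
use warnings;

my %replace = ( apples => 'pears', oranges => 'lemons' );

# The pattern matches every word, but only some words are hash keys.
# With /e the replacement is evaluated as code, so || can supply a fallback:
# a missing key yields undef, which is false, so the word replaces itself.
sub swap_words {
    my ($text) = @_;
    $text =~ s!(\w+)!$replace{$1} || $1!ge;
    return $text;
}

print swap_words('apples and oranges and bananas'), "\n";
# prints: pears and lemons and bananas
```

Note that `||` also falls back when the stored value is false (an empty string or `0`); on Perl 5.10 or later, `//` falls back only when the value is undefined.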
Re^3: multiple substitution by Corion (Patriarch) on Aug 25, 2012 at 13:20 UTC
If you are unfamiliar with `s///` and `s!!!`, I already linked to the relevant documentation, perlop. Please read it. Regarding your own attempt, what happened when you tried it?
Re^2: multiple substitution by naturalsciences (Beadle) on Aug 25, 2012 at 09:34 UTC
OK, thanks! Quote: "s///e treats the replacement text as Perl code, rather than a double-quoted string." Well, that could be useful!
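To see the difference the quoted sentence describes, here is a minimal comparison using a made-up `%price` hash: the same substitution is run once without `/e` and once with it.

```perl
use strict;
use warnings;

my %price = ( apple => 3, pear => 4 );

# Without /e the replacement is interpolated like a double-quoted string:
# $price{$1} is filled in, but "* 2" stays literal text.
( my $plain = 'apple pear' ) =~ s/(\w+)/$price{$1} * 2/g;

# With /e the replacement is evaluated as a Perl expression.
( my $evald = 'apple pear' ) =~ s/(\w+)/$price{$1} * 2/ge;

print "$plain\n";    # 3 * 2 4 * 2
print "$evald\n";    # 6 8
```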