in reply to Re: Faster replacement of sed commands..
in thread Faster replacement of sed commands..

Hi...it worked for me..thanks a lot..but could you do me a little more favor? not able to understand below part of your code..

my $subsRE = do { local $" = q{|}; qr{(?x) \b ( @{ [ keys %replacementLU ] } ) \b}; };

please help me understand this...appreciate your efforts...

Replies are listed 'Best First'.
Re^3: Faster replacement of sed commands
by Athanasius (Archbishop) on Aug 20, 2014 at 09:11 UTC

    The solution by johngg has the elegance that makes Perl such a great language. To understand it, let’s work backwards:

    $text =~ s{$subsRE}{$replacementLU{ $1 }}g;

    This is where all the actual substitutions take place. The first part, {$subsRE}, looks for keyword matches (see below), and the /g modifier keeps looking until no more matches can be found. For each match found, the keyword, referenced by $1, is used as a hash key in the lookup table %replacementLU, and the value corresponding to that key is used for the substitution. So, for example, ARP_VULNERABILITY is replaced by y3.

    OK, you knew all that, but where does $subsRE fit in? Let’s print it out to see what it looks like:

    (?^:(?x) \b ( CONTENT_FILTERING_PROFILE_ID|QUOTA_GRANTED|ARP_VULNERABI +LITY|NETWORK_IDENTIFIER|ARP_PRIORITY_LEVEL|DEFAULT_BEARER_ID|EVENT_RE +SULT|EVENT_ID|QOS_PROFILE_ID|ARP_CAPABILITY|SYSTEM_IDENTIFIER|TRACKIN +G_AREA_CODE|GX_RAR_RAA_TRANSACTION|SERVICE_AREA_CODE|RECORD_TYPE|RECO +RD_LENGTH|CHARGING_PROFILE_ID|QOS_ASSIGNED_TO_DEFAULT_BEARER|GX_CCR_C +CA_TRANSACTION|RULE_REMOVED|BEARER_CONTROL_MODE|ROUTING_AREA_CODE|RUL +E_INSTALLED|CAUSE_PROTOCOL|SUBSCRIBERID ) \b)

    As you can see, this says: match any one of the keywords provided it is preceded and followed by a word boundary (\b). The character | separating the keywords is the metacharacter for alternation; for example, A|B|C means: match either A, or B, or C. (See “Metacharacters” in perlre#Regular-Expressions.) Note the capturing parentheses: if any of the keywords is matched, it is captured into the next available capture variable (which in this case is $1).

    OK, so where did this monster $subsRE come from? It would be no fun constructing this by hand, so johngg harnessed Perl to do the work. Note that qr// is a the Perl regex quote operator: it converts a string into a regular expression (see perlop#Regexp-Quote-Like-Operators). (?x) is the /x modifier in a different form. The string argument to qr// is constructed by interpolating the keys of the hash %replacementLU into the string. But just saying this:

    qr{(?x) \b ( keys %replacementLU ) \b};

    wouldn’t work because Perl would think you want to match the literal characters keys %replacementLU. Perl will interpolate when it sees a $ (for a scalar) or an @ (for an array), so we need to give Perl a construct like this: @{ ... }. But that says, dereference (something) to get an array. So we need to convert keys %replacementLU (which returns a list) into an array reference, which we do by creating an anonymous array with square brackets. So

    @{ [ keys %replacementLU ] }

    is the Perlish idiom for interpolating the contents of the list returned by the keys function into the string.

    Now all we need is to separate the elements of the list with | (alternation) characters. Normally, when a list is interpolated, the elements are separated by spaces. But actually they’re separated by whatever is the contents of the special variable $", for which a space is the default. By changing it to |, we get the elements of the list separated by alternation characters, which gives us the regex we want.

    johngg could have just said:

    $" = q{|}; my $subsRE = qr{(?x) \b ( @{ [ keys %replacementLU ] } ) \b};

    but that would leave $" set to |, which might interfere with other parts of the script. It’s better practice to localise any temporary changes made to global variables. The syntax:

    my $subsRE = do { local $" = q{|}; qr{(?x) \b ( @{ [ keys %replacementLU ] } ) \b}; };

    uses local to limit the scope of the assignment, and takes advantage of the fact that the do { ... }; syntax (1) provides an enclosing scope for the local $" = q{|}; assignment; and (2) returns the value of its final statement, in this case the regex returned by qr//.

    As Hannibal Smith liked to say: “I love it when a plan comes together.” :-)

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Thank you very much for the explanation...appreciate your efforts. Script is working great for me I have updated the script a little as per my requirements.

      #!/usr/bin/perl use Data::Dumper; use File::Copy; use feature qw{ say }; my %replacementLU = ( QOS_PROFILE_ID => q{x1}, CHARGING_PROFILE_ID => q{x2}, CONTENT_FILTERING_PROFILE_ID => q{x3}, SUBSCRIBERID => q{x4}, RECORD_LENGTH => q{x5}, RECORD_TYPE => q{x6}, EVENT_ID => q{x7}, EVENT_RESULT => q{x8}, CAUSE_PROTOCOL => q{x9}, DEFAULT_BEARER_ID => q{x0}, ARP_PRIORITY_LEVEL => q{y1}, ARP_CAPABILITY => q{y2}, ARP_VULNERABILITY => q{y3}, BEARER_CONTROL_MODE => q{y4}, TRACKING_AREA_CODE => q{y5}, ROUTING_AREA_CODE => q{y7}, SERVICE_AREA_CODE => q{y8}, SYSTEM_IDENTIFIER => q{y9}, NETWORK_IDENTIFIER => q{y0}, GX_RAR_RAA_TRANSACTION => q{TRAR}, GX_CCR_CCA_TRANSACTION => q{TCCA}, QUOTA_GRANTED => q{TQG}, QOS_ASSIGNED_TO_DEFAULT_BEARER => q{TQA}, RULE_INSTALLED => q{TRI}, RULE_REMOVED => q{TRR}, ); my $subsRE = do { local $" = q{|}; qr{(?x) \b ( @{ [ keys %replacementLU ] } ) \b}; }; my $counter = 0; open (WFH1, ">", "counter.txt"); for( ; ; ) { my $fileexists= -e "text.out"; if ($fileexists ne "1") { `touch text.out`; foreach my $file (</data/admin/scripts/SapcmedadpebM/test/*csv>) { chomp; $abc1=`find $file -mmin +10`; chomp($abc1); if ($abc1 eq "") { next; } print " file $abc1 \n"; $dd=`date`; print "$dd\n"; `perl -i -pe 's/[^[:ascii:]]//g; tr/\015//d' $abc1`; print "junk character removed\n"; open (FH, "$abc1"); open (WFH, ">", "abc1.op"); while (<FH>) { $_ =~ s{$subsRE}{$replacementLU{ $1 }}g; print WFH $_; } #`sed -i 's/QOS_PROFILE_ID/x1/g;s/CHARGING_PROFILE_ID/x2/g;s/CONTENT_F +ILTERING_PROFILE_ID/x3/g;s/SUBSCRIBERID/x4/g;s/RECORD_LENGTH/x5/g;s/R +ECORD_TYPE/x6/g;s/EVENT_ID/x7/g;s/EVENT_RESULT/x8/g;s/CAUSE_PROTOCOL/ +x9/g;s/DEFAULT_BEARER_ID/x0/g;s/ARP_PRIORITY_LEVEL/y1/g;s/ARP_CAPABIL +ITY/y2/g;s/ARP_VULNERABILITY/y3/g;s/BEARER_CONTROL_MODE/y4/g;s/TRACKI +NG_AREA_CODE/y5/g;s/ROUTING_AREA_CODE/y7/g;s/SERVICE_AREA_CODE/y8/g;s +/SYSTEM_IDENTIFIER/y9/g;s/NETWORK_IDENTIFIER/y0/g' $abc1`; #`sed -i 's/GX_RAR_RAA_TRANSACTION/TRAR/g;s/GX_CCR_CCA_TRANSACTION/TCC +A/g;s/QUOTA_GRANTED/TQG/g;s/QOS_ASSIGNED_TO_DEFAULT_BEARER/TQA/g;s/RU +LE_INSTALLED/TRI/g;s/RULE_REMOVED/TRR/g' $abc1`; move("abc1.op","./$abc1"); $counter++; print WFH1 $counter; } unlink "text.out"; } else { print "Exiting \n"; exit; } sleep(100); }

      Now when I execute this script to process a 400Mb i/p file (which it the required size) after generating 300Mb of output data it sticks.. doesn't throw any error, or doesn't fail..it just stops generating output data. Then I tried with a 200Mb file and again it stuck at 160Mb.. without any error message I am not able to find the root cause.. could it be a memory issue? could you please suggest any thing which can help us get this issue resolved? Please let me know if you need more info...thnaks

        I think you should follow the advice given by McA above:

        ...you're using the backtick operator very often which creates a subprocess doing the shell command. This is expensive. You can do all the tasks directly in Perl reducing the amount of subprocess creations.

        In particular, this line:

        `perl -i -pe 's/[^[:ascii:]]//g; tr/\015//d' $abc1`;

        creates a new shell subprocess with its own copy of the Perl interpreter on every loop iteration! Look at the answers already given by aitap and pvaldes, below, for advice on how to re-write your code in pure Perl, without using the backtick operator. There is no point in looking at other optimisations until you’ve removed the overhead of all those unnecessary shell subprocesses.

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,