
Thanks for the reply. Now I understand how chomp works in my program above. I have rewritten (extended) my program in another way. As you suggested, I used /\s+/ instead of "\s".

  my $SMPP_count = int ((split (/\s+/,`cut -d "|" -f 1,10,13 SMSCDR*$date$hour$minute*.log |grep "Submit|GSM" |grep "$hour:$min:$sec" |sort |uniq -c`)) [1]) + int ((split (/\s+/,`cut -d "|" -f 1,10,13 SMSCDR*$date$hour$minute*.log |grep "Submit|SMPP" |grep "$hour:$min:$sec" |sort |uniq -c`)) [1]);

Here the array returned by split contains an empty ("NULL") element in the 0th position. If I write "(split ..., ...)[0]" it gives output like this:

  Use of uninitialized value in int at second.pl line 34.
  06:20:00= 0 0
  Argument "" isn't numeric in int at second.pl line 34.
  Argument "" isn't numeric in int at second.pl line 34.
  06:20:01= 0 0
  Argument "" isn't numeric in int at second.pl line 34.
  Argument "" isn't numeric in int at second.pl line 34.
  06:20:02= 0 0
  Argument "" isn't numeric in int at second.pl line 34.
  Argument "" isn't numeric in int at second.pl line 34.
  06:20:03= 0 0
  Argument "" isn't numeric in int at second.pl line 34.
  Argument "" isn't numeric in int at second.pl line 34.
  06:20:04= 0 0
  Argument "" isn't numeric in int at second.pl line 34.
  Argument "" isn't numeric in int at second.pl line 34.
  06:20:05= 0 0

Why is the first element in the array "NULL"?

Re^3: using system command in regex
by shmem (Chancellor) on Oct 13, 2015 at 12:34 UTC
    my $SMPP_count = int ((split (/\s+/,`cut -d "|" -f 1,10,13 SMSCDR*$date$hour$minute*.log |grep "Submit|GSM" |grep "$hour:$min:$sec" |sort |uniq -c`)) [1]) + int ((split (/\s+/,`cut -d "|" -f 1,10,13 SMSCDR*$date$hour$minute*.log |grep "Submit|SMPP" |grep "$hour:$min:$sec" |sort |uniq -c`)) [1]);

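    Short answer: since you are using uniq -c as the last filter in your pipelines, you are interested in the first field, the count. uniq -c pads this field with leading whitespace, and with an explicit /\s+/ pattern split returns an empty first element for that whitespace. From the documentation of split: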

    As another special case, "split" emulates the default behavior of the command line tool awk when the PATTERN is either omitted or a literal string composed of a single space character (such as ' ' or "\x20", but not e.g. "/ /"). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were "/\s+/"; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator.
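
    A minimal sketch of that difference, assuming a made-up line in the shape uniq -c prints (leading spaces, the count, then the text); the sample string is hypothetical, not taken from your logs:

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample of uniq -c output: leading spaces, count, then the line.
    my $line = "    158 Tue Oct 13 00:10:01 2015|Submit|GSM\n";

    # An explicit /\s+/ keeps the empty field produced by the leading whitespace.
    my @by_regex = split /\s+/, $line;

    # The special-case pattern ' ' strips leading whitespace before splitting.
    my @by_space = split ' ', $line;

    printf "regex: first='%s' second='%s'\n", $by_regex[0], $by_regex[1];
    printf "space: first='%s'\n", $by_space[0];
    # prints:
    # regex: first='' second='158'
    # space: first='158'
    ```

    So with /\s+/ the count sits at index 1, and with ' ' it sits at index 0.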

    Long answer: you are running

    • 1 x perl
    • 2 x /bin/sh (at each qx() or backtick ``)
    • 2 x cut
    • 4 x grep
    • 2 x sort
    • 2 x uniq
    which makes for 13 processes in total, and you are reading each file that matches SMSCDR*$date$hour$minute*.log twice - to get a sum which perl would happily give you in a less convoluted way in just 1 process.

    • cut -d "|" -f 1,10,13 would be (split /\|/, $_)[0,9,12] (the pipe must be escaped, because split treats its pattern as a regex)
    • grep and sort are perl builtins
    • use a hash (see perldata) and use its keys for uniqueness
    • you can use the builtin glob to expand SMSCDR*$date$hour$minute*.log into a list of filenames
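
    As a rough sketch of how those pieces replace the cut | grep | sort | uniq -c pipeline: the records below are invented placeholders with 13 pipe-separated fields, and make_record is a hypothetical helper used only to build them.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: build a 13-field record with only fields 1, 10, 13 filled in.
    sub make_record {
        my ($stamp, $type, $proto) = @_;
        my @f = ('-') x 13;
        @f[0, 9, 12] = ($stamp, $type, $proto);
        return join '|', @f;
    }

    my @lines = (
        make_record('00:10:01', 'Submit',  'GSM'),
        make_record('00:10:01', 'Submit',  'GSM'),
        make_record('00:10:01', 'Deliver', 'GSM'),
        make_record('00:10:02', 'Submit',  'SMPP'),
    );

    my %count;
    for my $line (@lines) {
        # Equivalent of: cut -d "|" -f 1,10,13
        my $key = join '|', (split /\|/, $line)[0, 9, 12];
        # Equivalent of: grep "Submit|"
        next unless $key =~ /\|Submit\|/;
        # Equivalent of: sort | uniq -c (the hash keys are the unique lines)
        $count{$key}++;
    }

    print "$count{$_} $_\n" for sort keys %count;
    # prints:
    # 2 00:10:01|Submit|GSM
    # 1 00:10:02|Submit|SMPP
    ```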

    From your code I am guessing that your log files contain a timestamp in the first field, Submit occurs in the 10th field, and you want lines which contain GSM or SMPP in the 13th field.
    Putting it all together, omitting unnecessary steps and not writing perl as if it were shell:

    @ARGV = glob "SMSCDR*$date$hour$minute*.log";
    my $SMPP_count = 0;
    my $stamp = "$hour:$min:$sec";
    while (<>) {
        chomp;
        # split takes a regex, so the pipe has to be escaped
        my @ary = (split /\|/)[0,9,12];
        $SMPP_count++
            if $ary[0] eq $stamp
            and $ary[1] eq "Submit"
            and $ary[2] =~ /(GSM|SMPP)/;
    }
    print $SMPP_count;
    update: corrected code
    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

      Thanks for sharing your knowledge, shmem. I put the code you wrote (with some needed modifications) into my program, but it is really dead slow.

       the code took:289 wallclock secs ( 6.43 usr  0.72 sys + 388.95 cusr 58.95 csys = 455.05 CPU)

      This result is for my original program (using grep & cut). The modified version, in the same 300 wallclock seconds, gets through only 36 seconds' worth of data. How long it would take to complete the full 10 minutes of data, I actually don't know :-). My modified code is:

      my $greatest = 0;
      my $total = 0;
      my @files = glob "SMSCDR*$date$hour$minute*.log";
      foreach my $min ($minute .. $minute+9) {
          foreach my $sec (@seconds) {
              # my $SMPP_count = int ((split (/\s+/,`cut -d "|" -f 1,10,13 SMSCDR*$date$hour$minute*.log |grep "Submit|GSM" |grep "$hour:$min:$sec" |sort |uniq -c`)) [1]) + int ((split (/\s+/,`cut -d "|" -f 1,10,13 SMSCDR*$date$hour$minute*.log |grep "Submit|SMPP" |grep "$hour:$min:$sec" |sort |uniq -c`)) [1]);
              my $SMPP_count;
              my $stamp = "$hour:$min:$sec";
              foreach my $file (@files) {
                  open (FILE,"$file");
                  while(<FILE>) {
                      chomp;
                      my @ary = (split /\s|\|/, $_) [3,21,24];
                      $SMPP_count++ if $ary[0] eq $stamp and $ary[1] eq "Submit" and $ary[2] =~ /(GSM|SMPP)/;
                  };
              }
              if ($SMPP_count > $greatest) {
                  $greatest = $SMPP_count;
              }
              $total = $total + $SMPP_count;
              print "$hour:$min:$sec","= $SMPP_count","\t",$total,$/;
          }
      }
      print $greatest,$/;
      my $t1 = Benchmark->new;
      my $td = timediff($t1, $t0);
      print "the code took:",timestr($td),"\n";

      Note: The result is the same for both programs. As you said, reading the files in Perl should be faster than grep & cut, but here it is not working out that way. I don't understand what I am missing.

      Update:

      Here glob returns three files: post_paid contains 8278 lines, prepaid contains 23072 lines, and delivery_file contains 80097 lines. To calculate the count for a single second, the program has to check 8278 + 23072 + 80097 = 111,447 lines. For the full 10 minutes (600 seconds) it therefore has to check 600 * 111,447 = 66,868,200 lines. Please show me a way to get rid of this.

        Of course it is dead slow. You are opening, reading and closing each file 10 * 60 = 600 times to get the sum for each second. You should read each file once and store the sums in a hash, keyed by the timestamp, like so:

        my %SMPP_count;
        foreach my $file (@files) {
            open (FILE, "$file") or die "Cannot open $file: $!";
            while(<FILE>) {
                next unless /\b(?:GSM|SMPP)\b/;   # avoid uninteresting lines
                chomp;
                my @ary = (split /\s|\|/, $_) [3,21,24];
                my $time = $ary[0];   # hh:mm:ss part of the leading timestamp
                # no comparison against a single $stamp here: every second gets its own counter
                $SMPP_count{$time}++ if $ary[1] eq "Submit" and $ary[2] =~ /(GSM|SMPP)/;
            };
            close FILE;
        }
        # now iterate over the keys of the hash to make up your sums
        for my $time ( sort keys %SMPP_count) {
            my $sum = $SMPP_count{$time};
            ...
        }
        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
        Thanks for sharing your knowledge shmem.

        Heh. My nick is programmatic. - Please provide sample input. What does your data look like? What are you trying to accomplish? What is your expected output? Just a sum? This whole thread is about an XY Problem, it would seem.

Re^3: using system command in regex
by AppleFritter (Vicar) on Oct 13, 2015 at 12:07 UTC

    Don't do everything in one huge unreadable line; break it up, and take a look at what the intermediate steps produce, and I'm sure the problem will become much clearer.

    FWIW if you're processing log files using cut, grep, sort, uniq etc., you can probably also read and process them in your Perl script instead. Doing that would also be more robust, more readable/maintainable, and (if log formats change) more future-proof.

      At first I started by reading the files and processing them in Perl. Here is why I chose this path instead:

       Tue Oct 13 00:00:01 2015|1008|959788857688|123580|Tue Oct 13 00:00:01 2015|Tue Oct 13 00:00:01 2015|CMT|Undelivered|none|Submit|0|SMSC_PR_LC_SMSC_InvalidDestAddress|GSM|INVALID|16 Bit|140|140|no||no|no||1/2|No|NO|no|no|0|1|0|0||959790000028||8|0||0|no|no|default_billing|-1|0|no|no|1|1|1|1|0|1|0|0|||||Wed Oct 14 00:00:01 2015|SR|||IV|011809614446710010008|||0|0||0|0||0||||123580||||||||||||||||||

      This is one line of my log. If you look closely at the zeroth element, Tue Oct 13 00:00:01 2015, it holds the date and time. I have one file like this for every ten minutes:

      SMSCDR_PREPAY_151013000000_10.84.0.29_AS.log (From 00:00:01 to 00:10:00)
      SMSCDR_PREPAY_151013100000_10.84.0.29_AS.log (From 00:10:01 to 00:20:00)
      SMSCDR_PREPAY_151013200000_10.84.0.29_AS.log
      SMSCDR_PREPAY_151013300000_10.84.0.29_AS.log
      SMSCDR_PREPAY_151013400000_10.84.0.29_AS.log
      SMSCDR_PREPAY_151013500000_10.84.0.29_AS.log (From 00:50:01 to 01:00:00)

      In the same way we have POSTPAID_CDR and DELIVERY_CDR files for every 10 minutes, named as shown above.

      Now my requirement is output like this:

      00:10:00= 0 0
      00:10:01= 158 158
      00:10:02= 163 321
      00:10:03= 214 535
      00:10:04= 123 658
      00:10:05= 174 832
      00:10:06= 271 1103
      00:10:07= 96 1199
      00:10:08= 263 1462
      00:10:09= 72 1534
      00:10:10= 190 1724

      The first element is the time. The number after "=" is the total number of entries logged in that particular second where the 10th field is Submit and the 13th field is GSM, across all prepay, postpaid and delivery CDRs. The last number is the running total of entries up to that second, counted from the start of the file.

      Finally, I need to know which second had the highest number of entries, and that count.

      For that I wrote code like this:

      my @low_sec = ("00","01","02","03","04","05","06","07","08","09");
      my @high_sec = (10 .. 59);
      my @seconds = (@low_sec,@high_sec);
      my $greatest = 0;
      my $total = 0;
      foreach my $min ($minute .. $minute+9) {
          foreach my $sec (@seconds) {
              my $SMPP_count = int ((split (/\s+/,`cut -d "|" -f 1,10,13 SMSCDR*$date$hour$minute*.log |grep "Submit|GSM" |grep "$hour:$min:$sec" |sort |uniq -c`)) [1]) + int ((split (/\s+/,`cut -d "|" -f 1,10,13 SMSCDR*$date$hour$minute*.log |grep "Submit|SMPP" |grep "$hour:$min:$sec" |sort |uniq -c`)) [1]);
              if ($SMPP_count > $greatest) {
                  $greatest = $SMPP_count;
              }
              $total = $total + $SMPP_count;
              print "$hour:$min:$sec","= $SMPP_count","\t",$total,$/;
          }
      }
      print $greatest,$/;

      Is there an easier method? Please guide me.
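
      One possible single-pass approach, sketched under assumptions rather than tested against the real CDR files: it assumes field 1 holds a timestamp like "Tue Oct 13 00:00:01 2015", field 10 is the message type and field 13 is GSM or SMPP, as in the sample line above. The names count_per_second, report and make_record are hypothetical, and the sample records are invented.

      ```perl
      use strict;
      use warnings;

      # Count matching records per hh:mm:ss in a single pass over the lines.
      sub count_per_second {
          my (@lines) = @_;
          my %count;
          for my $line (@lines) {
              my @f = split /\|/, $line;
              next unless @f >= 13;
              next unless $f[9] eq 'Submit' and $f[12] =~ /^(?:GSM|SMPP)$/;
              # Field 1 looks like "Tue Oct 13 00:00:01 2015"; take the time part.
              my $time = (split ' ', $f[0])[3];
              next unless defined $time;
              $count{$time}++;
          }
          return \%count;
      }

      # Build rows of [time, count, running total] and find the busiest second.
      sub report {
          my ($count, @times) = @_;
          my ($total, $best_time, $best) = (0, undef, 0);
          my @rows;
          for my $t (@times) {
              my $n = $count->{$t} // 0;
              $total += $n;
              ($best_time, $best) = ($t, $n) if $n > $best;
              push @rows, [$t, $n, $total];
          }
          return (\@rows, $best_time, $best);
      }

      # Hypothetical helper that builds a 13-field record for the demo.
      sub make_record {
          my ($time, $type, $proto) = @_;
          my @f = ('-') x 13;
          @f[0, 9, 12] = ("Tue Oct 13 $time 2015", $type, $proto);
          return join '|', @f;
      }

      my @lines = (
          make_record('00:10:01', 'Submit',  'GSM'),
          make_record('00:10:01', 'Submit',  'SMPP'),
          make_record('00:10:02', 'Submit',  'GSM'),
          make_record('00:10:02', 'Deliver', 'GSM'),
      );

      my $count = count_per_second(@lines);
      my @times = map { sprintf '00:10:%02d', $_ } 0 .. 2;
      my ($rows, $best_time, $best) = report($count, @times);

      print "$_->[0]= $_->[1]\t$_->[2]\n" for @$rows;
      print "busiest second: $best_time ($best)\n";
      ```

      With real data, the lines would come from reading each file returned by glob once, so each file is scanned a single time instead of once per second.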