in reply to Re^3: using system command in regex
in thread using system command in regex

Thanks for sharing your knowledge, shmem. I used the code you wrote (with some needed modifications) in my program, but it is really dead slow.

 the code took:289 wallclock secs ( 6.43 usr  0.72 sys + 388.95 cusr 58.95 csys = 455.05 CPU)

This result is for my original program (using grep & cut). With the modified code, after 300 wallclock seconds it had processed only 36 seconds' worth of data. How long it would take to complete the full 10 minutes of data, I actually don't know :-). My modified code is:

my $greatest = 0;
my $total = 0;
my @files = glob "SMSCDR*$date$hour$minute*.log";
foreach my $min ($minute .. $minute+9) {
    foreach my $sec (@seconds) {
        # my $SMPP_count =
        #     int ((split (/\s+/,`cut -d "|" -f 1,10,13 SMSCDR*$date$hour$minute*.log |grep "Submit|GSM" |grep "$hour:$min:$sec" |sort |uniq -c`)) [1])
        #   + int ((split (/\s+/,`cut -d "|" -f 1,10,13 SMSCDR*$date$hour$minute*.log |grep "Submit|SMPP" |grep "$hour:$min:$sec" |sort |uniq -c`)) [1]);
        my $SMPP_count;
        my $stamp = "$hour:$min:$sec";
        foreach my $file (@files) {
            open (FILE,"$file");
            while(<FILE>) {
                chomp;
                my @ary = (split /\s|\|/, $_) [3,21,24];
                $SMPP_count++ if $ary[0] eq $stamp and $ary[1] eq "Submit" and $ary[2] =~ /(GSM|SMPP)/;
            };
        }
        if ($SMPP_count > $greatest) {
            $greatest = $SMPP_count;
        }
        $total = $total + $SMPP_count;
        print "$hour:$min:$sec","= $SMPP_count","\t",$total,$/;
    }
}
print $greatest,$/;
my $t1 = Benchmark->new;
my $td = timediff($t1, $t0);
print "the code took:",timestr($td),"\n";

Note: The result is the same for both programs. As you said, reading the files directly should be faster than grep & cut, but here it is not working out that way. I don't understand what I am missing.

Update:

Here glob returns three files: post_paid contains 8,278 lines, prepaid contains 23,072 lines, and delivery_file contains 80,097 lines. To calculate the count for a single second, the program has to check all 111,447 lines. For the full 10 minutes (600 seconds), it has to check 600 * 111,447 = 66,868,200 lines. Please show me a way to avoid this.

Re^5: using system command in regex
by shmem (Chancellor) on Oct 15, 2015 at 07:51 UTC

    Of course it is dead slow. You are opening, reading and closing each file 10 * 60 = 600 times to get the sum for each second. You should read each file once and store the sums in a hash, keyed by the timestamp, like so:

    my %SMPP_count;
    foreach my $file (@files) {
        open (FILE,"$file");
        while(<FILE>) {
            next unless /\b(?:GSM|SMPP)\b/;   # avoid uninteresting lines
            chomp;
            my @ary = (split /\s|\|/, $_) [3,21,24];
            my $time = (split ' ', $ary[0])[4];   # first element is the timestamp, right?
            $SMPP_count{$time}++ if $ary[0] eq $stamp and $ary[1] eq "Submit" and $ary[2] =~ /(GSM|SMPP)/;
        };
    }
    # now iterate over the keys of the hash to make up your sums
    for my $time ( sort keys %SMPP_count) {
        my $sum = $SMPP_count{$time};
        ...
    }
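
    A self-contained sketch of the same single-pass idea, for anyone who wants to run it. Note that $stamp from the per-second loop no longer exists once each file is read only once, so the sketch keys the hash directly on the timestamp field. The record layout (time|action|protocol), the sample lines and the sub name count_submits_per_second are all made up for illustration; the real SMSCDR logs have more fields, so the split and the indices would need adjusting.

```perl
use strict;
use warnings;

# Hypothetical, simplified record layout: time|action|protocol.
# The real log format is an assumption here; adjust the field handling to match.
sub count_submits_per_second {
    my ($fh) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $time, $action, $proto ) = split /\|/, $line;
        next unless defined $proto;    # skip short or empty lines
        next unless $action eq 'Submit' && $proto =~ /^(?:GSM|SMPP)$/;
        $count{$time}++;               # one pass; the timestamp is the hash key
    }
    return \%count;
}

# Sample data read through an in-memory filehandle instead of a log file.
my $sample = join "\n",
    '00:00:01|Submit|GSM',
    '00:00:01|Submit|SMPP',
    '00:00:01|Deliver|GSM',
    '00:00:02|Submit|SMPP',
    '';
open my $fh, '<', \$sample or die "cannot open in-memory data: $!";
my $count = count_submits_per_second($fh);
close $fh;

# Walk the per-second sums once to get the peak and the running total.
my ( $greatest, $total ) = ( 0, 0 );
for my $time ( sort keys %$count ) {
    my $n = $count->{$time};
    $greatest = $n if $n > $greatest;
    $total += $n;
    print "$time = $n\t$total\n";
}
print "$greatest\n";
```

    Seconds with no matching line never become hash keys, so if every second of the 10 minutes has to be printed, loop over the expected timestamps and read each one as $count->{$stamp} // 0 (Perl 5.10 or later).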
    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

      What an idea, programmatic.... I had been struggling with this for nearly three days. Finally I solved it. It is working great. Thank you very much. I want to ask you one more thing. As I already mentioned, I have three files (Postpaid, Prepaid, Delivery) for every 10 minutes. After calculating, I upload all the values into the DB like this:

      Date        Hour   MO_resp  MT_resp  AO_resp  Percentage
      10-08-2015  00:00  256      382      36       87%
      10-08-2015  00:10  491      438      12       92%

      (It's a sample. Actually my DB contains 38 columns.) I upload all the values like this every 10 minutes. Now my requirement is to add up all MO_resp, all AO_resp, and so on (all columns) for everything that occurred in hour 00, to write an hourly report to an Excel sheet. For that I am doing:

      use DBI;
      my $hour_db = DBI->connect("DBI:mysql:database=$db;host=$host;mysql_socket=/opt/lampstack-5.5.27-0/mysql/tmp/mysql.sock","root","", {'RaiseError' => 1});
      my @column_names = ("MO_resp","MT_resp","AO_resp");
      foreach my $column_name (@column_names) {
          my $hour_sth = $hour_db->prepare("Select sum($column_name) from $table_name where Date='$db_date' and Hour like '$hour:%'");
          $hour_sth->execute() or die $DBI::errstr;
          .....
      }

      I am reading each column sum one by one like this, but I feel this is not a good method. Can you show me a way?

        First, I have to second marto in that you should always use placeholders; third (since second is already used), you could aggregate all your columns into one call:

        my @column_names = ("MO_resp","MT_resp","AO_resp");
        my $sql = "select ".join ",", map { "sum($_)" } @column_names;
        # a table name cannot be bound as a placeholder, so it is interpolated;
        # only the values for Date and Hour are bound
        $sql .= " from $table_name where Date = ? and Hour like ?";
        my $hour_sth = $hour_db->prepare( $sql );
        $hour_sth->execute($db_date, "$hour:%") or die $DBI::errstr;

        See join and map.
        If you are iterating over $db_date and $hour, you should keep the call to $hour_db->prepare outside those loops, so that the statement is prepared once and only the bound values change on each execute.
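
        A small runnable sketch of the join/map step on its own, with no database needed. The sub name build_hourly_sum_sql, the table name sms_stats and the @hours list in the comment are made up for illustration. The table name is interpolated because identifiers cannot be bound as placeholders, so it has to come from trusted code rather than user input.

```perl
use strict;
use warnings;

# Build one SELECT that sums every listed column.
# $table is interpolated (identifiers cannot be bound as placeholders),
# so it must come from trusted code, never from user input.
sub build_hourly_sum_sql {
    my ( $table, @columns ) = @_;
    my $sums = join ', ', map { "SUM($_) AS $_" } @columns;
    return "SELECT $sums FROM $table WHERE Date = ? AND Hour LIKE ?";
}

my @column_names = qw(MO_resp MT_resp AO_resp);
my $sql = build_hourly_sum_sql( 'sms_stats', @column_names );    # 'sms_stats' is a made-up table name
print "$sql\n";

# With a connected handle, prepare once and execute per date/hour
# (@hours is a hypothetical list of two-digit hour strings):
#   my $sth = $hour_db->prepare($sql);
#   for my $hour (@hours) {
#       $sth->execute( $db_date, "$hour:%" );
#       my @sums = $sth->fetchrow_array;    # one value per column, in @column_names order
#   }
```

        With all 38 columns in @column_names, this still issues a single query per hour instead of 38.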

        plus you seem to have misunderstood the first sentence of this. "programmatic" is a quality of his nick, see e.g. there :-)
Re^5: using system command in regex
by shmem (Chancellor) on Oct 14, 2015 at 17:48 UTC
    Thanks for sharing your knowledge shmem.

    Heh. My nick is programmatic. - Please provide sample input. What does your data look like? What are you trying to accomplish? What is your expected output? Just a sum? This whole thread is about an XY Problem, it would seem.


      shmem :-) Programmatic, I already posted that in this thread. Please check it here: 1144697