How to resolve this memory issue

by Ankur_kuls (Sexton)
on Sep 10, 2014 at 06:12 UTC

Ankur_kuls has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl script which first fetches one field and its corresponding value from a 1 GB input file (around 150,000 lines) and stores them in one hash. After processing this file, it takes another 16 GB input file (around 30 million lines), reads it line by line, compares each line with the hash keys, and prints the corresponding hash value (along with a few other required fields) to the output file, creating a 2 GB output file. Now I have been asked to add 4 new fields to the report, which requires 4 more hashes in the script. But when I added them and tested the script, it caused a severe memory issue on the server. Current memory details on my Linux server:

    # free
                 total       used       free     shared    buffers     cached
    Mem:       8158176    8124068      34108          0      13248    7928420
    -/+ buffers/cache:     182400    7975776
    Swap:      2104472     103976    2000496

I need to know how I can resolve this memory issue, and if needed, how much more RAM I should add to my server. I know the information I have provided may be vague, but please let me know what else you need (shall I paste the complete script here?). Please help.

Replies are listed 'Best First'.
Re: How to resolve this memory issue
by BrowserUk (Patriarch) on Sep 10, 2014 at 07:31 UTC

    Three possible approaches:

    1. Buy another 8GB DIMM for your server.

      At ~£70/$100, this is quick, simple and very cheap.

      Datasets only ever seem to get bigger, so this would somewhat future proof you.

    2. Get cleverer about the way you build your indexes. (You currently use hashes for this.)

      Depending upon your data, there may be less memory intensive ways of building your indexes.

      A few (real or realistic) examples of the data, showing the datatype (string/real/integer) of the keys and the size and nature of the values, would be far more useful than your script, which I think you've adequately described.

      Should result in equally fast (or possibly faster) processing; but requires 'cooperative data', so may not be applicable, and requires some rework of your script, though the basic structure would remain the same (see the sketch after this list).

    3. Process each of your fields in separate passes of the script to produce intermediate output files, and then use a final pass over those intermediate files to merge them.

      Slow. Quite a lot of work.
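
    For instance, one way to shrink the index (a sketch only; the sample values come from the records posted below, and it assumes plain string keys): replace a two-level hash with a single flat hash using a composite key, which pays the overhead of one hash total rather than one sub-hash per subscriber.

        use strict;
        use warnings;

        my %AccUsg;    # one flat hash instead of a hash-of-hashes

        # While indexing the smaller file ($MobileNumber, $groupName and
        # $bidirVolume stand for values parsed per record, as in the script):
        my ( $MobileNumber, $groupName, $bidirVolume )
            = ( '918970692483', 'KA_FUP_2G80K_30', 3022011787 );
        $AccUsg{"$MobileNumber\0$groupName"} = $bidirVolume;

        # While scanning the big file, look up with the same composite key:
        my ( $msisdn, $planname ) = ( '918970692483', 'KA_FUP_2G80K_30' );
        if ( exists $AccUsg{"$msisdn\0$planname"} ) {
            my $volume = $AccUsg{"$msisdn\0$planname"};
            print "$msisdn $planname $volume\n";
        }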

    A fourth approach would be to use a database, but I'll let others tell you about that.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How to resolve this memory issue
by CountZero (Bishop) on Sep 10, 2014 at 08:30 UTC
    Did you consider using a database? The info you provide is somewhat sketchy, but it looks rather straightforward to transform each of your input files into a database table and then write an SQL query to extract the data. Adding a few more fields to your output would be no more work than editing your SQL query and adding those fields to the SELECT clause.
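
    A minimal sketch of that idea, assuming SQLite via DBI; the table and column names (acc_usage loaded from the 1 GB file, subscribers from the 16 GB file) are invented here for illustration:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=report.db', '', '',
            { RaiseError => 1 } );

        # The JOIN replaces the in-memory hash lookup; adding report fields
        # later only means listing more columns in the SELECT.
        my $sth = $dbh->prepare(q{
            SELECT s.msisdn, s.plan, s.priority, a.bidir_volume
            FROM   subscribers s
            LEFT JOIN acc_usage a ON a.msisdn = s.msisdn AND a.plan = s.plan
            ORDER BY s.msisdn, s.priority DESC
        });
        $sth->execute;
        while ( my @row = $sth->fetchrow_array ) {
            print join( ',', map { $_ // '' } @row ), "\n";
        }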

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: How to resolve this memory issue
by shmem (Chancellor) on Sep 10, 2014 at 10:17 UTC
    ...input file and stores into one hash.

    A quick approach to store a single hash on disk is DB_File which interfaces Berkeley DB. For more complex data structures, there's DBM::Deep. You will need disk space.
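
    For example, a minimal sketch of the DB_File approach (the file name is arbitrary; note that DB_File stores flat string keys and values, so a nested hash needs a composite key, or DBM::Deep instead):

        use strict;
        use warnings;
        use DB_File;
        use Fcntl;    # O_RDWR, O_CREAT

        # Tie the hash to a Berkeley DB file; entries live on disk, not in RAM.
        tie my %AccUsg, 'DB_File', 'accusg.db', O_RDWR | O_CREAT, 0666, $DB_HASH
            or die "Cannot tie accusg.db: $!";

        $AccUsg{'918970692483;KA_FUP_2G80K_30'} = 3022011787;   # "msisdn;plan" key
        print $AccUsg{'918970692483;KA_FUP_2G80K_30'}, "\n";

        untie %AccUsg;    # flush and close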

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
      I'll second these suggestions. Note that any file solution will increase the execution time.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: How to resolve this memory issue
by Hameed (Acolyte) on Sep 10, 2014 at 07:41 UTC
Re: How to resolve this memory issue
by dsheroh (Monsignor) on Sep 10, 2014 at 13:40 UTC
    This is a common misunderstanding of the output of free. The memory statistics you showed do not indicate excessive memory use. On the contrary, they show only about 2.3% of system memory in active use by programs.

    If you look more closely at the first line of numbers, you will see that 7928420 of the 8124068 kB in use are being used to cache data. This memory is, for all intents and purposes, free. Your operating system is using it to improve system performance, but it can throw the cached data out and reallocate the memory for other uses more-or-less instantly if the memory is needed elsewhere.

    To see whether you actually have a memory shortage, you need to look at the second line of numbers, which adjusts the totals by treating buffers and cache as free memory rather than used (because, again, that memory can be reassigned instantly if needed). Looking at that line, you will see that only 182400 kB are in use for "real" data, leaving 7975776 kB available.
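
    If you want to compute that figure yourself, here is a small sketch (assuming a Linux /proc filesystem; it sums the same fields free uses for its -/+ buffers/cache line):

        use strict;
        use warnings;

        # MemFree + Buffers + Cached approximates the memory that is
        # effectively available to new programs.
        open my $mi, '<', '/proc/meminfo' or die "open /proc/meminfo: $!";
        my %kb;
        while (<$mi>) {
            $kb{$1} = $2 if /^(\w+):\s+(\d+)\s+kB/;
        }
        close $mi;
        printf "effectively available: %d kB\n",
            $kb{MemFree} + $kb{Buffers} + $kb{Cached};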

      One possible thing to consider, though, is that if this command was executed on a server that was doing many other things at the same time, we don't necessarily know who is using those buffers, nor what distribution of buffer use is needed, on the system as a whole, to make everything run efficiently. When the OP said that the activity had a serious impact, I believe him. Stealing a buffer is not a no-cost operation, because it means you must do another physical disk read to re-obtain whatever it once held.

      Even though I'm just on my first cup of coffee this morning, I also have a healthy skepticism about these numbers. They seem much too small, given the reports of what this program is supposed to be doing and how much memory it ought to be taking up. You really need to profile this thing, to see what it's waiting on: how much disk I/O, how many page faults, and so on; how much virtual memory it wants vs. what its resident set size is, under actual load conditions.

      It could well be that a database-oriented solution (which can basically do a job like this with a JOIN) would be better, for two reasons. First, a "memory oriented" solution might well still wind up doing an equivalent amount of disk I/O, especially if several other big programs are running elsewhere. Second, with two tables and a JOIN, this requirement would probably be completed by now. After all, of this we can be sure: soon there will be four more fields to add; then Marketing will ask for another six. So it goes. This approach seems to be leading quite rapidly into a corner, and is rapidly losing its "shine."

Re: How to resolve this memory issue
by Discipulus (Canon) on Sep 10, 2014 at 07:41 UTC
    Hello,
    Vague question, vague answer...
    Search the Monastery for threads about processing large files, and try using iterators (see the sketch below).
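
    A minimal iterator sketch (make_reader, its record format, and the file name are invented for illustration): a closure hands back one line at a time, so only the current record is ever held in memory.

        use strict;
        use warnings;

        # Build an iterator over a file: each call returns the next line,
        # or undef at end-of-file.
        sub make_reader {
            my ($path) = @_;
            open my $fh, '<', $path or die "open $path: $!";
            return sub {
                defined( my $line = <$fh> ) or return;
                chomp $line;
                return $line;
            };
        }

        my $next_line = make_reader('input.txt');
        while ( defined( my $line = $next_line->() ) ) {
            # parse and process one record here
        }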

    I'm not the best one around here, but I think Perl does not release memory back to the OS (I think it depends on the OS); however, when you free memory used by the current Perl instance, it becomes available again to that program itself.
    So be sure to control the scope of your buffers, because when something goes out of scope its memory is freed; this applies to file handles especially.

    If you can post the minimum code that produces the same memory consumption, I'm sure you'll find some good advice.

    HtH
    L*
    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      Hi, thanks for the reply. Below is the minimum code. In this part, each record of the 1 GB input file (filehandle AH) is processed so that it takes the form of a hash reference; the values of the 'bidirVolume' field are then stored in the hash $AccUsg.

      my $AccUsg;
      while (<AH>) {
          chomp;
          my $line = $_;
          $AccuCount++;

          # Extract the subscriber's mobile number.
          my $MobileNumber;
          if ( $line =~ /subscriberId:(\w+)\(\"(\d+)\"\)/ ) {
              $MobileNumber = $2;
          }

          my $plan = $line;
          $plan =~ s/\\//g;
          my @AccVolume;
          if ( $plan =~ /usageControlAccum:(\w+)\(\"(.*)\"\)/ ) {
              # Rewrite the embedded JSON-like text into Perl syntax, then
              # eval it into a data structure.
              my $p = $2;
              $p =~ s/:\{/ => {/g;
              $p =~ s/:\[/ => [/g;
              $p =~ s/\"/\'/g;
              $p =~ s/\':/\'=>/g;
              $p =~ s/\}n/\}/g;
              #print $p,"\n";
              my $e = eval($p);
              if ($@) {    # eval error check
                  push( @AccVolume, "error" );
              }
              else {
                  #print Dumper($e);
                  foreach my $value ( @{ $e->{'reportingGroups'} } ) {
                      if ( exists( $value->{'absoluteAccumulated'}->{'counters'} ) ) {
                          $AccUsg->{$MobileNumber}->{ $value->{'subscriberGroupName'} }
                              = $value->{'absoluteAccumulated'}->{'counters'}->[0]->{'bidirVolume'};
                      }
                      elsif ( exists( $value->{'absoluteAccumulated'}->{'bidirVolume'} ) ) {
                          $AccUsg->{$MobileNumber}->{ $value->{'subscriberGroupName'} }
                              = $value->{'absoluteAccumulated'}->{'bidirVolume'};
                      }
                  }
              }
          }
      }
      close(AH);

      Now the second 16 GB input file (FH) is read; each record's plan name is matched against the $AccUsg hash, and the matched value is printed to the output file.

      while (<FH>) {
          chomp;
          my $line = $_;
          $SubsCount++;
          my ( $msisdn, $IMEI, $Circle, $DeviceType, $OPTIN, $PlanType, $familyId, $trafficIds );

          if ( $line =~ /userId:S(\d+)\(\"(\w+)\"\)/ ) {
              $msisdn = $2;
          }

          # e.g. ix0:S13("OptInState:3G")ix1:S11("CircleId:MH")ix2:S8("DevType:")ix3:S9("imei:NULL")ix4:S16("PlanType:prepaid")
          if ( $line =~ /operatorInfo:A(\d+)\[(.*?)\]/ ) {
              my $opcinfo = $2;
              #print $opcinfo,"\n";
              if ( $opcinfo =~ /ix(\d+):S(\d+)\(\"imei:(\w*)\"\)/ )       { $IMEI  = $3; }
              if ( $opcinfo =~ /ix(\d+):S(\d+)\(\"OptInState:(\w*)\"\)/ ) { $OPTIN = $3; }
              if ( $opcinfo =~ /ix(\d+):S(\d+)\(\"CircleId:(\w*)\"\)/ ) {
                  $Circle = $3;
                  $Circle = $lookup->{$Circle} if exists $lookup->{$Circle};
              }
              if ( $opcinfo =~ /ix(\d+):S(\d+)\(\"DevType:(\w*)\"\)/ )  { $DeviceType = $3; }
              if ( $opcinfo =~ /ix(\d+):S(\d+)\(\"PlanType:(\w*)\"\)/ ) { $PlanType   = $3; }
          }

          # e.g. groups:A1[ix0:S10("3BASIC:100")]
          my @ValidPlan;
          if ( $line =~ /groups:A(\d+)\[(.*?)\]/ ) {
              my $plans    = $2;
              my @AllPlans = split( 'ix\d+:S\d+\("', $plans );
              #print Dumper($AccUsg->{$msisdn});
              foreach my $p (@AllPlans) {
                  $p =~ s/\"\)//g;
                  next if $p eq "";
                  #my @planname=split(":",$p);
                  if ( $p =~ /(\w+):(\d+)[:]?(.*)/ ) {
                      my $planname = $1;
                      my $priority = $2;
                      my $expdate  = $3;
                      $expdate =~ s/,/\;/g;
                      if ( exists( $AccUsg->{$msisdn}->{$planname} ) ) {
                          if ( $expdate eq "" ) {
                              push( @ValidPlan, "$planname;$priority;;;$AccUsg->{$msisdn}->{$planname}" );
                          }
                          elsif ( length($expdate) > 19 ) {
                              push( @ValidPlan, "$planname;$priority;$expdate;$AccUsg->{$msisdn}->{$planname}" );
                          }
                          else {
                              push( @ValidPlan, "$planname;$priority;$expdate;;$AccUsg->{$msisdn}->{$planname}" );
                          }
                      }
                      else {
                          if ( $expdate eq "" ) {
                              push( @ValidPlan, "$planname;$priority;;;" );
                          }
                          elsif ( length($expdate) > 19 ) {
                              push( @ValidPlan, "$planname;$priority;$expdate;" );
                          }
                          else {
                              push( @ValidPlan, "$planname;$priority;$expdate;;" );
                          }
                      }
                  }
              }
          }

          if ( $line =~ /familyId:S(\d+)\(\"(.*?)\"\)/ ) { $familyId   = $2; }
          if ( $line =~ /trafficIds:A(\d+)\[(.*?)\]/ )   { $trafficIds = $2; }

          # Now sorting the plans priority-wise, highest first.
          #my $printPlan=join("|",@ValidPlan);
          my @validPlanSorted =
              sort { ( $b =~ /(.*?);(\d+);(.*?)/ )[1] <=> ( $a =~ /(.*?);(\d+);(.*?)/ )[1] } @ValidPlan;
          my $printPlan = join( "|", @validPlanSorted );
          $finalCount++;
          #print OUT "$msisdn;$IMEI;$Circle;$DeviceType;$OPTIN;$PlanType;$familyId;$trafficIds;$printPlan\n";
          print OUT "$msisdn,$IMEI,$Circle,$DeviceType,$OPTIN,$PlanType,$printPlan\n";
      }
      close(FH);

      One record of AH looks like this:

      P[containerVrsn:U(0)recordVrsn:U(0)size:U(560)ownGid:G[mdp:U(1118178935)seqNo:U(55)]logicalDbNo:U(1)classVrsn:U(1)timeStamp:U(0)dbRecord:T[classNo:U(1091971)size:U(532)updateVersion:U(1157)checksum:U(1318378905)EPC_UsageControlAccumulatedPot:R[subscriberId:S12("918970692483")usageControlAccum:S543("{\"reportingGroups\":[{\"absoluteAccumulated\":{\"counters\":[{\"bidirVolume\":3022011787,\"name\":\"base\"}],\"expiryDate\":{\"volume\":\"25-05-2013T00:00:00\"},\"previousExpiryDate\":{\"time\":\"25-04-2013T00:00:00\",\"volume\":\"25-04-2013T00:00:00\"},\"reportingLevel\":\"totalTraffic\",\"resetPeriod\":{\"volume\":\"30 days\"}},\"name\":\"110\",\"restartInfo\":\"25-04-2013T00:00:00\",\"selected\":\"no\",\"subscriberGroupName\":\"KA_FUP_2G80K_30\",\"subscriptionDate\":\"25-04-2013T00:00:00\",\"validityTime\":1800}],\"version\":\"2.0\"}")]]]

      One record of FH looks like this:

      P[containerVrsn:U(0)recordVrsn:U(0)size:U(276)ownGid:G[mdp:U(1090171511)seqNo:U(28)]logicalDbNo:U(1)classVrsn:U(1)timeStamp:U(0)dbRecord:T[classNo:U(1064620)size:U(248)updateVersion:U(5)checksum:U(928324968)EPC_SubscriberPot:R[userId:S12("919902995746")groups:A1[ix0:S12("KA_BASIC:100")]services:A0[]blacklist_services:A0[]operatorInfo:A5[ix0:S21("HomeMNC1:DefaultValue")ix1:S11("CircleId:KA")ix2:S13("OptInState:2G")ix3:S9("imei:NULL")ix4:S16("PlanType:prepaid")]pccSubscriberPotRef:M0[]notificationData:A0[]familyId:S0("")trafficIds:A0[]]]]

      And one record of the output file (the first line shows the field layout):

      MSISDN,IMEI,Circle,DeviceType,OPTIN,PlanType,PACKID1;priority;startdate;enddate;AccumulateUsage|PACKID2;priority;startdate;enddate;AccumulateUsage|PACKID3;priority;startdate;enddate;AccumulateUsage|PACKID4;priority;startdate;enddate;AccumulateUsage|
      919164032638,NULL,KA,,2G,prepaid,KA_BASIC;100;;;|KA_FUP_2G80K_30;90;22-08-2014T00:00:00;20-09-2014T23:59:59;7032609129

      Now I need to add four more fields from the AH file (counter, expiry_date, prev_expirydate_volume & prev_expirydate_time). For this I would need to add more hashes like $AccUsg, and those are what cause the memory issue. So how much more memory will it need? I know the code is badly written, but that's how it has been running in production for a few years now, and I can't help it :) ... thanks a lot.
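
      For what it's worth, one way to avoid four extra hashes (a sketch only, meant to slot inside the foreach my $value loop in the AH code above; the exact paths into $value are assumptions based on the sample AH record, and the // operator needs Perl 5.10+) is to pack all five values into the one existing $AccUsg entry:

          # Store "volume;counter;expiry;prev_vol;prev_time" as a single
          # string instead of keeping five parallel hashes.
          my $acc = $value->{'absoluteAccumulated'};
          $AccUsg->{$MobileNumber}->{ $value->{'subscriberGroupName'} } = join ';',
              $acc->{'counters'}->[0]->{'bidirVolume'} // $acc->{'bidirVolume'} // '',
              $acc->{'counters'}->[0]->{'name'}        // '',   # counter
              $acc->{'expiryDate'}->{'volume'}         // '',   # expiry_date
              $acc->{'previousExpiryDate'}->{'volume'} // '',   # prev_expirydate_volume
              $acc->{'previousExpiryDate'}->{'time'}   // '';   # prev_expirydate_time

          # In the FH loop, split the packed value back apart:
          # my ($volume, $counter, $expiry, $prev_vol, $prev_time)
          #     = split /;/, $AccUsg->{$msisdn}->{$planname}, 5;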

Re: How to resolve this memory issue
by pme (Monsignor) on Sep 10, 2014 at 14:29 UTC
    You can move your hashes from memory to disk. See 'man perltie'.
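
    For instance, with the core SDBM_File module (a minimal sketch; the file name and sample key come from the records above, and note SDBM limits each key+value pair to roughly 1 kB, so the DB_File approach above scales better for large values):

        use strict;
        use warnings;
        use Fcntl;        # O_RDWR, O_CREAT
        use SDBM_File;

        # The tied hash lives in plans.pag/plans.dir on disk, not in RAM.
        tie my %plans, 'SDBM_File', 'plans', O_RDWR | O_CREAT, 0666
            or die "tie failed: $!";

        $plans{'919902995746;KA_BASIC'} = '100;;;';
        print $plans{'919902995746;KA_BASIC'}, "\n";

        untie %plans;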
