How to resolve this memory issue

by Ankur_kuls (Sexton)
on Sep 10, 2014 at 06:12 UTC

Ankur_kuls has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl script which first fetches one field and its corresponding value from a 1 GB input file (around 150,000 lines) and stores them in one hash. After processing this file, it takes another 16 GB input file (around 30 million lines), reads it line by line, compares each line with the hash keys, and prints the corresponding hash value (along with a few other required fields) to the output file, creating a 2 GB output file. Now I have been asked to add 4 new fields to the report, which requires 4 more hashes in the script. But when I added them and tested the script, it caused a severe memory issue on the server. Current memory details on my Linux server:

    # free
                 total       used       free     shared    buffers     cached
    Mem:       8158176    8124068      34108          0      13248    7928420
    -/+ buffers/cache:     182400    7975776
    Swap:      2104472     103976    2000496

I need to know how I can resolve this memory issue, and if needed, how much more RAM I should add to my server. I know the information I have provided may be vague, but please let me know what else you need (shall I paste the complete script here?). Please help.

Replies are listed 'Best First'.
Re: How to resolve this memory issue
by BrowserUk (Patriarch) on Sep 10, 2014 at 07:31 UTC

    Three possible approaches:

    1. Buy another 8GB DIMM for your server.

      At ~£70/$100, this is quick, simple and very cheap.

      Datasets only ever seem to get bigger, so this would somewhat future proof you.

    2. Get cleverer about the way you build your indexes. (You currently use hashes for this.)

      Depending upon your data, there may be less memory intensive ways of building your indexes.

      A few (real or realistic) examples of the data, showing the datatype (string/real/integer) of the keys and the size and nature of the values, would be far more useful than your script, which I think you've adequately described.

      Should result in equally fast (or possibly faster) processing; but requires 'cooperative data', so may not be applicable, and requires some rework of your script, though the basic structure would remain the same (see the sketch after this list).

    3. Process each of your fields in separate passes of the script to produce intermediate output files, and then use a final pass over those intermediate files to merge them.

      Slow. Quite a lot of work.
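
    For instance, one way to shrink the index (a sketch only; the sample values come from the records posted below, and it assumes plain string keys): replace a two-level hash with a single flat hash using a composite key, which pays the overhead of one hash total rather than one sub-hash per subscriber.

        use strict;
        use warnings;

        my %AccUsg;    # one flat hash instead of a hash-of-hashes

        # While indexing the smaller file ($MobileNumber, $groupName and
        # $bidirVolume stand for values parsed per record, as in the script):
        my ( $MobileNumber, $groupName, $bidirVolume )
            = ( '918970692483', 'KA_FUP_2G80K_30', 3022011787 );
        $AccUsg{"$MobileNumber\0$groupName"} = $bidirVolume;

        # While scanning the big file, look up with the same composite key:
        my ( $msisdn, $planname ) = ( '918970692483', 'KA_FUP_2G80K_30' );
        if ( exists $AccUsg{"$msisdn\0$planname"} ) {
            my $volume = $AccUsg{"$msisdn\0$planname"};
            print "$msisdn $planname $volume\n";
        }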

    A fourth approach would be to use a database, but I'll let others tell you about that.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How to resolve this memory issue
by CountZero (Bishop) on Sep 10, 2014 at 08:30 UTC
    Did you consider using a database? The info you provide is somewhat sketchy, but it looks rather straightforward to transform each of your input files into a database table and then write an SQL query to extract the data. Adding a few more fields to your output would be no more work than editing your SQL query and adding those fields to the SELECT clause.
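
    A minimal sketch of that idea, assuming SQLite via DBI; the table and column names (acc_usage loaded from the 1 GB file, subscribers from the 16 GB file) are invented here for illustration:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=report.db', '', '',
            { RaiseError => 1 } );

        # The JOIN replaces the in-memory hash lookup; adding report fields
        # later only means listing more columns in the SELECT.
        my $sth = $dbh->prepare(q{
            SELECT s.msisdn, s.plan, s.priority, a.bidir_volume
            FROM   subscribers s
            LEFT JOIN acc_usage a ON a.msisdn = s.msisdn AND a.plan = s.plan
            ORDER BY s.msisdn, s.priority DESC
        });
        $sth->execute;
        while ( my @row = $sth->fetchrow_array ) {
            print join( ',', map { $_ // '' } @row ), "\n";
        }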

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: How to resolve this memory issue
by shmem (Chancellor) on Sep 10, 2014 at 10:17 UTC
    ...input file and stores into one hash.

    A quick approach to store a single hash on disk is DB_File which interfaces Berkeley DB. For more complex data structures, there's DBM::Deep. You will need disk space.
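
    For example, a minimal sketch of the DB_File approach (the file name is arbitrary; note that DB_File stores flat string keys and values, so a nested hash needs a composite key, or DBM::Deep instead):

        use strict;
        use warnings;
        use DB_File;
        use Fcntl;    # O_RDWR, O_CREAT

        # Tie the hash to a Berkeley DB file; entries live on disk, not in RAM.
        tie my %AccUsg, 'DB_File', 'accusg.db', O_RDWR | O_CREAT, 0666, $DB_HASH
            or die "Cannot tie accusg.db: $!";

        $AccUsg{'918970692483;KA_FUP_2G80K_30'} = 3022011787;   # "msisdn;plan" key
        print $AccUsg{'918970692483;KA_FUP_2G80K_30'}, "\n";

        untie %AccUsg;    # flush and close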

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
      I'll second these suggestions. Note that any file solution will increase the execution time.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: How to resolve this memory issue
by Hameed (Acolyte) on Sep 10, 2014 at 07:41 UTC
Re: How to resolve this memory issue
by dsheroh (Monsignor) on Sep 10, 2014 at 13:40 UTC
    This is a common misunderstanding of the output of free. The memory statistics you showed do not indicate excessive memory use. On the contrary, they show only about 2.3% of system memory in active use by programs.

    If you look more closely at the first line of numbers, you will see that 7928420 of the 8124068 kB in use are being used to cache data. This memory is, for all intents and purposes, free. Your operating system is using it to improve system performance, but it can throw the cached data out and reallocate the memory for other uses more-or-less instantly if the memory is needed elsewhere.

    To see whether you actually have a memory shortage, you need to look at the second line of numbers, which adjusts the totals by treating buffers and cache as free memory rather than used (because, again, that memory can be reassigned instantly if needed). Looking at that line, you will see that only 182400 kB are in use for "real" data, leaving 7975776 kB available.
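
    If you want to compute that figure yourself, here is a small sketch (assuming a Linux /proc filesystem; it sums the same fields free uses for its -/+ buffers/cache line):

        use strict;
        use warnings;

        # MemFree + Buffers + Cached approximates the memory that is
        # effectively available to new programs.
        open my $mi, '<', '/proc/meminfo' or die "open /proc/meminfo: $!";
        my %kb;
        while (<$mi>) {
            $kb{$1} = $2 if /^(\w+):\s+(\d+)\s+kB/;
        }
        close $mi;
        printf "effectively available: %d kB\n",
            $kb{MemFree} + $kb{Buffers} + $kb{Cached};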

      One possible thing to consider, though, is that if this command was executed on a server that was doing many other things at the same time, we don't necessarily know who is using those buffers, nor what distribution of buffer use is needed, on the system as a whole, to make everything run efficiently. When the OP said that the activity had a serious impact, I believe him. Stealing a buffer is not a no-cost operation, because it means you must do another physical disk read to re-obtain whatever it once held.

      Even though I'm just on my first cup of coffee this morning, I also have a healthy skepticism about these numbers. They seem much too small, given the reports of what this program is supposed to be doing and how much memory it ought to be taking up. You really need to profile this thing, to see what it's waiting on: how much disk I/O, how many page faults, and so on; how much virtual memory it wants vs. what its resident set size is, under actual load conditions.

      It could well be that a database-oriented solution (which can basically do a job like this with a JOIN) would be better, for two reasons. First, a "memory oriented" solution might well still wind up doing an equivalent amount of disk I/O, especially if several other big programs are running elsewhere. Second, with two tables and a JOIN, this requirement would probably be completed by now. After all, of this we can be sure: soon there will be four more fields to add; then Marketing will ask for another six. So it goes. This approach seems to be leading quite rapidly into a corner, and is rapidly losing its "shine."

Re: How to resolve this memory issue
by Discipulus (Canon) on Sep 10, 2014 at 07:41 UTC
    Hello,
    Vague question, vague answer...
    Search the Monastery for threads about processing large files, and try using iterators (see the sketch below).
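
    A minimal iterator sketch (make_reader, its record format, and the file name are invented for illustration): a closure hands back one line at a time, so only the current record is ever held in memory.

        use strict;
        use warnings;

        # Build an iterator over a file: each call returns the next line,
        # or undef at end-of-file.
        sub make_reader {
            my ($path) = @_;
            open my $fh, '<', $path or die "open $path: $!";
            return sub {
                defined( my $line = <$fh> ) or return;
                chomp $line;
                return $line;
            };
        }

        my $next_line = make_reader('input.txt');
        while ( defined( my $line = $next_line->() ) ) {
            # parse and process one record here
        }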

    I'm not the best one around here, but I think Perl does not release memory back to the OS (I think it depends on the OS); however, when you free memory used by the current Perl instance, it becomes available again to that program itself.
    So be sure to control the scope of your buffers, because when something goes out of scope its memory is freed; this applies to file handles especially.

    If you can post the minimum code that produces the same memory consumption, I'm sure you'll find some good advice.

    HtH
    L*
    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      Hi, thanks for the reply. Below is the minimum code. In this part, each record of the 1 GB input file (filehandle AH) is processed so that it takes the form of a hash reference; the values of the 'bidirVolume' field are then stored in the hash $AccUsg.

      my $AccUsg;
      while (<AH>) {
          chomp;
          my $line = $_;
          $AccuCount++;

          # Extract the subscriber's mobile number.
          my $MobileNumber;
          if ( $line =~ /subscriberId:(\w+)\(\"(\d+)\"\)/ ) {
              $MobileNumber = $2;
          }

          my $plan = $line;
          $plan =~ s/\\//g;
          my @AccVolume;
          if ( $plan =~ /usageControlAccum:(\w+)\(\"(.*)\"\)/ ) {
              # Rewrite the embedded JSON-like text into Perl syntax, then
              # eval it into a data structure.
              my $p = $2;
              $p =~ s/:\{/ => {/g;
              $p =~ s/:\[/ => [/g;
              $p =~ s/\"/\'/g;
              $p =~ s/\':/\'=>/g;
              $p =~ s/\}n/\}/g;
              #print $p,"\n";
              my $e = eval($p);
              if ($@) {    # eval error check
                  push( @AccVolume, "error" );
              }
              else {
                  #print Dumper($e);
                  foreach my $value ( @{ $e->{'reportingGroups'} } ) {
                      if ( exists( $value->{'absoluteAccumulated'}->{'counters'} ) ) {
                          $AccUsg->{$MobileNumber}->{ $value->{'subscriberGroupName'} }
                              = $value->{'absoluteAccumulated'}->{'counters'}->[0]->{'bidirVolume'};
                      }
                      elsif ( exists( $value->{'absoluteAccumulated'}->{'bidirVolume'} ) ) {
                          $AccUsg->{$MobileNumber}->{ $value->{'subscriberGroupName'} }
                              = $value->{'absoluteAccumulated'}->{'bidirVolume'};
                      }
                  }
              }
          }
      }
      close(AH);

      Now the second 16 GB input file (FH) is read; each record's plan name is matched against the $AccUsg hash, and the matched value is printed to the output file.

      while (<FH>) {
          chomp;
          my $line = $_;
          $SubsCount++;
          my ( $msisdn, $IMEI, $Circle, $DeviceType, $OPTIN, $PlanType, $familyId, $trafficIds );

          if ( $line =~ /userId:S(\d+)\(\"(\w+)\"\)/ ) {
              $msisdn = $2;
          }

          # e.g. ix0:S13("OptInState:3G")ix1:S11("CircleId:MH")ix2:S8("DevType:")ix3:S9("imei:NULL")ix4:S16("PlanType:prepaid")
          if ( $line =~ /operatorInfo:A(\d+)\[(.*?)\]/ ) {
              my $opcinfo = $2;
              #print $opcinfo,"\n";
              if ( $opcinfo =~ /ix(\d+):S(\d+)\(\"imei:(\w*)\"\)/ )       { $IMEI  = $3; }
              if ( $opcinfo =~ /ix(\d+):S(\d+)\(\"OptInState:(\w*)\"\)/ ) { $OPTIN = $3; }
              if ( $opcinfo =~ /ix(\d+):S(\d+)\(\"CircleId:(\w*)\"\)/ ) {
                  $Circle = $3;
                  $Circle = $lookup->{$Circle} if exists $lookup->{$Circle};
              }
              if ( $opcinfo =~ /ix(\d+):S(\d+)\(\"DevType:(\w*)\"\)/ )  { $DeviceType = $3; }
              if ( $opcinfo =~ /ix(\d+):S(\d+)\(\"PlanType:(\w*)\"\)/ ) { $PlanType   = $3; }
          }

          # e.g. groups:A1[ix0:S10("3BASIC:100")]
          my @ValidPlan;
          if ( $line =~ /groups:A(\d+)\[(.*?)\]/ ) {
              my $plans    = $2;
              my @AllPlans = split( 'ix\d+:S\d+\("', $plans );
              #print Dumper($AccUsg->{$msisdn});
              foreach my $p (@AllPlans) {
                  $p =~ s/\"\)//g;
                  next if $p eq "";
                  #my @planname=split(":",$p);
                  if ( $p =~ /(\w+):(\d+)[:]?(.*)/ ) {
                      my $planname = $1;
                      my $priority = $2;
                      my $expdate  = $3;
                      $expdate =~ s/,/\;/g;
                      if ( exists( $AccUsg->{$msisdn}->{$planname} ) ) {
                          if ( $expdate eq "" ) {
                              push( @ValidPlan, "$planname;$priority;;;$AccUsg->{$msisdn}->{$planname}" );
                          }
                          elsif ( length($expdate) > 19 ) {
                              push( @ValidPlan, "$planname;$priority;$expdate;$AccUsg->{$msisdn}->{$planname}" );
                          }
                          else {
                              push( @ValidPlan, "$planname;$priority;$expdate;;$AccUsg->{$msisdn}->{$planname}" );
                          }
                      }
                      else {
                          if ( $expdate eq "" ) {
                              push( @ValidPlan, "$planname;$priority;;;" );
                          }
                          elsif ( length($expdate) > 19 ) {
                              push( @ValidPlan, "$planname;$priority;$expdate;" );
                          }
                          else {
                              push( @ValidPlan, "$planname;$priority;$expdate;;" );
                          }
                      }
                  }
              }
          }

          if ( $line =~ /familyId:S(\d+)\(\"(.*?)\"\)/ ) { $familyId   = $2; }
          if ( $line =~ /trafficIds:A(\d+)\[(.*?)\]/ )   { $trafficIds = $2; }

          # Now sorting the plans priority-wise, highest first.
          #my $printPlan=join("|",@ValidPlan);
          my @validPlanSorted =
              sort { ( $b =~ /(.*?);(\d+);(.*?)/ )[1] <=> ( $a =~ /(.*?);(\d+);(.*?)/ )[1] } @ValidPlan;
          my $printPlan = join( "|", @validPlanSorted );
          $finalCount++;
          #print OUT "$msisdn;$IMEI;$Circle;$DeviceType;$OPTIN;$PlanType;$familyId;$trafficIds;$printPlan\n";
          print OUT "$msisdn,$IMEI,$Circle,$DeviceType,$OPTIN,$PlanType,$printPlan\n";
      }
      close(FH);

      One record of AH looks like this:

      P[containerVrsn:U(0)recordVrsn:U(0)size:U(560)ownGid:G[mdp:U(1118178935)seqNo:U(55)]logicalDbNo:U(1)classVrsn:U(1)timeStamp:U(0)dbRecord:T[classNo:U(1091971)size:U(532)updateVersion:U(1157)checksum:U(1318378905)EPC_UsageControlAccumulatedPot:R[subscriberId:S12("918970692483")usageControlAccum:S543("{\"reportingGroups\":[{\"absoluteAccumulated\":{\"counters\":[{\"bidirVolume\":3022011787,\"name\":\"base\"}],\"expiryDate\":{\"volume\":\"25-05-2013T00:00:00\"},\"previousExpiryDate\":{\"time\":\"25-04-2013T00:00:00\",\"volume\":\"25-04-2013T00:00:00\"},\"reportingLevel\":\"totalTraffic\",\"resetPeriod\":{\"volume\":\"30 days\"}},\"name\":\"110\",\"restartInfo\":\"25-04-2013T00:00:00\",\"selected\":\"no\",\"subscriberGroupName\":\"KA_FUP_2G80K_30\",\"subscriptionDate\":\"25-04-2013T00:00:00\",\"validityTime\":1800}],\"version\":\"2.0\"}")]]]

      One record of FH looks like this:

      P[containerVrsn:U(0)recordVrsn:U(0)size:U(276)ownGid:G[mdp:U(1090171511)seqNo:U(28)]logicalDbNo:U(1)classVrsn:U(1)timeStamp:U(0)dbRecord:T[classNo:U(1064620)size:U(248)updateVersion:U(5)checksum:U(928324968)EPC_SubscriberPot:R[userId:S12("919902995746")groups:A1[ix0:S12("KA_BASIC:100")]services:A0[]blacklist_services:A0[]operatorInfo:A5[ix0:S21("HomeMNC1:DefaultValue")ix1:S11("CircleId:KA")ix2:S13("OptInState:2G")ix3:S9("imei:NULL")ix4:S16("PlanType:prepaid")]pccSubscriberPotRef:M0[]notificationData:A0[]familyId:S0("")trafficIds:A0[]]]]

      And one record of the output file (the first line shows the field layout):

      MSISDN,IMEI,Circle,DeviceType,OPTIN,PlanType,PACKID1;priority;startdate;enddate;AccumulateUsage|PACKID2;priority;startdate;enddate;AccumulateUsage|PACKID3;priority;startdate;enddate;AccumulateUsage|PACKID4;priority;startdate;enddate;AccumulateUsage|
      919164032638,NULL,KA,,2G,prepaid,KA_BASIC;100;;;|KA_FUP_2G80K_30;90;22-08-2014T00:00:00;20-09-2014T23:59:59;7032609129

      Now I need to add four more fields from the AH file (counter, expiry_date, prev_expirydate_volume & prev_expirydate_time). For this I would need to add more hashes like $AccUsg, and those are what cause the memory issue. So how much more memory will it need? I know the code is badly written, but that's how it has been running in production for a few years now, and I can't help it :) ... thanks a lot.
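
      For what it's worth, one way to avoid four extra hashes (a sketch only, meant to slot inside the foreach my $value loop in the AH code above; the exact paths into $value are assumptions based on the sample AH record, and the // operator needs Perl 5.10+) is to pack all five values into the one existing $AccUsg entry:

          # Store "volume;counter;expiry;prev_vol;prev_time" as a single
          # string instead of keeping five parallel hashes.
          my $acc = $value->{'absoluteAccumulated'};
          $AccUsg->{$MobileNumber}->{ $value->{'subscriberGroupName'} } = join ';',
              $acc->{'counters'}->[0]->{'bidirVolume'} // $acc->{'bidirVolume'} // '',
              $acc->{'counters'}->[0]->{'name'}        // '',   # counter
              $acc->{'expiryDate'}->{'volume'}         // '',   # expiry_date
              $acc->{'previousExpiryDate'}->{'volume'} // '',   # prev_expirydate_volume
              $acc->{'previousExpiryDate'}->{'time'}   // '';   # prev_expirydate_time

          # In the FH loop, split the packed value back apart:
          # my ($volume, $counter, $expiry, $prev_vol, $prev_time)
          #     = split /;/, $AccUsg->{$msisdn}->{$planname}, 5;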

Re: How to resolve this memory issue
by pme (Monsignor) on Sep 10, 2014 at 14:29 UTC
    You can move your hashes from memory to disk. See 'man perltie'.
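
    For instance, with the core SDBM_File module (a minimal sketch; the file name and sample key come from the records above, and note SDBM limits each key+value pair to roughly 1 kB, so the DB_File approach above scales better for large values):

        use strict;
        use warnings;
        use Fcntl;        # O_RDWR, O_CREAT
        use SDBM_File;

        # The tied hash lives in plans.pag/plans.dir on disk, not in RAM.
        tie my %plans, 'SDBM_File', 'plans', O_RDWR | O_CREAT, 0666
            or die "tie failed: $!";

        $plans{'919902995746;KA_BASIC'} = '100;;;';
        print $plans{'919902995746;KA_BASIC'}, "\n";

        untie %plans;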
