Reduce CPU utilization time in reading file using perl

by madtoperl (Hermit)
on Sep 27, 2013 at 16:07 UTC ( [id://1056015] )

madtoperl has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a very big input file, around 5590600 MB in size, and while opening and reading it my script takes 99 to 100% of the CPU. I am trying to reduce the CPU utilisation to less than 50%, but have not been able to. When I run the "top" command, the summary line says CPU usage: 28.2%, but the detailed line for the PID says 100% usage, so I am not sure whether it is really taking 100% or only 28%. If it is taking only 28% of the overall CPU then it is fine; otherwise, could you please let me know how to reduce the CPU usage to less than 50%?
tie @lines, 'Tie::File', "testfile.dat" or die "Can't read file: $!\n";
$linecount = $#lines + 1;
#print "Linecount=> $linecount\n";
foreach ( @lines ) {
    chomp;
    ($type, $No, $date) = split(/\|/);
    $hash{$No.$date} = $type."@".$No."@".$date;
}
untie @lines;
PROCESS TIME
Processes: 135 total, 4 running, 6 stuck, 125 sleeping, 926 threads          21:13:58
Load Avg: 1.22, 1.17, 1.07   CPU usage: 28.2% user, 3.32% sys, 68.64% idle
SharedLibs: 10M resident, 9736K data, 0B linkedit.
MemRegions: 28168 total, 2921M resident, 76M private, 568M shared.
PhysMem: 1056M wired, 3782M active, 971M inactive, 5810M used, 2380M free.
VM: 318G vsize, 1054M framework vsize, 248025(0) pageins, 0(0) pageouts.
Networks: packets: 453919/88M in, 393620/58M out.
Disks: 65480/3782M read, 137948/7081M written.

PID  COMMAND   %CPU  TIME      #TH  #WQ  #PORT  #MREGS  RPRVT  RSHRD  RSIZE  VPRVT  VSIZE  PGRP  PPID  STATE  UID  FAULTS  COW  MSGSENT  MSGRECV  SYSBSD  SYSMACH  CSW  PAGEINS  KPRVT  KSHRD  USER
     perl5.12  99.9  00:08.58  1/1  0    22     56+     99M+   1244K

Replies are listed 'Best First'.
Re: Reduce CPU utilization time in reading file using perl
by BrowserUk (Patriarch) on Sep 27, 2013 at 16:26 UTC

    Using Tie::File on such a huge file -- or any file over a few (single digit) megabytes -- is stupid. It will use huge amounts of CPU and be very slow.

    You can build your hash much (much, much) more quickly this way:

    open BIGFILE, '<', "testfile.dat" or die "Can't open file: $!\n";
    my %hash;
    while( <BIGFILE> ) {
        chomp;
        my( $type, $No, $date ) = split(/\|/);
        $hash{$No.$date} = $type."@".$No."@".$date;
    }
    close BIGFILE;
    ## do something with the hash.

    It will use far less CPU and memory, and complete in less than half the time.

    However, it is really doubtful that you will be able to build a hash from that size of file without running out of memory unless:

    • There are huge numbers of duplicate records in that file.
    • You have a machine that has huge amounts of memory.
    • You have a huge swap partition. (Preferably sited on a SSD).

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hi BrowserUK,
      Thanks a lot for your inputs. I have tried your option as well, but the CPU usage is still 100%. Is it possible to load only one line at a time from the huge file into memory and store it into the hash, or to store it into the hash without reading the whole file at once? I worry that this may not be possible, but I still wanted your suggestion.
      Thanks
      madtoperl
        I have tried your option as well, but the CPU usage is still 100%.

        That is because you are using more memory for the hash than you have installed, thus, some parts of the memory holding the hash are being swapped or paged to disk as the file is being read. The nature of the way hashes are stored means that pages of memory are constantly being written to disk and then re-read, over and over; and that is what is driving up your cpu usage.

        Is it possible to load only one line at a time from the huge file into memory and store it into the hash, or to store it into the hash without reading the whole file at once?

        That is what my code does: it reads one line, installs it into the hash, then reads the next. It is the size of the hash that is the problem, not the line-by-line processing of the file.

        I worry that this may not be possible, but I still wanted your suggestion.

        There are various ways of providing access to huge amounts of data without requiring that it all be held in memory concurrently. Which of those methods/mechanisms is appropriate for your purpose depends entirely upon what you need to do with that data.

        So, before advising further, you need to answer the question: Why are you attempting to load all the data into a hash?
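
        Purely as an illustration of one such mechanism (not a recommendation for your specific case): a hash tied to an on-disk DBM file via DB_File, so the data lives on disk rather than in RAM. The file name bigdata.db is made up for this sketch, and DB_File must be available on your system.

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        # Tie the hash to a Berkeley DB file on disk instead of holding it all in memory.
        tie my %hash, 'DB_File', 'bigdata.db', O_RDWR|O_CREAT, 0644, $DB_HASH
            or die "Cannot tie hash to bigdata.db: $!\n";

        open my $fh, '<', 'testfile.dat' or die "Can't open file: $!\n";
        while ( my $line = <$fh> ) {
            chomp $line;
            my ( $type, $No, $date ) = split /\|/, $line;
            $hash{ $No . $date } = join '@', $type, $No, $date;
        }
        close $fh;
        untie %hash;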


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Reduce CPU utilization time in reading file using perl
by talexb (Chancellor) on Sep 27, 2013 at 17:11 UTC

    I'd say your best bet is to process the file one line at a time, as has already been suggested, and put the relevant information into a database. Once it's there, you can gather and extract the information you need -- and let the database engine worry about figuring out the best way to store it all.
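
    As a rough sketch of that approach (my example, not talexb's exact setup), the pipe-delimited records could be loaded into an SQLite table via DBI. The file name records.db and the column names are assumptions, and DBD::SQLite needs to be installed:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=records.db', '', '',
        { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do('CREATE TABLE IF NOT EXISTS records (type TEXT, no TEXT, date TEXT)');
    my $insert = $dbh->prepare('INSERT INTO records (type, no, date) VALUES (?, ?, ?)');

    open my $fh, '<', 'testfile.dat' or die "Can't open file: $!\n";
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $type, $no, $date ) = split /\|/, $line;
        $insert->execute( $type, $no, $date );
    }
    close $fh;
    $dbh->commit;       # one transaction for the whole load keeps it reasonably fast
    $dbh->disconnect;

    After that, lookups and comparisons become SQL queries and the database engine worries about the memory management.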

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

Re: Reduce CPU utilization time in reading file using perl
by aaron_baugher (Curate) on Sep 27, 2013 at 18:09 UTC

    As others have said, there are better ways to do this. But as for your question on CPU usage: the 99.9% that ps is reporting means that your perl process is using 99.9% of the CPU that is being used. So no other processes are using much to speak of. The usage reported by top is based on total capacity, so idle time is included in the total there.

    To reduce a process's load on a system, see nice or the section on nice in your shell's man page.
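
    For instance (a minimal sketch, assuming a Unix-like system; note that lowering the priority only makes the process yield to other work, so on an otherwise idle machine it will still show close to 100% of one core):

    # From the shell:
    #   nice -n 19 perl your_script.pl
    #
    # Or from inside the script itself (0 = PRIO_PROCESS, $$ = this process,
    # 19 = lowest priority):
    setpriority( 0, $$, 19 ) or warn "Could not lower priority: $!\n";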

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re: Reduce CPU utilization time in reading file using perl
by wink (Scribe) on Sep 27, 2013 at 17:19 UTC

    You're saying your file is over 5 TB (terabytes) in size... is that what you actually meant? Unless your system has a few TB of memory, good luck.

    The differing CPU utilization is probably from a multi-core processor. Without multi-threading, you'll never use more than 1 core at a time.

Re: Reduce CPU utilization time in reading file using perl
by Laurent_R (Canon) on Sep 27, 2013 at 17:05 UTC

    I agree with BrowserUK that loading such a huge file into memory even before you start reading the first line is a no go. You really want to iterate line by line over the file to reduce the memory used. Having said that, it is unlikely you will be able to fit 5 million megabytes into a hash.

    On your CPU usage question, how many CPUs/cores do you have?

      Hi Laurent R,
      Thanks a lot. I have only one CPU. Why does the CPU usage say 23% in one place and 100% in the other? Is my script using the CPU's whole 100% while it runs, or only 23% of the whole CPU's time? Could you please clarify.
      Thanks
      madtoperl

        Hi, my question was: how many CPUs/cores (not just CPUs) do you have? Even if you have only one (e.g. Intel) CPU, but with, say, five cores, your process might very well take more or less 100% of one core's processing power but still leave the 4 other cores almost completely idle. Since you are not forking subprocesses in your program nor using any threads, your process can basically only use one core (the system itself might be able to delegate a small fraction of its own work to another core, but this is likely to be very limited). So you might very well use 100% of one core's processing power, but only 20 or 25% of the total CPU processing power.

Re: Reduce CPU utilization time in reading file using perl
by Discipulus (Canon) on Sep 30, 2013 at 08:03 UTC
    Maybe you'll find it useful to read about iterators in the precious book Higher-Order Perl.
    On CPAN there is a module too: Iterator
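
    A minimal sketch in that spirit, using a plain closure instead of the Iterator module; the helper name make_record_iterator and the field names are made up for the example:

    use strict;
    use warnings;

    # Returns a closure that yields one parsed record per call and nothing
    # when the file is exhausted, so only one line is ever held in memory.
    sub make_record_iterator {
        my ($file) = @_;
        open my $fh, '<', $file or die "Can't open $file: $!\n";
        return sub {
            my $line = <$fh>;
            return unless defined $line;
            chomp $line;
            my ( $type, $No, $date ) = split /\|/, $line;
            return { type => $type, no => $No, date => $date };
        };
    }

    my $next_record = make_record_iterator('testfile.dat');
    while ( my $rec = $next_record->() ) {
        # process $rec->{type}, $rec->{no}, $rec->{date} here
    }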
    there are no rules, there are no thumbs..
Re: Reduce CPU utilization time in reading file using perl
by Corion (Patriarch) on Sep 30, 2013 at 08:13 UTC

    If you are comparing two files for common/different keys, and if both files are about the same (huge) size, I guess you will have to get smarter than keeping all the information in memory (because you don't have enough memory).

    If you can make an educated guess as to where in a file a key is likely to be found, you could use seek to look for the key in the file. This is horribly slow, but likely still faster than swapping memory. If you want to be fancy, you can cache parts of the file in memory.

    If you cannot make an educated guess, I guess it will pay off to convert at least one file into a file with all your keys in fixed width, sorted by the keys. Then you can easily make an educated guess to find a given key. If you convert both files to that structure, you can easily find the keys missing in one of the two files by reading through the sorted key files line by line. This approach will roughly double your disk requirements.
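
    As a rough sketch of that lookup (a binary search via seek over the fixed-width, sorted key file), where the record length, the file name sorted_keys.dat and the space padding are all assumptions for the example:

    use strict;
    use warnings;

    use constant RECORD_LEN => 21;    # assumed: 20-byte key padded with spaces + newline

    # Binary search for one key in a file of fixed-width, sorted, newline-terminated keys.
    sub key_exists {
        my ( $fh, $key ) = @_;
        my ( $lo, $hi ) = ( 0, ( -s $fh ) / RECORD_LEN - 1 );
        while ( $lo <= $hi ) {
            my $mid = int( ( $lo + $hi ) / 2 );
            seek $fh, $mid * RECORD_LEN, 0 or die "seek failed: $!";
            read $fh, my $rec, RECORD_LEN;
            chomp $rec;
            if    ( $rec lt $key ) { $lo = $mid + 1 }
            elsif ( $rec gt $key ) { $hi = $mid - 1 }
            else                   { return 1 }
        }
        return 0;
    }

    open my $keys_fh, '<', 'sorted_keys.dat' or die "Can't open sorted_keys.dat: $!\n";
    my $padded = sprintf '%-20s', 'some_key_value';
    print key_exists( $keys_fh, $padded ) ? "found\n" : "missing\n";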

Re: Reduce CPU utilization time in reading file using perl
by Laurent_R (Canon) on Sep 30, 2013 at 11:40 UTC

    If your files are sorted in accordance with the comparison key, then you can iterate through the two files in parallel. This can be very, very fast. Just a couple of hours ago, I compared two 100-MB files this way; it took less than 3 seconds to run.

    $ time perl compare_files.pl

    real    0m2.378s
    user    0m1.384s
    sys     0m0.069s

    Even if they are not sorted, this might still be the solution: first sort both files, then read them in parallel. The only difficulty is getting the parallel reading exactly right.
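
    A minimal sketch of that parallel (merge-style) read, assuming both files are already sorted on the key that starts each line and use the '|' delimiter from the original post; the file names are made up:

    use strict;
    use warnings;

    open my $fh_a, '<', 'file_a.sorted' or die "Can't open file_a.sorted: $!\n";
    open my $fh_b, '<', 'file_b.sorted' or die "Can't open file_b.sorted: $!\n";

    my $line_a = <$fh_a>;
    my $line_b = <$fh_b>;
    while ( defined $line_a and defined $line_b ) {
        my ($key_a) = split /\|/, $line_a;
        my ($key_b) = split /\|/, $line_b;
        if    ( $key_a lt $key_b ) { print "only in A: $key_a\n"; $line_a = <$fh_a>; }
        elsif ( $key_a gt $key_b ) { print "only in B: $key_b\n"; $line_b = <$fh_b>; }
        else  { $line_a = <$fh_a>; $line_b = <$fh_b>; }   # key present in both files
    }
    # Drain whichever file still has lines left.
    while ( defined $line_a ) { print "only in A: ", (split /\|/, $line_a)[0], "\n"; $line_a = <$fh_a>; }
    while ( defined $line_b ) { print "only in B: ", (split /\|/, $line_b)[0], "\n"; $line_b = <$fh_b>; }

    Each file is read exactly once, so memory use stays constant regardless of file size.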
