perlyr has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I was running a perl program extracting strings from over 15,000 text files. It run really well until the 6659th file and suddenly terminated. I thought something was wrong with the 6660th file, but the program run as usual if I just run the program on these files: 6660, 6661,and 6662. Any thoughts on this? I wish I could post those files, but they are too big (about 20GB). Thanks!

Replies are listed 'Best First'.
Re: extracting strings from text files
by BrowserUk (Patriarch) on Jul 08, 2012 at 04:22 UTC
    I wish I could post those files, but they are too big (about 20GB)

    The files wouldn't help us to help you, but the code you are using to read them would.

    Some possibilities are:

    • You are running out of file handles;
    • you are running out of memory;
    • your OS/user account is set up to kill your processes if the exceed certain limits of cpu/IO.memory usage;
    • many others.

    If you posted the code, most of those could be eliminated very easily and the real reason might well be obvious to someone here.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re: extracting strings from text files
by xyzzy (Pilgrim) on Jul 08, 2012 at 04:31 UTC

    Yeah, this is probably a memory issue. Do you have any tools to monitor virtual memory usage? You might want to use them and see how much your script eats up and how it grows over time.


    $,=qq.\n.;print q.\/\/____\/.,q./\ \ / / \\.,q.    /_/__.,q..
    Happy, sober, smart: pick two.
      maybe you are right. Do you know how to adjust the memory allocation then? Thanks

        1. Do you see any error messages or access to OS level log messages when you run/reach the stage.

        2.Monitoring and changing the memory/OS user limits depend on your OS port/platform. On most UNIX like/Linux you can monitor virtual memory with OS commands like vmstat,free.On linux you can even use even any of decent free tools like collectl etc to monitor system resource/memory usage. (ulimit can tell your current settings). Task manager in Windows is a best place to start.

        3.Above all it might be better to get the issue correctly identified/verified before attempting to increase limits some of which may need root access.

      No, I don't know how to monitor that. Not sure this is a memory issue. I just bought this desktop which has 16GB memory. Below is the major part of the code.
      foreach (@infiles) { my $data; my $address = $_; unless(open (FILE,"<$_")){ print STDERR "could not open $_: $!\n"; next;} local $/; $data=<FILE>; my($html) = $data =~ /(<p>|<\/P>|&nbsp;|<html>)/im; if (defined($html)){ $data =~ s/<[^>]+>/\n/g; $data =~ s/&nbsp;//ig; $data =~ s/&lt;/</ig; $data =~ s/&gt;/>/ig; $data =~ s/&quot;/"/ig; $data =~ s/&amp;/&/ig; $data =~ s/&#\d+;/&/ig; $data =~ s/&&/ &/ig; $data =~ s/(\w+)&(\d+)/$1 $2/ig; } my($block, $lines, $auditor, $auditorcity, $auditorstate, $date_audited, $keyword1, $keyword2, $keyword3, $keyword4, +$block2,); ($keyword1) = $data =~ /(have(\s*|\n)audited [^a])/i; ($keyword2) = $data =~ /(our(\s+|\n)audits|consent to)/i; if (!defined($keyword1) && !defined($keyword2)) {($block) ="" +;} elsif (!defined($keyword1) && defined($keyword2)) {($block) = + $data =~ /($keyword2(?:[^\n]*\n){1,350})/im;} elsif(defined($keyword1)) {($block) = $data =~ /($keyword1( +?:[^\n]*\n){1,350})/im;} ($keyword3) = $block =~ /(((REPORT(\s+|\n)OF|CONSENT OF)|)\s +*INDEPENDENT\s*(CERTIFIED|registered|registered CERTIFIED|)\s*PUBLIC\ +s*(ACCOUNTANTS|accounting(\s+|\n)firm)|REPORT OF INDEPENDENT ACCOUNTA +NTS|REPORT OF INDEPENDENT AUDITORS|)/im; if (defined($keyword3)) {($lines) = substr ($block,1,index($b +lock,$keyword3)-1);} elsif (!defined($keyword3)){($lines) = $block;} ($auditor) = $lines =~ /^\s*((?:[A-Z]\w+|\/\w+|&\s*\w+).+?\s* +(LLP|L\.L\.P\.|LLC|LTD|P.A.|P\.C\.|PC)(\.|))$/m; if(defined ($auditor)){ ($auditorcity) = $lines =~ /$auditor\s*(?:(?:.+?\s*(?:LLP|L\ +.L\.P\.|LLC|LTD|P.A.|P\.C\.|PC)(?:\.|))|\(.+\)|)\s*(.+?)(?<![\d]),\s* +(?:.+?)$/m; ($auditorstate) = $lines =~ /$auditor\s*(?:(?:.+?\s*(?:LLP|L\ +.L\.P\.|LLC|LTD|P.A.|P\.C\.|PC)(?:\.|)|\(.+\)|)\s*.+?(?<![\d]),\s*)(. ++?)$/m; ($date_audited) = $lines =~ /$auditorstate\s*(\w+\s*(\d\d|\d) +,\s*\d{4})\s*($|except|\()/m; } if(!defined($auditorcity) && !defined($auditorstate)){ ($auditorcity) = $lines =~ /^\s*(\w+)(?<![\d]),\s*(?:\w+)(? +<!LLP)$/m; ($auditorstate) = $lines =~ /^\s*(?:\w+(?<![\d]),\s*)(\w+)(? +<!LLP)$/m; ($date_audited) = $lines =~ /^\s*(\w+\s*(\d\d|\d),\s*\d{4})\ +s*($|except|\()/m; } if(defined ($auditor) && !defined($auditorcity) && !defined($ +auditorstate)){ ($auditorcity) = $lines =~ /^\s*(\w+|\w+ \w+|\w+ \w+ \w+)(?: +(?<![\d]),\s*(?:\w+|\w+ \w+|\w+ [^a] \w+))(?<![\d])$/m; ($auditorstate) = $lines =~ /^\s*(?:(?:\w+|\w+ \w+|\w+ \w+ \w ++)[^\d],\s*)(\w+|\w+ \w+|\w+ [^a] \w+)(?<![\d])$/m;} if (!defined($auditorcity) && !defined($auditorstate) && defi +ned($date_audited)){ ($auditorcity) = $lines =~ /^\s*(\w+|\w+ \w+|\w+ \w+ \w+)(?: +(?<![\d]),\s*(?:\w+|\w+ \w+|\w+ [^a] \w+))\s*$date_audited/m; ($auditorstate) = $lines =~ /^\s*(?:(?:\w+|\w+ \w+|\w+ \w+ \w ++)[^\d],\s*)(\w+|\w+ \w+|\w+ [^a] \w+)\s*$date_audited/m; } if (!defined($auditor) && defined($auditorcity)){ ($auditor) = $lines =~ /\n{2,}\s*(.+?)?\s+$auditorcity/m; } if (!defined($auditor) && defined($auditorcity)){ ($auditor) = $lines =~ /(?:\/s\/)\s*([^-]+?)\s*$auditorcit +y/m;} if (!defined($auditor)){ ($auditor) = $lines =~ /(?:\/s\/)\s*([^-]+?)$/m;} if (!defined($auditor)){ ($lines) = $data =~ /(consent to(?:[^\n]*\n){1,90})/im; ($auditor) = $lines =~ /^\s*(.+?\s*(LLP|L\.L\.P\.|LLC|L +TD|P.A.|P\.C\.|PC))$/m; if(defined ($auditor) && !defined($auditorcity) && !defined($ +auditorstate)){ ($auditorcity) = $lines =~ /$auditor\s*(.+?),\s*(?:.+?)$/m; ($auditorstate) = $lines =~ /$auditor\s*(?:.+?,\s*)(.+?)$/m;} } if (!defined($auditor)){ ($auditor) = $data =~ /((PWC|KPMG|ERNST & YOUNG|DELOITTE & + TOUCHE|PRICEWATERHOUSECOOPERS|Young LLP)\s*(LLP|))/i;} if(defined ($auditor) && !defined($auditorcity) && !defined($au +ditorstate)){ ($auditorcity) = $lines =~ /$auditor\s*(.+?)(?<![\d]),\s*(?:.+ +?)$/m; ($auditorstate) = $lines =~ /$auditor\s*(?:.+?(?<![\d]),\s*)(.+ +?)$/m;} if(defined ($auditor) && !defined($auditorcity) && !defined($au +ditorstate)){ ($auditorcity) = $lines =~ /$auditor\s*(?:\w+ (?:\d|\d\d)),\s* +\d{2,4}\s*(.+?),(?:.+?)$/m; ($auditorstate) = $lines =~ /$auditor\s*(?:\w+ (?:\d|\d\d)),\s* +\d{2,4}\s*(?:.+?),(.+?)$/m;} if(!defined($date_audited)){ ($date_audited) = $lines =~ /^\s*(\w+\s*(\d\d|\d),\s*\d{4})(,| +$)/m;} print OUTFILE "$auditor\t"; print OUTFILE "$auditorcity\t"; print OUTFILE "$auditorstate\t"; print OUTFILE "$date_audited\n";

        There is nothing obviously wrong with the code -- at least in terms of causing the script to silently terminate -- which eliminates many of the possibilities.

        The next thing I would look at is does the script accumulate (leak) memory as it runs? On windows look at the task manager, on unix try man top and see whether the memory usage just keeps growing? At what rate per file? And what is the total memory usage when it quits?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        The start of some sanity?

        I can't see in your code where you are closing the filehandle?????

        -- 
        Ronald Fischer <ynnor@mm.st>