Largins has asked for the wisdom of the Perl Monks concerning the following question:

Good evening

I have a large number of files that contain metadata formatted as XML. The files are arranged in a well-organized directory tree, categorized by time and place of creation. I am walking the tree with File::Find, parsing the XML with HTML::Parser, and entering the data I am interested in into a relational database.

All is working very well... but there is an issue that has to do with the sheer volume of data I am processing: tens of thousands of files.
I am thinking of methods I can use to gracefully interrupt the process, without loss of data, and then pick up where I left off at a later time, perhaps on a different computer.

My idea, which should work, is to periodically check for the existence of a file (it could be stdin) containing a command that instructs the program to finish what it's doing, record where it is and the next file to be processed, and then gracefully exit. Use of a command would allow for future direction from the outside, but that's beyond the scope of my question.
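
Roughly what I have in mind, as a sketch (the control file name and the "stop" command are only placeholders for illustration):

use strict;
use warnings;

# Hypothetical control file; any name would do.
my $control_file = 'control.cmd';

sub check_for_command {
    return unless -e $control_file;
    open my $fh, '<', $control_file or return;
    chomp( my $command = <$fh> // '' );
    close $fh;
    return $command;
}

# In the main loop, after each file:
#   my $cmd = check_for_command();
#   if ( defined $cmd and $cmd eq 'stop' ) {
#       # record the next file to be processed here, then
#       exit 0;
#   }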

My question, simply put: is there a better way, perhaps something on CPAN, signals, or whatever? I'm thinking that if I write this myself, I might very well be reinventing the wheel, and I would be happy to use OPC.

Largins


Re: Gracefully exiting and restarting at a later time.
by NetWallah (Canon) on Dec 21, 2011 at 04:20 UTC
    So you want to
    • Track tens of thousands of things
    • maintain state
    • Quickly look up previous state
    • be extensible/flexible
    Sounds like a Database to me.

    SQLite is a small, easy, free, and fast database - it would be a good candidate to get started with.
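
    A minimal sketch of the kind of progress table I mean, using DBI with the SQLite driver (the database file, table, and column names are only placeholders):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=progress.db', '', '',
                            { RaiseError => 1, AutoCommit => 1 } );

    # One row per file; 'done' records whether it has been processed.
    $dbh->do( q{
        CREATE TABLE IF NOT EXISTS progress (
            path TEXT PRIMARY KEY,
            done INTEGER NOT NULL DEFAULT 0
        )
    } );

    # Mark a file as finished once it has been processed ...
    my $mark_done = $dbh->prepare( 'UPDATE progress SET done = 1 WHERE path = ?' );

    # ... and on restart, skip anything already marked done.
    my $is_done = $dbh->prepare( 'SELECT done FROM progress WHERE path = ?' );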

                "XML is like violence: if it doesn't solve your problem, use more."

      Hello

      Well, I am already using SQLite for the storage database and, as stated, File::Find to walk the tree. I am not, however, saving my previous location in the database, but that would complete the "where". And a trigger activated from sqlite3 could initiate the operation. Good idea.

      Largins

Re: Gracefully exiting and restarting at a later time.
by BrowserUk (Patriarch) on Dec 21, 2011 at 05:51 UTC

    I'd do it this way:

    1. Produce a list of the files into a file using dir /b/s *.xml >yourfiles. (Or your OS equivalent.)
    2. Open the list of files with Tie::File and process the tied array backwards.

      Open each file in turn and then delete it from the array once you've processed it:

      #! perl -slw
      use strict;
      use Tie::File;

      tie my( @paths ), 'Tie::File', $ARGV[ 0 ] or die $!;

      for my $file ( reverse 0 .. $#paths ) {
          ## Open and process $paths[ $file ]
          print "Processing file: $paths[ $file ]";

          ## Remove the path just processed
          delete $paths[ $file ];
      }

    To interrupt the processing, just ^C it or kill it or whatever.

    The next time you run the program, it will pick up from where it left off, reprocessing the file it was on when it was interrupted.

    To continue the processing from a different machine, you only need to have access to the filelist file and script and you're away.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Okay

      This would also work, except that there is a needlessly large, redundant file involved. I say redundant because a directory is also a file and would contain the same information. In order for this to work, the file would also have to be updated after each node was processed.

      So far, keeping the information in the database seems like the best idea (although the updated row in the database would have to be written to a file as well)

      Thanks for your thoughts.

      Largins

        except that there is a needlessly large redundant file involved.

        Hm. If you are going to keep the same list in a DB, it will also end up in a file within the filesystem. And depending upon which DB & schema you use, it will occupy anywhere from a little more to perhaps twice as much space as the flat file.

        In order for this to work, the file would also have to be updated after each node was processed.

        You'd have to update the DB after every file to indicate that the file had been processed. And that 'indication', whatever form you choose to use, is still going to end up modifying a file on disk.

        In the end, whether you use a flat file or a "DB", the same steps have to occur -- build a list; grab them one at a time; process; check them off the list -- and the same essential disk activity must occur.

        The difference is, with a DB, you'll also get a whole raft of additional IO going on for its internal logging and journalling activity. All of which is required for its ACID compliance and/or transactional safety, but which is unnecessary overkill for such a simple -- build a list and discard each item when you've processed it -- application.

        Not to mention all the additional complexity involved in setting up, maintaining and using the DB.

        I like simple, but, each to their own :)



Re: Gracefully exiting and restarting at a later time.
by ambrus (Abbot) on Dec 21, 2011 at 10:27 UTC

    It might be better if you recorded your status after you've finished with each file. The advantage of this is that then even if your script dies unexpectedly, you can continue processing from wherever you were. For example, your script can record the list of files it has fully processed in a separate progress file, and when you re-run the script, it should read that file and skip the files that are already processed.

    Once you implement this, if processing the files is idempotent, then you can simply kill the script at any time, whatever it's doing. Otherwise, you may want to implement a cleaner way to shut down, like you mention in your question: e.g. periodically check for the existence of a file, and if it does not exist (because you've deleted it), quit the script. Even then, though, it's worth saving the progress occasionally, such as by writing to the progress file after each file, to avoid having to redo all the computation if the script dies for some unexpected reason, be it unexpected input, a bug in your script, a power failure, running out of memory, or something else.
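
    A rough sketch of that approach (the progress file name, @files_to_process, and process_file() are placeholders for whatever your script already uses):

    use strict;
    use warnings;

    my $progress_file = 'processed.list';       # name is just an example

    # Load the names of files already handled on a previous run.
    my %done;
    if ( open my $in, '<', $progress_file ) {
        chomp( my @lines = <$in> );
        @done{ @lines } = ();
        close $in;
    }

    open my $log, '>>', $progress_file or die "Cannot append to $progress_file: $!";
    select( ( select( $log ), $| = 1 )[0] );    # flush after every write

    my @files_to_process;                       # fill this from File::Find, a list file, etc.
    for my $file ( @files_to_process ) {
        next if exists $done{ $file };
        process_file( $file );                  # stands in for the real XML-to-database work
        print {$log} "$file\n";                 # record it only after it has been fully processed
    }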

    Let me point to an example that may help you. The script wgetas - download many small files by HTTP, saving to filename of your choice has two measures for continuing after an interruption. Namely, to avoid repeating successful downloads, the script does not attempt to download any file if the output file it would save to already exists locally – this works only because the script creates the output file atomically, so the output file cannot exist if the script was interrupted during the download of that same file. Further, to avoid retrying downloads that have failed in a permanent way (such as the file not existing on the remote server), if the script is invoked with the -e option, a progress file is written with the names of downloads already processed. (It's important that output to the progress file is flushed after writing each filename.)

      Greetings

      This idea, combined with keeping the information in the database, is what I shall do. I will store the filename and directory in the table after processing, and use auto-commit.

      The download portion has already been completed, so I don't have to worry about that.

      I will still have to re-walk the directory tree, but on restart I will only have to check the directory name to find out whether it is in the right place. This is necessary (I'm pretty sure) to avoid bypassing unfinished nodes in the directory walk.
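
      Roughly, the restart check I am planning, sketched with DBI and File::Find (the database name, the "processed" table, and process_file() are placeholders for my real ones):

      use strict;
      use warnings;
      use DBI;
      use File::Find;

      my $dbh = DBI->connect( 'dbi:SQLite:dbname=metadata.db', '', '',
                              { RaiseError => 1, AutoCommit => 1 } );

      my $seen = $dbh->prepare( 'SELECT 1 FROM processed WHERE path = ?' );
      my $mark = $dbh->prepare( 'INSERT INTO processed (path) VALUES (?)' );

      my $top_dir = shift @ARGV || '.';

      find( sub {
          return unless /\.xml\z/i;
          my $path = $File::Find::name;

          # Skip anything recorded as finished on a previous run.
          $seen->execute( $path );
          my ($found) = $seen->fetchrow_array;
          $seen->finish;
          return if $found;

          process_file( $path );      # the existing parse-and-insert step
          $mark->execute( $path );    # auto-commit records it immediately
      }, $top_dir );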

      Thanks one and all for your input. I have made my decision, and as always, it is a better one than when I started, thanks to Perl Monks!

      Largins

Re: Gracefully exiting and restarting at a later time.
by Anonymous Monk on Dec 21, 2011 at 05:07 UTC

    Hi,

    A very simple solution is to check, at the start or end of the main loop, for the existence of a stop file. If found, the program would then write away its state to a file or DB (I like SQLite too).

    Also delete the stop file as the first thing you do in the program.
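
    Something along these lines, as a bare sketch (the stop file name and process_file() are only example names):

    use strict;
    use warnings;

    my $stop_file = 'stop.txt';
    unlink $stop_file;                 # clear any leftover stop request first

    my @files;                         # however you build your work list
    for my $file ( @files ) {
        if ( -e $stop_file ) {
            # write the current position to a file or DB here, then stop
            last;
        }
        process_file( $file );         # stands in for the real work
    }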

    Simple and easy to work with and maintain.

    J.C.

Re: Gracefully exiting and restarting at a later time.
by LanX (Saint) on Dec 22, 2011 at 15:38 UTC
    IMHO all the answers you got so far deal with "how do I preserve the state persistently so that I can safely continue?"

    But I understand your question as "how do I tell a running program to stop and resume?"

    Two ideas come into my mind if you don't want to pull information from the file-system:

    1. A signal-handler:

    (Generally, intercepting all "kill" signals is a good idea in your case anyway.) The Perl Cookbook has examples regarding signal handlers; a minimal sketch follows below the list.

    2. Putting everything into a web service and sending the stop message via HTTP.
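
    For the first idea, a minimal handler sketch (the flag-and-check pattern is the point here; @files and process_file() are only placeholders):

    use strict;
    use warnings;

    my $stop_requested = 0;

    # Catch Ctrl-C and a polite kill; only set a flag here, so the
    # current file is always finished before the program exits.
    $SIG{INT} = $SIG{TERM} = sub { $stop_requested = 1 };

    my @files;                         # however the work list is built
    for my $file ( @files ) {
        process_file( $file );         # stands in for the real work
        if ( $stop_requested ) {
            # save enough state to resume from here, then stop cleanly
            last;
        }
    }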

    HTH :)

    Cheers Rolf

    UPDATE:

    maybe a good read: perlipc

Re: Gracefully exiting and restarting at a later time.
by TJPride (Pilgrim) on Dec 21, 2011 at 07:03 UTC
    Well, you definitely don't want to die in the middle of a file - that'll make things too complicated - so you probably don't want to just kill and restart it. What you -could- do is check memory usage after each file and then exit if you've built up more than a certain percentage of memory usage. This works on my system:

    use strict;
    use warnings;

    my ($data, $mem);
    my $kb = ' ' x 1024**2;            # a 1 MB chunk of spaces

    do {
        $data .= $kb;                  # grow memory usage
        $_ = `ps -p $$ -o %mem`;       # ask ps for this process's memory percentage
        ($mem) = m/(\d+\.\d+)/;
    } while ($mem < 50);

    You could also have it check for a file with a certain name (like exit.txt) and exit if it sees that as well, deleting the file before exiting so you're prepped for the next run.