SimonPratt has asked for the wisdom of the Perl Monks concerning the following question:

Hi, monks

Some time ago I developed a Windows Service based system for transferring data between servers on our network. All has been running quite nicely now for almost a year, with one glaring exception: about once every two to three months, a thread will fail when attempting to open a file. For context, there are 5 threads performing the work, which run non-stop while the server is on (each thread executing the affected lines of code several thousand times per day). The servers are rebooted once a week and the service is only restarted when the reboot occurs.

The error I get is:

Thread 10 terminated abnormally: Failed to open \\server\share\INQUEUE\loadqueue.lck: Invalid argument at service.plx line 596.
The relevant code (lines 595 and 596) are:
my $lockfile = $destDir."loadqueue.lck";
open my $fhLock, ">", $lockfile or die "Failed to open $lockfile: $!";

Any suggestions on why this code is failing would be highly appreciated. I guess the obvious workaround would be something like this:

my $lockfile = $destDir."loadqueue.lck";
my $fhLock;
do {
    eval {
        open $fhLock, ">", $lockfile or die "Failed to open $lockfile: $!"
    }
} until (defined $fhLock);
but without understanding why this error might be occurring, I'm a bit wary of doing this.

Replies are listed 'Best First'.
Re: Problem opening file
by marinersk (Priest) on Jul 08, 2015 at 16:15 UTC

    No need to get complicated. If you don't want the script to die, don't use die.

    my $lockfile = $destDir."loadqueue.lck";
    my $fhLock;   # declared outside the loop so the handle survives it
    while (!open $fhLock, ">", $lockfile) {
        print "Failed to open $lockfile: $!";
        sleep(3);  ### Changed from 3000 to 3 (parameter is seconds, not milliseconds!)
    }

    As to why it's failing -- without more information, hard to tell. My guess is that in the process of opening a file -- usually an atomic operation -- the thread may have stepped on something another thread was doing (or vice versa), and failed without a highly visible cause.

    Thus, my first stab at a workaround was to let it breathe 3 seconds (or whatever is appropriate for the environment) and try again.

    Edit: Changed sleep(3000) to sleep(3) (stupid human tricks when switching back and forth between languages)

      "... let it breathe 3 seconds ..."

      From sleep:

      "Causes the script to sleep for (integer) EXPR seconds, ..."

      You're letting it breathe for 50 minutes. :-)

      [Also sleep(3000) in the code in your second response.]

      -- Ken

        D'oh!

        Fixing...

      "As to why it's failing -- without more information, hard to tell. My guess is that in the process of opening a file -- usually an atomic operation -- the thread may have stepped on something another thread was doing (or vice versa), and failed without a highly visible cause."

      I agree with your comment in principle; however, the error returned by open indicates something a bit more basic is occurring. It's as though the arguments I am passing to open aren't being received (at least, that is how I interpret the error).

      I've also done everything I can think of to separate the threads and prevent any toe stomping (such as not share'ing variables, using Thread::Queue for IPC, having a single control to ensure work units that interact with each other can only be passed to the same thread). Although thinking about it now, it is entirely possible that an external process is attempting something it shouldn't be doing. Thanks for your comments and suggestions so far, I'll go have a look at what else is running on the server.

        It's as though the arguments I am passing to open aren't being received

        I agree, and that's my point.

        If, very far under the covers, these threads are sharing some common space for performing the atomic open operation, there could be a race condition where one starts to set up the parameter block, but before the system call is triggered, another thread goes to update the same block of memory, resulting in invalid parameters being in that space once the service call starts to try to read it.

        No doubt one of many possible explanations, but if that's the case, two things come to mind:

        1. A workaround involving retries could buy you a quick and effective, if less than completely graceful, solution, and;
        2. The correct fix, from an engineering perspective, might involve poring over the code for the threads as well as the I/O modules/features to see if there is a prudent place to establish a semaphore or similar gating mechanism to eliminate the race condition.
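
        A minimal sketch of what such a gating mechanism could look like, assuming a threaded perl with the core Thread::Semaphore module available (the guarded_open name is illustrative, not from the original code):

```perl
use strict;
use warnings;
use Thread::Semaphore;

# A one-count semaphore acts as a mutex: only one thread may be inside
# open() at any moment, eliminating any shared-parameter-block race.
my $open_gate = Thread::Semaphore->new(1);

sub guarded_open {
    my ($path) = @_;
    $open_gate->down;                 # wait for exclusive access
    my $ok = open my $fh, '>', $path;
    my ($err, $oserr) = ($!, $^E);    # capture before anything can clobber them
    $open_gate->up;                   # release before we possibly die
    die "Failed to open $path: $err [$oserr]" unless $ok;
    return $fh;
}
```

        Whether this is worth the serialisation cost depends on how hot the open() path is; a retry loop is cheaper if the failure really is this rare.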
Re: Problem opening file
by BrowserUk (Patriarch) on Jul 09, 2015 at 14:46 UTC

    The first thing I would do is add $^E to the error message:

    open my $fhLock, ">", $lockfile
        or die "Failed to open $lockfile: $! [$^E]";

    That may give you a more platform-specific message relating to the error.

    The second thing I would do is ensure that my threads don't terminate (die) from trappable errors.

    How I would go about that would depend upon the structure and usage of the thread procedure; to advise further, I'd need to see the code.

    I'd also like to see several examples of the pathnames that have failed.

    Unaltered; I'm assuming that server and share in the error message posted are italicised because they've been redacted for security reasons?

    If so, and if you have a few samples of the failing paths available, and if actually necessary, continue to obscure them, but do so in a way that doesn't destroy any patterns in the information. Eg. if the server names are engineering1, engineering2 etc. Then change them to dept1 and dept2. etc.

    I guess I can short-circuit that by asking: is it the same server(s) or share(s) that keep failing? But historically, there are often clues in real information that get lost in such redactions.

    In a very generic way: unless $lockfile is a shared variable (I hope not), there is little scope for 'threads' to be the problem here.

    The more likely cause is coincidental concurrency at the file system level (which would equally occur if you were using separate processes). That prediction is strengthened by the very name of the variable.

    To resolve that would require cross-concurrency logging. I.e. each thread or process would need to log its file system activity (at least opens and closes; possibly reads and writes also) in a timely (strictly chronological) order.

    Without seeing the code and structure of the application it is hard to advise on that; but one method is to have a queue (e.g. Thread::Queue) and another thread.

    Your work threads log their activity by writing to the queue, and the logging thread simply loops over that queue writing the messages out to a common log file.
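
    A rough sketch of that arrangement, assuming an ithreads-enabled perl (the activity.log file name and the log_activity helper are made up for illustration):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use IO::Handle;

my $logq = Thread::Queue->new;

# Dedicated logger thread: drains the queue and writes entries in the
# strict order they were enqueued, so the log stays chronological.
my $logger = threads->create(sub {
    open my $log, '>>', 'activity.log' or die "Cannot open log: $!";
    $log->autoflush(1);
    while (defined(my $msg = $logq->dequeue)) {
        print {$log} scalar(localtime), " $msg\n";
    }
    close $log;
});

# Worker threads only ever touch the queue, never the log file directly.
sub log_activity { $logq->enqueue('[tid ' . threads->tid . "] $_[0]") }

# At shutdown: a final undef tells the logger to exit, then join it.
# $logq->enqueue(undef); $logger->join;
```

    Because a single thread owns the file handle, there is no contention on the log itself, and the dequeue order gives you the strict chronology needed to spot coincidental concurrency.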

    Anyway, that's all mostly speculation. If you want more help, I'd need sight of the code.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

      Thanks for responding, BrowserUk, this is really helpful.

      $lockfile is definitely not a shared variable

      Yes, I did redact the server and share names. The boss requires that we continue to redact this type of information for security reasons. I can tell you that the server name is the same in every instance, but the share name has been different each time. Here are all of the examples I have in my log files:

      20150509-stderr.log:Thread 6 terminated abnormally: Failed to open \\PRODSERVERA\BEYC\INQUEUE\loadqueue.lck: Invalid argument at service.plx line 596.
      20150523-stderr.log:Thread 7 terminated abnormally: Failed to open \\PRODSERVERA\FRT4\INQUEUE\loadqueue.lck: Invalid argument at service.plx line 596.
      20150704-stderr.log:Thread 10 terminated abnormally: Failed to open \\PRODSERVERA\JPGE\INQUEUE\loadqueue.lck: Invalid argument at service.plx line 596.

      The logging is being handled by a Thread::Queue (actually by two - one for STDOUT and one for STDERR), so all of the log entries are in chronological order, although I don't capture specific file system activity at the moment.

      In each case, the pattern around what happened is different. In the case of BEYC, the faulting thread loaded files for BEYC, then did a whole bunch of other recipients for the next 12 hours, then crashed on the very next BEYC file to come in. For FRT4, it loaded a bunch of files, with the last file being loaded using a library call and crashing out on the very next file that needed to load. For JPGE, it again loaded a bunch of files successfully, but the last file loaded was passed out to a new Perl instance in a system call before crashing on the next file.

        Okay. Try adding $^E to your logging; it might make for a clearer picture.

        It would be easier to suggest something if you posted the thread code, but at a very minimum I'd rewrite that open something like this:

        my $retries = 5;
        my $fhLock;
        while( $retries and not open $fhLock, ">", $lockfile ) {
            warn "Failed to open $lockfile: $! [$^E]";
            sleep 1;
            --$retries;
        }

        And then wait until it happens again. That should tell you whether it's a temporary, transitory problem or not; and perhaps shed more light on the cause.



      Hi, BrowserUk

      Just wanted to say thanks for your suggestion to add $^E to the output. The error has occurred again and the additional information was completely unexpected, however it is totally understandable and I can now close out the issue.

      Just for posterity, the full error (with the $^E output appended) is "Failed to open \\servername\sharename\INQUEUE\loadqueue.lck: Invalid argument An unexpected network error occurred".
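
      Given that outcome, if a workaround is ever wanted, the retry could be restricted to just this transient condition. On Windows, $^E in numeric context yields the Win32 error code, and "An unexpected network error occurred" is ERROR_UNEXP_NET_ERR (59); on POSIX systems $^E simply mirrors $!. The open_with_retry name and the retry counts here are illustrative:

```perl
use strict;
use warnings;

use constant ERROR_UNEXP_NET_ERR => 59;   # Win32: "An unexpected network error occurred"

sub open_with_retry {
    my ($path, $tries) = @_;
    $tries //= 5;
    for my $attempt (1 .. $tries) {
        open my $fh, '>', $path and return $fh;
        my $oserr = 0 + $^E;              # numeric OS error code
        die "Failed to open $path: $! [$^E]"
            unless $oserr == ERROR_UNEXP_NET_ERR && $attempt < $tries;
        warn "Transient network error opening $path (attempt $attempt); retrying";
        sleep 2;
    }
}
```

      Anything other than the known transient network error still dies immediately, so genuine bugs stay loud.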

Re: Problem opening file
by marinersk (Priest) on Jul 08, 2015 at 16:23 UTC

    One additional complication you might want to handle is making sure the script isn't in a death spiral.

    my $lockfile = $destDir."loadqueue.lck";
    my $failcount = 0;
    my $fhLock;   # declared outside the loops so the handle survives them
    while ($failcount < 10000) {
        while (!open $fhLock, ">", $lockfile) {
            print "Failed to open $lockfile: $!";
            $failcount++;
            sleep(3);  ### Changed from 3000 to 3 (parameter is seconds, not milliseconds!)
        }
        last;
    }

    Edit: Changed sleep(3000) to sleep(3) (stupid human tricks when switching back and forth between languages)

      Yeah, that's one of the reasons I like die'ing. It's easy to trap and log everything without having to fill up my code with fail loops.

      Ideally I would like to know why this is happening, because if the thread is stuffed at this point, it is better to continue to let it die and add handling in other locations to allow the service to catch it and spin up a replacement thread.

      You've come up with a very robust way to hide the fact that the failure is occurring (but not forever). Kind of an XY answer: you solved the problem he was having, but not the question he was asking. UPDATE: although he broached the idea of a workaround first.

      Dum Spiro Spero

        Also, I did address his question in my first answer, below the code.

        Ironically, I recalled the workaround part first, and then the question. Guess I was in LIFO mode today.
