in reply to Trying to optimize reading/writing of large text files.

As far as I know, Perl's flock should prevent other scripts from writing to a file if it is already opened by another script.

This is not true.
File Locking is a cooperative mechanism. If programs want to "share a file", they have to cooperate amongst each other in terms of reading and writing. That basically means that they have to be "polite" with each other.

Oh, I want my chance to speak! You can "get in line" for your chance, but you have to be aware that somebody else is currently "speaking". You have to wait your turn and know that you have to do that.

File "locking" is a way for programs to cooperate if they choose to do so. It does not prevent some "rogue" program from causing a lot of trouble.
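To make the "rogue" point concrete, here is a minimal sketch, assuming a Unix-like system where flock is advisory (Windows enforces its locks, so the result can differ there). The helper name rogue_write_succeeds is made up for illustration. One handle takes an exclusive lock; a second handle that never calls flock writes anyway.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: returns true if an uncooperative writer managed to
# append to $path while another handle held an exclusive lock on it.
sub rogue_write_succeeds {
    my ($path) = @_;

    # The "polite" program opens the file and takes an exclusive lock.
    open(my $polite, '>>', $path) or die "cannot open $path: $!";
    flock($polite, LOCK_EX)       or die "cannot lock $path: $!";

    # The "rogue" never calls flock, so nothing makes it wait its turn.
    open(my $rogue, '>>', $path)  or die "cannot open $path: $!";
    my $ok = print {$rogue} "rogue line\n";
    $ok = close($rogue) && $ok;   # close flushes the buffered write

    close($polite);               # closing the handle also drops the lock
    return $ok;
}
```

The lock only matters to programs that ask for it before touching the file.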

I don't think that Version#1 works because I don't see an UNLOCK operation.

Update: I had to re-boot my Windows PC due to incessant messages... continuing...

I would suggest that you review flock in the Perl docs. Basically, what they suggest is that you acquire an exclusive lock (a "write lock") on the file, do what you want as quickly as possible, position the file pointer at the end of the file (if you've been doing some seeking around), and then release the lock. If everybody (all programs) using that file does this, then everybody will be happy even though they share the same file.
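The sequence above (lock, seek to the end, write, release) can be sketched roughly as follows. The sub name append_line is an assumption for illustration, not something from the original scripts.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_END);

# Hypothetical helper: append one line to $path under an exclusive lock.
sub append_line {
    my ($path, $line) = @_;

    open(my $fh, '>>', $path) or die "cannot open $path: $!";
    flock($fh, LOCK_EX)       or die "cannot lock $path: $!";   # waits until granted

    # Someone may have appended while we were waiting for the lock,
    # so move to the current end of the file before writing.
    seek($fh, 0, SEEK_END)    or die "cannot seek $path: $!";
    print {$fh} "$line\n"     or die "cannot write $path: $!";

    # close flushes the buffer and then releases the lock.
    close($fh)                or die "cannot close $path: $!";
    return 1;
}
```

Keeping the locked section this short is what keeps the other "polite" programs from waiting long.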

If whoever is writing to this file is "not cooperating" with file locks, then there are going to be problems. A write or "print" operation is not "atomic" in a general sense. Version #1 can get a malformed line: there could be a partial line at the end of the file because the other writer is not "cooperating".

Version #2 will save memory (no quick in-memory copy of the file) at the expense of perhaps slowing down the "writers" to the file. But be aware that no matter what you do, if the programs are not cooperating, there is the possibility of malformed lines at the end of the file.

I guess I should point out that a "file lock" is a very performant operation because the OS keeps this status as a memory-resident structure. Acquiring and releasing a file lock on a zero-byte file is one way to coordinate some kinds of inter-process communication.

Another Update: See: temp files.
my $tempfile = "./tempfile".(rand()*9999);
is not a good way to do this. Ask the OS for a guaranteed unique temp file name. When you are finished with it, I would clean it up (delete it). The OS will view this file as "trash" and delete it whenever the next "clean up" operation is run on the system, but don't leave trash lying around if you can help it - drinking beer is ok, but don't leave the cans in the park.
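One way to get a guaranteed unique name without rand() is File::Temp, which ships with Perl itself, so no CPAN install should be needed. This is a minimal sketch; the sub name make_temp_in and the "logedit_" prefix are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: create a uniquely named, already opened temp file
# in directory $dir and return its handle and name.
sub make_temp_in {
    my ($dir) = @_;

    # The trailing X characters are replaced with random characters, and the
    # file is created in a way that fails if the name already exists.
    my ($fh, $name) = tempfile('logedit_XXXXXX', DIR => $dir, UNLINK => 0);
    return ($fh, $name);
}
```

Creating the temp file in the same directory as the file it will replace keeps a later rename on one filesystem. With UNLINK => 0 the caller is responsible for deleting or renaming the file when finished.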

Re^2: Trying to optimize reading/writing of large text files.
by nikkimouse (Initiate) on Jan 22, 2012 at 01:05 UTC
    Hi Marshall!
    Thanks for your reply.

    > File "locking" is a way for programs to cooperate if they choose to do so.
    Only other instances of the same script have access to the file, so I think there will be no problem.

    > I don't think that Version#1 works because I don't see an UNLOCK operation.
    Actually, Version#1 has worked for over a year. I thought that the close() operation unlocks the file. Isn't that correct?

    > my $tempfile = "./tempfile".(rand()*9999);
    > is not a good way to do this. Ask the OS for a guaranteed
    > unique temp file name. When you are finished with it, I would clean it up (delete it)

    I guess the rename() operation solves the problem by renaming/moving the temp file to the original filename. But I'm not sure how rename() interacts with flock and close(). Does it remove the previously made lock, or does the overwritten file stay locked until I close() it?
      Only other instances of the same script have access to the file, so I think there will be no problem.

      It wasn't clear to me who the "writer" was that put the data into this LOG file to begin with. It appeared that you were processing/modifying some file that some other process was appending to (without further info that's what I figured a LOG file was - the term "LOG" just conjures up that image). Perhaps a better name would be "SHARED_CONFIG" or some such name? What you name things within the program actually matters a lot! The use of the name "LOG" triggered some reflexive brain activity and a lot of assumptions about what this file meant.

      From your further description, it appears that you are using this file as a kind of IPC (inter process communication) between instances of your own program. You read this file in, process it, write it back out.

      flock mode 2 is LOCK_EX (exclusive lock, a "write lock"). From my reading of the Perl function, flock $fh,2 is a blocking call. In C there are other kinds of calls that would not "block" (in Perl, OR-ing LOCK_NB from Fcntl into the mode gives a non-blocking attempt). So your version #1 is ok given your additional description. You are doing this right: wait for exclusive access, do something and then release the lock.
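For comparison with the blocking call, here is a minimal sketch of a non-blocking attempt using LOCK_NB, which Fcntl exports alongside LOCK_EX. The sub name try_exclusive is an assumption for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: try to take an exclusive lock without waiting.
# Returns the locked handle on success, or nothing if the lock is busy.
sub try_exclusive {
    my ($path) = @_;

    open(my $fh, '>>', $path) or die "cannot open $path: $!";

    # LOCK_NB makes flock return false right away instead of waiting.
    return $fh if flock($fh, LOCK_EX | LOCK_NB);

    close($fh);
    return;    # somebody else holds the lock right now
}
```

The caller keeps the returned handle open for as long as it needs the lock and closes it to release.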

      Yes! Closing LOG will release the lock.
      You have to have a file open to have a lock!
      This looks like it could be a problem in Version #2

      To reduce memory usage, I understand that you want to process the LOG file line by line instead of making an in memory copy of it? ala: my @logfile=<LOG>;

      Ok, there is going to be a tradeoff between performance and memory usage. However, it sounds like the performance is secondary to memory usage (and of course both are less important than file integrity). However, I'm not sure that the first 2 matter - how big is this LOG file?

      If I have a lock on the LOG file, closing the LOG file will release the lock on it but that is an issue if this is being used for IPC coordination and you are replacing it with another file (the temp file).

      We want to: open the LOG file, process it line by line, write results to a temp file, then replace LOG with that temp file. To replace the log with the temp file (rename), you will have to close it, but closing that file would release the IPC coordination lock.

      One solution to this is to have a one-byte "FLAG" or coordination file (I think even zero bytes is ok). Do a blocking wait for an exclusive lock on that file, then do whatever you want with LOG. In this case, there only needs to be one "temp" file and its name doesn't matter much.

      Untested, but the general idea. You will have to add code to make sure that "flagfile" actually exists. It can even be just one byte (or I think in this case even zero bytes) - contents are meaningless.

      use Fcntl qw(:flock);   # imports LOCK_EX

      open (FLAG, '<', "flagfile") or die "cannot open flag $!";
      flock FLAG, LOCK_EX or die "cannot lock file! $!";   # mode 2

      # you have an exclusive lock to FLAG here..
      # by cooperation convention, that means I have exclusive
      # access to both the LOG file and the TEMP file
      # other processes don't update or use either

      open (LOG, '<', $file) or die "cannot open $file $!";
      open (TEMP, '>', "tempName") or die "cannot open tempName $!";
      # there will only be one "tempfile" at a time, so the name
      # doesn't matter much. Add your program name into it so that it is
      # unique amongst other processes and I think that's all you need
      # i.e. no rand() required.

      while (<LOG>)
      {
         # process each line in LOG
         print TEMP "whatever";
      }
      close LOG;    # this would release the lock
                    # but we are using a different lock
                    # for IPC coordination.
      unlink $file; # I think necessary for the rename of the temp
                    # file to the log file's name.
      close TEMP;
      rename "tempName", "$file"; # "LOG" replaced with
                                  # the edited version
      close FLAG;   # releases lock.

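The "make sure that flagfile actually exists" part can be handled by opening the flag file in append mode, which creates it when missing and never truncates it. Below is a minimal sketch that wraps the whole pattern in one sub; the name with_flag_lock and the callback style are assumptions for illustration, not part of the original code.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: run $work->() while holding an exclusive lock on
# the coordination file $flag_path, then release the lock.
sub with_flag_lock {
    my ($flag_path, $work) = @_;

    # '>>' creates the flag file if it is missing and leaves it alone if not.
    open(my $flag, '>>', $flag_path) or die "cannot open $flag_path: $!";
    flock($flag, LOCK_EX)            or die "cannot lock $flag_path: $!";

    my $result = eval { $work->() };   # trap errors so the lock is still released
    my $err    = $@;

    close($flag);                      # releases the lock
    die $err if $err;                  # re-raise whatever $work died with
    return $result;
}
```

The read-LOG, write-TEMP, rename steps would go inside the callback, for example with_flag_lock("flagfile", sub { ... }).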
      Update: Now that I understand the application better, I would definitely be thinking in terms of Grandfather's suggestion to use a DB. I've become quite enamored with SQLite because it doesn't have all of the admin baggage that a "real SQL server" has (and I do have an SQL daemon running on my machine - so I know at least something about the hassle this involves).

      However, the above is fairly simple and will work well and fit in with the OP's current code.

      I liked your questions and I up-voted that post. You added some additional information that helped me a lot to understand what you are doing.

      The basic issue here is: how to make updates to this shared file "atomic" - meaning it works or doesn't work and no partial updates are allowed.

      If you want to essentially delete the LOG file and replace it with another file, then instead of "locking" the LOG file you need a lock on something else for coordination, because the lock on the LOG file will disappear when you delete it (and I think you need to do that in order to replace it with the TEMP file via rename). No lock on the TEMP file is needed because there will only be one temp file at a time. And a "read lock" on the LOG file does no good. We need to gain exclusive access to this critter and then update it.
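One way to see why a lock on LOG itself stops being useful once the file is replaced: on a POSIX-like filesystem, a lock belongs to the open file, not to its name. This is a minimal sketch under that assumption (on Windows the rename over an open file may simply fail); the sub name inodes_after_rename is made up for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: lock $log, rename $temp over it, and report which
# file the locked handle refers to versus which file the name now refers to.
sub inodes_after_rename {
    my ($log, $temp) = @_;

    open(my $old, '<', $log) or die "cannot open $log: $!";
    flock($old, LOCK_SH)     or die "cannot lock $log: $!";

    rename($temp, $log)      or die "cannot rename $temp: $!";

    my $held_inode = (stat($old))[1];   # the file we actually locked
    my $path_inode = (stat($log))[1];   # the file newcomers will open

    close($old);
    return ($held_inode, $path_inode);
}
```

If the two numbers differ, the handle still holds its lock, but on a file that nobody opening the name will ever see. That is why the coordination lock has to live on a file that is never replaced.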

      I've tried hard to explain this. Let me know what isn't clear. And of course, it's always possible that I've made some error. So please let us know how this works out!

        I'm sure that a DB is the best solution, but this script is running in a very restricted environment where there is no access to SQL or even to CPAN modules. So I have to code it in pure Perl.

        The LOG file is (potentially) heavily accessed by numerous script instances. About 90% of the time it's READ access, and 10% is a READ-MODIFY-OVERWRITE process. The code we are discussing here is the READ-MODIFY-WRITE part of the program. Actually, I used "flock LOG, 1" because I wanted to let other instances have read-only access to the LOG (even if its contents are outdated).

        The "flag" file is a good idea. I think it would finally resolve the problem of possible file corruption. But it adds one more file operation and may affect performance. So I'm going to experiment a little bit and benchmark different versions of this code to see which is best.

        By the way, I'm still in doubt whether the lock established by "flock LOG, 1" will be removed by the rename() operation. And this is very important to know. When I wrote Version#2, I supposed that the rename() operation uses system functionality to physically overwrite LOG with TEMP (i.e. it doesn't interfere with flock). And, since LOG is opened read-only, and flock is virtual and affects only cooperating scripts, there is probably a chance that the involved scripts will continue to obey this lock until it's released by the close() operation, even if the file was physically overwritten.
        Are these assumptions mistaken?