in reply to Re: Trying to optimize reading/writing of large text files.
in thread Trying to optimize reading/writing of large text files.

Hi Marshall!
Thanks for your reply.

> File "locking" is a way for programs to cooperate if they choose to do so.
Only other instances of the same script have access to the file, so I think there will be no problem.

> I don't think that Version#1 works because I don't see an UNLOCK operation.
Actually, Version#1 has worked for over a year. I thought that the close() operation unlocks the file. Isn't that correct?

> my $tempfile = "./tempfile".(rand()*9999);
> is not a good way for this. Ask the OS for a guaranteed
> unique temp file name. When you are finished with it, I would clean it up (delete it)

I guess the rename() operation solves the problem by renaming/moving the temp file to the original filename. But I'm not sure how rename() interacts with flock and close(). Does it remove the previously made lock, or does the overwritten file continue to be locked until I close() it?

Re^3: Trying to optimize reading/writing of large text files.
by Marshall (Canon) on Jan 22, 2012 at 16:10 UTC
    Only other instances of the same script have access to the file, so I think there will be no problem.

    It wasn't clear to me who the "writer" was that put the data into this LOG file to begin with. It appeared that you were processing/modifying some file that some other process was appending to (without further info that's what I figured a LOG file was - the term "LOG" just conjures up that image). Perhaps a better name would be "SHARED_CONFIG" or some such name? What you name things within the program actually matters a lot! The use of the name "LOG" triggered some reflexive brain activity and a lot of assumptions about what this file meant.

    From your further description, it appears that you are using this file as a kind of IPC (inter process communication) between instances of your own program. You read this file in, process it, write it back out.

    flock mode 2 is LOCK_EX (an exclusive lock, a "write lock"). From my reading of the Perl function, flock $fh, 2 is a blocking call: it waits until the lock is granted, unless LOCK_NB is added to the mode. In C there are other kinds of calls that would not "block". So your version #1 is ok given your additional description. You are doing this right: wait for exclusive access, do something, and then release the lock.

    Yes! Closing LOG will release the lock.
    You have to have a file open to have a lock!
    This looks like it could be a problem in Version #2.
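
    A minimal sketch of the two points above (a lock held until close, and a non-blocking request via LOCK_NB). The file name is a hypothetical placeholder, and the first result assumes a system where Perl uses the native flock(2), so that two handles in one process hold independent locks; under an fcntl-based emulation the first request may be granted as well.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical file name, used only for this illustration.
my $path = "flock_demo_$$.txt";

# First handle: plain LOCK_EX blocks until the lock is granted.
open(my $first, '>>', $path) or die "cannot open $path: $!";
flock($first, LOCK_EX) or die "cannot lock $path: $!";

# Second, independent handle: LOCK_NB makes flock return at once
# instead of waiting for the first handle to let go.
open(my $second, '>>', $path) or die "cannot open $path: $!";
my $while_held = flock($second, LOCK_EX | LOCK_NB) ? "granted" : "refused";

# Closing the first handle releases its lock ...
close $first;

# ... so the same non-blocking request can now succeed.
my $after_close = flock($second, LOCK_EX | LOCK_NB) ? "granted" : "refused";

print "while the first handle held the lock: $while_held\n";
print "after the first handle was closed:    $after_close\n";

close $second;
unlink $path;
```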

    To reduce memory usage, I understand that you want to process the LOG file line by line instead of making an in memory copy of it? ala: my @logfile=<LOG>;

    Ok, there is going to be a tradeoff between performance and memory usage. However, it sounds like the performance is secondary to memory usage (and of course both are less important than file integrity). However, I'm not sure that the first 2 matter - how big is this LOG file?

    If I have a lock on the LOG file, closing the LOG file will release that lock. That is a problem if the lock is being used for IPC coordination and you are replacing the file with another one (the temp file).

    We want to: open the LOG file, process it line by line, write the results to a temp file, then replace LOG with that temp file. To replace LOG with the temp file (rename), you will have to close it, but closing that file would release the IPC coordination lock.

    One solution to this is to have a one byte length "FLAG" or coordination file (I think even zero bytes is ok). Do a blocking wait for an exclusive lock on that file, then do whatever you want with LOG. In this case, there only needs to be one "temp" file and its name doesn't matter much.

    Untested, but the general idea. You will have to add code to make sure that "flagfile" actually exists. It can even be just one byte (or I think in this case even zero bytes) - contents are meaningless.

        use Fcntl qw(:flock);   # imports LOCK_EX

        open (FLAG, '<', "flagfile") or die "cannot open flag $!";
        flock (FLAG, LOCK_EX) or die "cannot lock file! $!";   # mode 2

        # you have an exclusive lock on FLAG here..
        # by cooperation convention, that means I have exclusive
        # access to both the LOG file and the TEMP file;
        # other processes don't update or use either

        open (LOG, '<', $file) or die "cannot open $file $!";
        open (TEMP, '>', "tempName") or die "cannot open tempName $!";
        # there will only be one "tempfile" at a time, so the name
        # doesn't matter much. Add your program name into it so that it is
        # unique amongst other processes and I think that's all you need
        # i.e. no rand() required.

        while (<LOG>)
        {
           # process each line in LOG
           print TEMP "whatever";
        }
        close LOG;     # this would release a lock on LOG,
                       # but we are using a different lock
                       # for IPC coordination.
        unlink $file;  # I think necessary for the rename of the temp
                       # file to the log file's name.
        close TEMP;
        rename "tempName", "$file";  # "LOG" replaced with
                                     # the edited version
        close FLAG;    # releases lock.

    Update: Now that I understand the application better, I would definitely be thinking in terms of Grandfather's suggestion to use a DB. I've become quite enamored with SQLite because it doesn't have all of the admin baggage that a "real SQL server" has (and I do have an SQL daemon running on my machine - so I know at least something about the hassle this involves).

    However, the above is fairly simple and will work well and fit in with the OP's current code.
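
    For reference, the flag-file idea can also be written as a small self-contained sub. This is only a sketch under stated assumptions: the sub name, the file names and the ".tmp" suffix are hypothetical, the flag file is opened with '>>' so that it is created if missing, and rename is assumed to replace an existing file (as it does on POSIX systems), so the sketch has no unlink step.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: rewrite $log line by line while holding an
# exclusive lock on a separate flag file.
sub update_log {
    my ($log, $flag, $edit) = @_;
    my $temp = "$log.tmp";    # only one writer at a time, so one name is enough

    # '>>' creates the flag file if it does not exist yet.
    open(my $flag_fh, '>>', $flag) or die "cannot open $flag: $!";
    flock($flag_fh, LOCK_EX)       or die "cannot lock $flag: $!";

    open(my $in,  '<', $log)  or die "cannot open $log: $!";
    open(my $out, '>', $temp) or die "cannot open $temp: $!";
    while (my $line = <$in>) {
        print {$out} $edit->($line);    # one line in memory at a time
    }
    close $in;
    close $out or die "cannot finish writing $temp: $!";

    # Assumes rename replaces an existing $log (POSIX behaviour).
    rename($temp, $log) or die "cannot rename $temp to $log: $!";

    close $flag_fh;    # releases the coordination lock
    return;
}

# Usage with throwaway files: upper-case every line.
my $log  = "demo_log_$$.txt";
my $flag = "demo_flag_$$.lck";
open(my $fh, '>', $log) or die "cannot create $log: $!";
print {$fh} "alpha\nbeta\n";
close $fh;

update_log($log, $flag, sub { uc $_[0] });

open($fh, '<', $log) or die "cannot reopen $log: $!";
my @lines = <$fh>;
close $fh;
print @lines;

unlink $log, $flag;
```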

Re^3: Trying to optimize reading/writing of large text files.
by Marshall (Canon) on Jan 22, 2012 at 18:10 UTC
    I liked your questions and I up-voted that post. You added some additional information that helped me a lot to understand what you are doing.

    The basic issue here is how to make updates to this shared file "atomic", meaning an update either works or doesn't work, with no partial updates allowed.

    If you want to essentially delete the LOG file and replace it with another file, then instead of "locking" the LOG file you need a lock on something else for coordination, because the lock on the LOG file will disappear when you delete it (and I think you need to do that in order to replace it with the TEMP file via rename). No lock on the TEMP file is needed because there will only be one temp file at a time. And a "read lock" on the LOG file does no good. We need to gain exclusive access to this critter and then update it.
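
    A small sketch of why a lock on LOG itself stops coordinating anything once LOG is replaced. It assumes POSIX rename semantics (on other platforms the rename over an open file may simply fail), and the file names are hypothetical. The lock stays attached to the old, already-open file, while anyone who opens the name afterwards gets the new file, which nobody has locked.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical file names for this illustration only.
my $log  = "rename_demo_$$.log";
my $temp = "rename_demo_$$.tmp";

open(my $w, '>', $log) or die "cannot create $log: $!";
print {$w} "old\n";
close $w;

# A process opens LOG and takes a lock on it.
open(my $held, '<', $log) or die "cannot open $log: $!";
flock($held, LOCK_SH)     or die "cannot lock $log: $!";

# The edited copy is written elsewhere and renamed over LOG.
open(my $t, '>', $temp) or die "cannot create $temp: $!";
print {$t} "new\n";
close $t;
rename($temp, $log) or die "cannot rename $temp to $log: $!";

# The locked handle still reads the file that was replaced ...
my $via_handle = <$held>;

# ... while a fresh open of the same name gets the new file, and a
# lock request on it is not held back by the lock on the old one.
open(my $fresh, '+<', $log) or die "cannot reopen $log: $!";
my $fresh_lock = flock($fresh, LOCK_EX | LOCK_NB) ? "granted" : "refused";
my $via_name   = <$fresh>;

print "locked handle reads: $via_handle";
print "fresh open reads:    $via_name";
print "lock on fresh open:  $fresh_lock\n";

close $held;
close $fresh;
unlink $log;
```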

    I've tried hard to explain this. Let me know what isn't clear. And of course, it's always possible that I've made some error. So please let us know how this works out!

      I'm sure that a DB is the best solution, but this script runs in a very restricted environment where there is no access to SQL or even to CPAN modules. So I have to code it in pure Perl.

      The LOG file is (potentially) heavily accessed by numerous script instances. About 90% of the time it's READ access, and 10% is a READ-MODIFY-OVERWRITE process. The code we are discussing here is the READ-MODIFY-WRITE part of the program. Actually I used "flock LOG, 1" because I wanted to let other instances have read-only access to the LOG (even if its contents are outdated).
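
      If a separate flag file is used for coordination, the read-only 90% could be sketched as below: readers take a shared lock on the flag file, so they can run alongside each other but not alongside a writer holding the exclusive lock. This is an assumption about how the two access paths might be combined, not code from Version#1 or Version#2; the sub name and file names are hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: count the lines of $log while holding a shared
# lock on the flag file.
sub read_log_lines {
    my ($log, $flag) = @_;

    # '+>>' creates the flag file if needed and opens it read/write,
    # so the same handle is acceptable for a shared or exclusive lock.
    open(my $flag_fh, '+>>', $flag) or die "cannot open $flag: $!";
    flock($flag_fh, LOCK_SH)        or die "cannot lock $flag: $!";

    open(my $in, '<', $log) or die "cannot open $log: $!";
    my $count = 0;
    while (my $line = <$in>) {
        $count++;    # a real reader would inspect $line here
    }
    close $in;

    close $flag_fh;  # releases the shared lock
    return $count;
}

# Usage with throwaway files.
my $log  = "reader_demo_$$.log";
my $flag = "reader_demo_$$.lck";
open(my $fh, '>', $log) or die "cannot create $log: $!";
print {$fh} "one\ntwo\nthree\n";
close $fh;

my $n = read_log_lines($log, $flag);
print "lines read: $n\n";

unlink $log, $flag;
```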

      The "flag" file is a good idea. I think it would finally resolve the problem with possible file corruption. But it adds one more file operation and may affect performance. So I'm going to experiment a little and benchmark different versions of this code to see which is best.

      By the way, I'm still in doubt whether the lock established by "flock LOG, 1" will be removed by the rename() operation, and this is very important to know. When I wrote Version#2, I supposed that rename() uses system functionality to physically overwrite LOG with TEMP (i.e. it doesn't interfere with flock). And, since LOG is opened read-only and flock is "virtual" (advisory), affecting only cooperating scripts, there is probably a chance that the scripts involved will continue to obey this lock until it's released by the close() operation, even if the file was physically overwritten.
      Are these assumptions mistaken?
        The post above is mine. My session had just expired and I was shown as "Anonymous Monk" :)

        A little update: I have had the unmodified Version#2 running for several hours under heavy load (10-15 scripts at once) and there are still no file corruptions. But it's probably just luck. My question about the interaction between flock() and rename() is still open...