dsheroh has asked for the wisdom of the Perl Monks concerning the following question:

I have an app which parses web server logs periodically and, to save on repeating work that it's already done, it does a tell on the filehandle at the end of each run, then seeks back to that position on the next run (unless the inode has changed):
open my $fh, '<', $filename or die "Can't open $filename: $!\n";
_restore_offset($filename, $fh);
while (my $line = <$fh>) {
    # do stuff here
}
_record_offset($filename, $fh);
close $fh;

sub _restore_offset {
    my ($filename, $fh) = @_;
    # Get $offset and $last_inode from database
    my $current_inode = (stat $fh)[1];
    return unless $current_inode == $last_inode;
    seek $fh, $offset, 0;
}

sub _record_offset {
    my ($filename, $fh) = @_;
    my $offset = tell $fh;
    my $inode  = (stat $fh)[1];
    # Stuff $offset and $inode back into database
}
This seems to work perfectly on my test system, where apache is mostly idle.

Moving to a more heavily-trafficked server, however, there are issues with the first line read in a new run being incomplete, with the first part of the line missing, presumably because the seek landed in the middle of the line. (I would blame this on log rotation if I weren't already explicitly checking for an inode change to catch that.)

What's the best/most straightforward way to deal with this (without defeating its purpose by always reading the file from the beginning)?

Replies are listed 'Best First'.
Re: Reading only new lines from a file
by snopal (Pilgrim) on Aug 31, 2007 at 21:14 UTC

    It seems like your code assumes that at the moment it records the offset it "must be on a newline", which you can never be certain of. Essentially, you are recording the position of the last character written at the moment of the tell, which could fall anywhere in a line.

    == Desire is one product of absence. -- Stephen Opal ==
      Good point... Modifying it to do a tell after each read, and then passing _record_offset the position after the last read that ended in a newline, should take care of that.
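      A minimal sketch of that change, reusing the names from the root node and assuming _record_offset is extended to take the offset as an extra argument:

      my $safe_offset = tell $fh;        # position after the last complete line
      while (my $line = <$fh>) {
          last unless $line =~ /\n\z/;   # partial line at EOF: reread it next run
          # do stuff here
          $safe_offset = tell $fh;       # now sitting just past a newline
      }
      _record_offset($filename, $fh, $safe_offset);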
Re: Reading only new lines from a file
by bruceb3 (Pilgrim) on Sep 01, 2007 at 00:53 UTC
    Are you sure that the log rotation isn't being done with a cp instead of a mv of the log file that is currently being written to? Your code is written with the expectation that the log rotation software is performing the equivalent of an mv command, i.e. copy the file to a new name and then delete the old file.

    It's possible (and better) that the log rotation software is taking the following steps to rotate the logs:

    1. $ mv logfile.2 logfile.3
    2. $ mv logfile.1 logfile.2
    3. $ cp logfile logfile.1
    4. $ >logfile

    For this example, only three old log files are being kept.

    The last step will truncate the logfile to 0 bytes but doesn't cause a new file to be created so the inode number will not change. This is a better method because the process that is writing to the log file doesn't need to be restarted (or HUP'd).

    Depending on how often you monitor the log file, simply keeping track of its size will let you know when the file has been truncated. If the file is smaller than it was last time, you can be confident that it has been truncated and you need to process all of it. Of course, there is the risk that if the file is being written to quickly, this method won't pick up all of the changes.
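    A hypothetical sketch of that check, assuming a $last_size is stored in the database alongside the $offset and $last_inode from the root node:

    sub _restore_offset {
        my ($filename, $fh) = @_;
        # Get $offset, $last_inode and $last_size from database
        my ($inode, $size) = (stat $fh)[1, 7];
        return if $inode != $last_inode;   # rotated by rename: read from the top
        return if $size < $last_size;      # truncated in place: read from the top
        seek $fh, $offset, 0;
    }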

      That's a mildly annoying possibility, since it means an attempt to detect rotation could be foiled by, say, a slashdotting following a slow day, although I do see the advantage in doing it that way. Thanks for bringing it up.
        To detect a log file rotation, you could keep track of the inode number of logfile.2, but it would be necessary to examine the inode number more frequently than the log rotation software runs.
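        For instance (a hypothetical fragment, assuming $prev_inode2 is stored between runs the same way as the offset):

        my $inode2 = (stat 'logfile.2')[1];
        if (defined $inode2 and $inode2 != $prev_inode2) {
            # logfile.2 was replaced, so a rotation has happened since the last check
        }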
Re: Reading only new lines from a file
by jdporter (Paladin) on Aug 31, 2007 at 22:37 UTC

    Assuming each line has something unique in it, such as a timestamp, you should only need to read the file backwards from the end, and stop when you read a line you've read before. Each run records the last line it has read, rather than the tell position.

    The hard part of this has already been done for you by modules such as File::ReadBackwards and File::Bidirectional.
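    A minimal sketch with File::ReadBackwards, assuming the previous run stored its final line in $last_seen_line:

    use File::ReadBackwards;

    my $bw = File::ReadBackwards->new($filename)
        or die "Can't open $filename: $!\n";
    my @new_lines;
    while (defined( my $line = $bw->readline )) {
        last if defined $last_seen_line and $line eq $last_seen_line;
        unshift @new_lines, $line;   # restore chronological order
    }
    # process @new_lines, then store the last of them as the new marker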

    A word spoken in Mind will reach its own level, in the objective world, by its own weight
      Based on another recent thread (Emulating command line pipe), it appears that File::ReadBackwards, at least, is rather slow and quick execution is a concern here. File::Bidirectional I don't know enough to comment on.
Re: Reading only new lines from a file
by FunkyMonk (Bishop) on Aug 31, 2007 at 21:02 UTC
    What about File::Tail? I have no experience of the module, but perhaps looking at the source may give an indication of what it does that you don't.

      Given the limitations of File::Tail, for such tasks I just
      open my $tailfh, "tail -F $logfile |" or die "blunze: $!\n";

      tail -F handles re-opening the logfile on inode change and file truncation. To process the gathered lines in chunks, I write them to another file:

      my @lines;
      while (<$tailfh>) {
          push @lines, $_;
          unless (-f $lockfile) {
              open my $chunkfh, '>>', $chunkfile or die "more blunze: $!\n";
              select((select($chunkfh), $| = 1)[0]);   # autoflush $chunkfh; or use IO::File and autoflush
              print $chunkfh @lines;
              close $chunkfh;
              @lines = ();
          }
      }

      Then another process can touch the lockfile, process the lines in the chunkfile, truncate it and remove the lockfile. The line gathering process then flushes its @lines.
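      A hypothetical sketch of that consumer process, using the same $lockfile and $chunkfile names as above (truncate before unlink, so the gatherer can't append to a file that's about to be cleared):

      open my $lk, '>', $lockfile or die "lock: $!\n";   # "touch" the lockfile
      close $lk;
      open my $chunkfh, '<', $chunkfile or die "chunk: $!\n";
      while (my $line = <$chunkfh>) {
          # process $line here
      }
      close $chunkfh;
      truncate $chunkfile, 0 or die "truncate: $!\n";
      unlink $lockfile or die "unlock: $!\n";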

      update: if you're wondering what blunze is - that's black pudding (or blood sausage)

      update 2: moved the open/close inside the loop, changed the open mode to append.

      --shmem

      _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                    /\_¯/(q    /
      ----------------------------  \__(m.====·.(_("always off the crowd"))."·
      ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      Looking over the File::Tail docs, it appears that it just opens the file once, then keeps it open while watching for new additions. I'm processing the file to the end, exiting, and then re-opening it to get the new stuff later, so not quite the same situation.