filesystems & Perl

harangzsolt33 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: filesystems & Perl (updated!) by haukex (Archbishop) on Aug 12, 2016 at 17:34 UTC
Hi harangzsolt33, I'm not sure how Perl implements readdir internally, but if POSIX.1-2008 is anything to go by: ... files may be removed from a directory or added to a directory asynchronously to the operation of readdir(). ... If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified. Although I suspect the behavior would vary between operating systems and file systems, ~~I would be surprised if files that have not been added or deleted were skipped by readdir~~. However, I am wondering why this is important to you? Have you discovered a case where a file is not being reported by readdir, or are you experiencing some other kind of trouble? Are you, for example, copying files from one place to another and are worried that some might be missed? Update: This article on Mac OS X says that on HFS, unlinking files while using readdir() may cause some files to be skipped. If you want to delete all files in a directory, the suggestion is to use rewinddir until all files have been unlinked. Update 2: ... and if you look at rewinddir in perlport, you'll see that method won't work on Win32. On the other hand, search for "readdir" in "The Linux Programming Interface" and on page 354 you'll see the mention that files that have not been added or deleted are guaranteed to be returned. So again, in order to find the best solution, I think you need to describe the problem you're trying to solve some more. Regards, -- Hauke D	[reply]
Re^2: filesystems & Perl (updated!) by harangzsolt33 (Deacon) on Aug 13, 2016 at 03:21 UTC
I want to create a Perl program that records a bunch of information about each visitor to the web page. Kind of like a page counter. So, here is how it would work: A visitor clicks on my web page. Somewhere at the end of the HTML page, a JavaScript program will collect a bunch of info and load my Perl script "disguised" as an image. The Perl program reads QUERY_STRING. Opens a file and appends the info to the file. Closes the file and sends back a tiny 1x1 image to satisfy the browser. I want to be able to see where my visitors come from, their IP addresses, screen resolution, web browser, operating system, exactly what time they clicked on my site, and everything that can be known. I have started learning Perl in May 2016, so I am just a beginner. And this is a learning exercise. I have already written most of the code, but then I realzied something. Wait a minute. I need to think more about the whatifs. It might not work always! :P So, here are some of the issues that I can think of : 1. My script needs to automatically check the size of this ever-growing data file and make sure it doesn't exceed 300 lines. because let's say, I only want to keep info about the last 300 visitors. I don't want to end up with a log file that's 200MB. lol 2. Let's say I am giving away free cars and my web page gets 1000 hits per minute. Now, in order to make sure that I can always get write access to my log file, I am going to store each entry in a temporary text file in a folder. So, thousands of files are going to get created there. Every time someone pulls up my website, there's a new text file on my server! Obviously, I need to delete some of them. I only need to keep the last 300 files. So, every now and then, my script will do garbage collection and get rid of some files. 3. Since I am the admin, I also want to write a Perl program that will read all the temp files that exist at the time and merge them and format them and print me out a nice report. But I am worried about the flexible nature of this folder. Will it mess up readdir? or will I be able to get a full list? Since I do not have such a busy website, there is no way for me to test what would happen. All I can do is think about this in theory and go through all the possible whatifs. :P 4. Another solution might be to use numbered files 001.txt 002.txt 003.txt and so on. So, I would have 300 text files there at all times. And my script would simply pick out and overwrite the oldest one in the list. That way I would not have to delete and create files. Sounds very simple, but it's not! This might lead to data loss, because two or more instances of my Perl script may want to write to the SAME file all at once. :P Help me please! This seems such a simple program, yet it's so difficult. How should I do this? :(	[reply]
Re^3: filesystems & Perl (updated!) by hippo (Archbishop) on Aug 13, 2016 at 09:26 UTC
Use a database and let the many developers who created the DBMS worry about the locking and concurrency issues instead. Since you have your arbitrary limit of 300 records it can use in-memory tables too thus avoiding hammering the filesystem.	[reply]
Re^3: filesystems & Perl (updated!) by haukex (Archbishop) on Aug 13, 2016 at 09:50 UTC
Hi harangzsolt33, Yes, exactly what hippo said - a database can solve all of the problems you named. You'll find many tutorials on DBI on the web, including several on this site. Historically one might have solved this using flock to control read/write access, but that can be finicky and hard to get right (assuming it even works on your OS+FS); so I suggest you spend your time learning some basic DBI instead. Hope this helps, -- Hauke D	[reply]
Re^3: filesystems & Perl (updated!) by FreeBeerReekingMonk (Deacon) on Aug 13, 2016 at 19:34 UTC
for point 2, consider opening a file as "./name.tmp", then write all the lines to it, then close the file so all content gets flushed to disk, and THEN rename it to your valid extension, like .dat with `rename $temporal_name, $realfile_name;` Keeps operations pseudo-atomic for your readdir's.	[reply] [d/l]
Re: filesystems & Perl by davido (Cardinal) on Aug 12, 2016 at 20:22 UTC
"Eternity is a mere moment, just long enough for a joke." -- Hermann Hesse It's best to look at the list of files in a directory as an approximation of reality. Unless you have a lock on a file, and everyone else is respecting the advisory lock, that file can come and go, or be altered at any time, even while you have it opened. From the moment you read the next entry via readdir to the moment you open the file, to the moment you obtain a lock, those brief moments become an eternity of computing-time during which a bad joke can be played on you. The only certainty you have is after you've obtained a lock, and even then only if other processes are respecting the lock. If the directory can possibly change, you'll need to be tolerant of that possibility. As for opendir and readdir in particular, I would not make critical assumptions about when they will be made aware of changes to the underlying directory's contents. Dave	[reply]
Re^2: filesystems & Perl by $h4X4_\|=73}{ (Monk) on Aug 13, 2016 at 08:53 UTC
Dear Mr. Oswald, Great video! ++ Thank you, Flex	[reply]
Re: filesystems & Perl by Marshall (Canon) on Aug 12, 2016 at 21:13 UTC
Since you are operating on a directory where other processes are removing and creating files asynchronously to your Perl script, there are going to be race conditions no matter how readdir() works. Your code will have to handle these edge cases. If readdir() gives you a filename, that filename may not even exist in the directory by the time you get around to using a file test on that filename. Example: in your code `while (my $SUB = readdir $DIR)`, by the time your code gets to the file tests on $SUB (`if (-d $SUB) {}`), that file may not even exist anymore! When you call readdir(), it is traversing a data structure from the file system. In the general case, what happens if a new entry into that struct occurs while you are traversing that struct, is undefined. But what happens even if that "new" file is read by readdir? What is the difference between readdir "just ended" and immediately after that (maybe even, one nanosecond), a new file appears? It could be that if you "miss" a file, that doesn't even matter because you will pick it up on the next run? Since the directory is constantly changing, you are going to have to process it repeatedly. There will no "I'm finished, i.e. done", except you can say "for this instant, 'I am done'". When processing a directory, I usually use readdir() in a list context. `foreach my $file (grep{..condx..}readdir){}` I usually just don't keep traversing the readdir() structure while deleting files. I take a "snapshot" and then process all files in the snapshot, realizing that the "snapshot" is not perfect. There was some mention of rewinddir(). The sequence of close() and opendir() will do the same albeit much slower. The idea is to restart the readdir() structure traversal anew. It would be helpful if you could explain the problem symptoms that you are having with your code. Also. I am curious about these files that "come and go". There is a difference between deleting a file and creating a brand new one vs re-opening a filename for re-write. If there is more than one process mucking with the files in the directory, (adding or deleting), then how are they coordinating their actions?	[reply] [d/l] [select]
Re: filesystems & Perl by $h4X4_\|=73}{ (Monk) on Aug 12, 2016 at 22:12 UTC
Let's say you read the directory, but while you're reading, the next file gets erased. This is called a Race Condition. You can make your own File Lock that stops files from being erased while reading the directory. This one video comes to mind that can explain it better if you have time to view it... YAPC::NA 2016 - Locking with Perl (...hard things should be possible)‎ - David Oswald	[reply]