in reply to Massive File Editing

Hi Kage,
I know the others are suggesting moving towards File::Find or some other function, which is a great idea if you can install the modules on the end host. I recently had a similar problem where I could not easily install modules (long story) and had to write the function from scratch. What I did instead was feed an ls -laR listing into the program, parse out the files I wanted, and then modify those files.

Example:

# Try to match the expected input line format (from "ls" output)
if ("$_" =~ /^\-.+ ([0-9]+) ([A-Z|a-z]+ [ ]?[0-9]+ [ ]?[:|0-9]+) (.+)$/) {

    # Set some defaults to avoid potentially problematic missing fields
    $file1 = "FULLNAME";
    $file2 = "BASENAME";
    $fext  = "NO EXTENSION";

    # Set file size, date and complete filename variables
    $fsize = $1;
    $date  = $2;
    $file1 = $3;

    if ("$file1" =~ /^([\.]?.+)\.(.+)$/) {
        $file2 = $1;
        $fext  = $2;
    }
}
Then, for your example, you would test whether the file extension was .shtml and, if it was, open the file and read it. Whether to read the file in as a glob (slurp) or line by line really depends on two issues (a rough sketch of both approaches follows the list):

1) How many times you plan to run this. Let's be honest: if you're only going to run it once, you don't need a perfectly efficient piece of code, even though I hate to admit that.

2) How many files you'll be reading in, and how large they are.
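
For what it's worth, here is a rough sketch of both approaches (the @files array is just a stand-in for whatever list of names you pull out of the ls listing):

    # Line by line: low memory use, fine for huge files
    foreach my $file (@files) {
        open my $fh, '<', $file or die "Can't open $file: $!";
        while (my $line = <$fh>) {
            # work on one line at a time
        }
        close $fh;
    }

    # Slurp the whole file (the "glob" approach): simplest if the files are small
    foreach my $file (@files) {
        open my $fh, '<', $file or die "Can't open $file: $!";
        my $contents = do { local $/; <$fh> };   # undef the record separator and slurp
        close $fh;
        # work on the whole file at once
    }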

Then, as you hit the line, you could either do an s/// or just replace the contents of the substring. I like to cheat with a sanity test, since substitution operations always seem to do bad things to my data.

if ($_ =~ /<a href="main\.php\?page/) { s/main\.php\?page=/main.php?id=/g }
Then the file open operation should be pretty straightforward (no more directory recursion, woo!). If you have problems with the output of the ls statement you may have to keep track of the directory and prepend it to the filename yourself, but other than that it should be simple. I had to write this to deal with a terabyte file system in a lawsuit, so my solution may require more work than you are willing to deal with.
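
Put together, the edit step might look something like this sketch, assuming $file1 and $fext were set by the parsing code above (prefix the directory yourself if the names are relative):

    if ($fext eq "shtml") {
        open my $in,  '<', $file1       or die "Can't read $file1: $!";
        open my $out, '>', "$file1.new" or die "Can't write $file1.new: $!";
        while (<$in>) {
            s/main\.php\?page=/main.php?id=/g if /<a href="main\.php\?page/;
            print $out $_;
        }
        close $in;
        close $out;
        rename "$file1.new", $file1 or die "Can't rename $file1.new: $!";
    }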

Dave -- Saving the world one node at a time

Re^2: Massive File Editing
by Aristotle (Chancellor) on Dec 16, 2002 at 19:14 UTC

    Bad advice.

    File::Find has been part of the core Perl distribution forever. If it isn't available on your host, it means their installation of Perl is incomplete. Complain to them and if they don't react, move to somewhere else. There is no excuse for not offering File::Find.
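
    For instance, a minimal File::Find sketch that just collects the *.shtml files under the current directory is only a few lines:

        use File::Find;

        my @shtml;
        find(sub { push @shtml, $File::Find::name if -f && /\.shtml$/ }, '.');
        print "$_\n" for @shtml;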

    I feed an ls -laR listing
    How robust is your ls parsing pattern? And why not use find for the job of find? Something like the following does all you want, with minimal coding of your own:

        $ find . -type f -name '*.shtml' -print0 | xargs -0 ./myscript.pl

    Iterate over @ARGV using the diamond operator; it might even suffice to do something like

        $ find . -type f -name '*.shtml' -print0 | xargs -0 perl -i.old -pe's!/main\.php\?page=!/id=!g'

    See perldoc perlrun. Use the tools intended for your job to do your job; don't reinvent round wheels.
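
    In the first form, myscript.pl needs to be little more than this sketch (same substitution as the one-liner):

        #!/usr/bin/perl
        # Filenames arrive in @ARGV from xargs; the diamond operator
        # opens and reads each of them in turn.
        $^I = '.old';                    # edit in place, keeping a .old backup
        while (<>) {
            s!/main\.php\?page=!/id=!g;
            print;                       # with $^I set, this writes back into the file
        }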

    Makeshifts last the longest.

      Hi Aristotle,

      Had no idea File::Find was part of the core distribution. The reason I did not use a find function in my example was that I had to parse every file in the filesystem.

      Why? We had a 2 terabyte file system from a litigation that we needed to type, index, hash and store in a MySQL database. Then, when we needed to find files of certain types, patterns, sizes or dates, we could query the hashed index instead of running find each time. I agree that if you are just looking for a certain type of file this is not the best idea, but in my situation it had to be done. However, if you know of a better way to do this, PPLLEEASSEE let me know.
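
      Roughly, a lookup against such an index is then just a query; this is only a sketch, and the table and column names here are illustrative rather than our actual schema:

          use DBI;

          my $dbh = DBI->connect('dbi:mysql:database=fileindex', 'user', 'pass',
                                 { RaiseError => 1 });
          my $files = $dbh->selectcol_arrayref(
              'SELECT path FROM files WHERE extension = ? AND size > ?',
              undef, 'shtml', 1_000_000,
          );
          print "$_\n" for @$files;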

      Not the best advice, but it worked for me.

      Dave -- Saving the world one node at a time

        I see your point - and I concur that using a database was the better choice in this case (locate operates in much the same way, for example). Though I'd still use File::Find, or at least find, to scan the filesystem - the pertinent file information can be retrieved more robustly by (lstat|stat)ing the files yourself rather than by parsing ls's output. In general, the less parsing you do, the better.
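
        For instance, a sketch of that approach, gathering the same size/date/name fields the parent post was pulling out of ls:

            use File::Find;

            find(sub {
                return unless -f;
                my ($size, $mtime) = (lstat)[7, 9];   # size and mtime straight from the filesystem
                my $when = scalar localtime $mtime;
                print "$size\t$when\t$File::Find::name\n";
            }, '.');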

        Makeshifts last the longest.