Re: Changing data in alot of files
by Excalibor (Pilgrim) on Jul 10, 2003 at 17:11 UTC
use Tie::File;

for my $file (@files)
{
    my @file;
    tie @file, 'Tie::File', $file or die $!;
    for my $line ( @file )
    {
        # do stuff on the lines you want;
        # changes to $line are reflected 'immediately' in the file
    }
    untie @file;
}
Definitely a million times better!
best regards,
Update:
While "perl -wnli.bak -e 's///;s///...' *" will make a backup copy of the files processed so far (if it happens to crash), it will scan the whole file, no matter what. From your example that seems what you need, so it's a very good reply (I am voting for it) but I forgot to add that if you know which lines you are changing (headers, or if they come from a template) with Tie::File you can access individual lines using $file5 (the 6th line). What's more, you can easily cut the loop ig you know all changes will be in the first 50 lines of header, or whatever. Tie::File only reads as many lines as needed to get the job done, and it's very fast (not new files, works inside the file itself) and AFAIK completely reliable. You can get it from CPAN (www.cpan.org) (perl -MCPAN -e 'install Tie::File' if you have it configured to go beyond a proxy; try doing perl -MCPAN -e shell first, and write the 'install Tie::File' there) for Perl 5.6.1 (I have it in production code with this release of Perl on GNU/Linux) or if you're using ActivePerl, try using the tool to install compiled modules in Perl (was it called ppm?). Good luck,
--
our $Perl6 is Fantastic;
My NT server is using Perl 5.6.1, so I don't think I can use your example?
Is my example the most efficient way of doing what I need to do?
Most efficient? Not a chance. Doing linear scans of the data like that isn't going to win you any points in the efficiency department. However, it is vastly easier to code than trying to do in-place editing of the file (assuming you need to do this as part of a larger program, so that dragonchild's response wouldn't help you much).
If the data is small, you could slurp each file into a single scalar and then run your s///g's. You may or may not see any performance benefit doing it this way, depending on your hard drive cache, IO buffering, OS implementation, phase of the moon, etc.
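Something along these lines, as an untested sketch of the slurp-and-substitute idea (@files and the patterns are assumptions, not your real data):

for my $file (@files) {
    # read the whole file into one scalar
    open my $in, '<', $file or die "Can't read $file: $!";
    my $text = do { local $/; <$in> };
    close $in;

    # run all the substitutions over the whole file at once
    $text =~ s/aaaaa/FFFFF/gi;
    $text =~ s/bbbbb/EEEEE/gi;

    # write it back in place
    open my $out, '>', $file or die "Can't write $file: $!";
    print $out $text;
    close $out;
}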
Still, Tie::File probably satisfies both efficiency and ease of development. I've never used the module myself, but I don't know of any reason why it shouldn't work on NT.
---- I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
-- Schemer
Note: All code is untested, unless otherwise stated
Re: Changing data in alot of files
by dragonchild (Archbishop) on Jul 10, 2003 at 17:04 UTC
Use the commandline options. Something along the lines of:
perl -i.bak -lpe 's/aaaaa/FFFFF/gi;s/bbbbb/EEEEE/gi; ...;'
That will do the changing in-place (the -i.bak switch keeps a backup copy of each original). Look at your Camel book for further reference.
------ We are the carpenters and bricklayers of the Information Age. Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement. Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
Re: Changing data in alot of files
by Zaxo (Archbishop) on Jul 10, 2003 at 19:01 UTC
This isn't fancy, but it will work:
for (@files) {
    # read from the original, write changed lines to a temporary "$_!" file
    open OLDDATA, "< $_"
        or warn 'File does not open: ', $! and next;
    open NEWDATA, "> $_!"
        or warn 'File not open: ', $! and next;

    while (<OLDDATA>) {
        s/aaaaa/FFFFFFF/gi;
        s/bbbbb/EEEEEEE/gi;
        s/cccccc/GGGGGGG/gi;
        print NEWDATA $_;
    }

    close NEWDATA or warn $! and next;
    close OLDDATA or warn $! and next;

    # swap the finished temporary file into place
    rename "$_!", $_;
}
This version is not much different from yours. It reads and processes the lines one at a time, instead of slurping to an array of lines. That saves memory by taking a smaller amount and reusing it, and efficient use of memory often increases speed in Perl. Doing it that way requires a second file handle to write the lines to, so I opened a temporary file called "$filename!" to hold them. Once a file is done, rename does the file replacement very efficiently.
Make sure that "$filename!" doesn't already exist. You probably won't have to worry about that if these files have some systematic naming, but if it's a problem the -e file test will help.
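For instance (untested), a check like this before opening the temporary file would do:

# skip a file if its temporary name is already taken
if ( -e "$_!" ) {
    warn "Temporary file $_! already exists, skipping $_\n";
    next;
}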
Update: The DATA filehandle is special to inlined data in Perl. I changed the name.
After Compline, Zaxo
Re: Changing data in alot of files
by BrowserUk (Patriarch) on Jul 10, 2003 at 18:41 UTC
Is this the best way to do that?
There is no single answer to your question. There are so many definitions of 'best'. Here are a few possibilities.
- Easiest to code.
- Quickest to run.
- Most reliable.
- Easiest to maintain.
- Most flexible.
And the answer to each of these definitions of 'best' will depend on many other factors. Some examples:
- How many files?
- How big are the files?
- How many changes?
- How often will the changes need to be made?
- How reliable does the process need to be?
If the process gets interrupted by system failure or other unforeseen eventuality, do you need to know which files were processed and which weren't? If some files were re-processed, would this be benign repetition?
- How fast does the process need to be?
If you provide answers to the appropriate subset of these questions for your application's needs, then you may get answers that are truly applicable to you.
Even if your goal is pure speed, the best solution for 1000 x 20k files is likely to be completely different to that for 100 x 200k files or 10 x 2MB files. Slurping to an array of lines is rarely, if ever, as quick as slurping to a scalar, but whether slurping to a scalar is a viable option depends very much on the criteria for your search and replacement requirements.
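If you would rather measure than guess for your own mix of file sizes, a rough (untested) Benchmark sketch along these lines would give you a comparison; the file name is just a placeholder:

use Benchmark qw(timethese);

my $sample = 'some_page.html';    # placeholder: one representative file

timethese( 1000, {
    'array of lines' => sub {
        open my $fh, '<', $sample or die $!;
        my @lines = <$fh>;
        close $fh;
    },
    'single scalar' => sub {
        open my $fh, '<', $sample or die $!;
        my $text = do { local $/; <$fh> };
        close $fh;
    },
} );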
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
How many files? Over 50,000 files.
How big are the files? Web page files (.html and ColdFusion), so they are not big.
How many changes? ...about 20,000 changes.
How often will the changes need to be made? ...just one time, and the script takes 20 minutes to run.
How reliable does the process need to be? ..very reliable.
If the process gets interrupted by system failure or other unforeseen eventuality, do you need to know which files were processed and which weren't? If some files were re-processed, would this be benign repetition? ..files could be reprocessed, and this will be run during off hours.
How fast does the process need to be? ...speed shouldn't be too slow but doesn't have to be fast..the main thing is what I have now does the job and has SOME efficiency.
Also what I have now is quick and easy to maintain.
"20 minutes" for a "one-time run during off hours" of "20,000 changes" across "50,000 files", and "quick and easy to maintain".
Seems to me that you have an existing, working solution that satisfies your needs, which begs the question: "Why are you asking your question"? :)
The only criterion you mention that I don't see being satisfied by your posted code is the reliability. But the expedient of copying the files before modification and copying them back once the changes have been completed and verified is so simple and so effective that I would be reluctant to move to a more 'sophisticated' solution. If you aren't keeping your files in a source management DB (CVS or similar), then I would strongly recommend you start doing so, especially with that number of sources.
If you do use source control software, for this kind of change I would do a mass extraction, run the script, and then do a mass update once I had suitably checkpointed, rather than trying to do the extractions and updates file by file as part of the script, but that's a personal thing.
Taking another quick look at your code, and given the relatively small size of your files, I would probably slurp to a scalar rather than an array, as that lets each regex process the whole file in one pass, which might speed the process up a little. You would then need to be slightly more careful with the construction of your regexes, and investigate the /s and /m modifiers as well as the differences between ^ and \A, and between $ and \Z.
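For example (illustrative patterns only, not taken from your data, and assuming $text holds the slurped file):

# with /m, ^ and $ match at the start and end of each line in the slurped file
$text =~ s/^aaaaa$/FFFFF/gim;

# \A always anchors to the very start of the whole string, even under /m
$text =~ s/\A<!-- generated -->\n//;

# with /s, . also matches newlines, so one pattern can span several lines
$text =~ s/<!--.*?-->//gs;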
If I were really interested in wringing performance out of the process, I might consider using one thread to slurp the files into scalars, a second thread to run the regexes on them, and a third thread to write them back to the file system. However, given the current state of play with memory leaks from threaded apps, and the not negligible increase in complexity this would add, I couldn't advise it unless the need for speed was desperate, which it clearly isn't in this case.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller