Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am opening a file and counting pattern matches. Is this the most efficient way to do it?
open(DATA, "$errlog") || die "Can not open: $!\n"; my @dat = (<DATA>); close(DATA); open(DATA, ">>$errlog") || die "NO GO: $!\n"; foreach (@dat) { if($_ =~ /patternmatchinghere/gi) { $counter++; } } close(DATA);

Re: Opening file and checking for data
by tcf22 (Priest) on Jul 03, 2003 at 13:24 UTC
    Grep would probably be a better way to do this.
    my $count = grep /match/, <DATA>;
    print $count;

    __DATA__
    match123123
    match3123
    nomat
    not34234
    match 4232434
    This outputs 3.
    Or my golfing solution (a little shorter):
    print scalar(grep /match/, <DATA>);

    Update: Fixed my grammar.
Re: Opening file and checking for data
by gellyfish (Monsignor) on Jul 03, 2003 at 13:33 UTC

    Well, of course, benchmarking may prove me wrong, but I would have thought that slurping the contents into an array and then iterating over the array is going to be less efficient than simply reading the file line by line, like:

    ...
    while (<DATA>) {
        ...
    }
    You probably also want to avoid using DATA as a filehandle, as this is a predefined handle, set up when Perl initializes, that points to the stuff after an __END__ or __DATA__ at the end of the program. It doesn't break anything, but it might be confusing to someone reading the program later.
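    A minimal sketch putting both points together (line-by-line reading, with a lexical filehandle in place of DATA; $errlog and the pattern are carried over from the original post):

    my $counter = 0;
    open(my $fh, '<', $errlog) or die "Can not open: $!\n";
    while (<$fh>) {
        $counter++ if /patternmatchinghere/i;   # count lines that match
    }
    close($fh);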

    /J\
    

      I haven't done any benchmarking either, but I've seen benchmarks of this in the past. Slurping is by far faster than processing line-by-line. The reason is that doing I/O in a single operation is faster than doing it a bit at a time, since you don't have to worry about things like resetting the drive head to the correct position.

      Naturally, you have to worry about memory limitations, so there is really only a limited range of cases where slurping is worth it. If your file is small, you won't see much of a speed gain. If it's too large to fit in memory, you'll end up swapping to the hard disk and will thus lose any benefits from slurping.

      ----
      I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
      -- Schemer

      Note: All code is untested, unless otherwise stated

        Perl's I/O is buffered, so it does one I/O for every disk block regardless of which method you use.
        #!/usr/bin/perl
        if ($ARGV[0] eq 'line') {
            print "Line-at-a-time\n";
            while (<STDIN>) {
                print if /perl/;
            }
        }
        else {
            print "All at once\n";
            my @arr = (<STDIN>);
            foreach (@arr) {
                print if /perl/;
            }
        }
        On my system, the block size is 4096 bytes. On an 8K file with 128 lines, we see:
        $ strace -e read /tmp/t29 line </tmp/t29.8192 >/dev/null
        ...
        read(0, "This is a line that contains the"..., 4096) = 4096
        read(0, "This is a line that contains the"..., 4096) = 4096
        read(0, "", 4096)                       = 0
        
        $ strace -e read /tmp/t29 slurp </tmp/t29.8192 >/dev/null
        ...
        read(0, "This is a line that contains the"..., 4096) = 4096
        read(0, "This is a line that contains the"..., 4096) = 4096
        read(0, "", 4096)                       = 0
        
        Still, each call to the diamond operator takes some time, so slurping is probably still faster, but not because of I/O.

        And of course the caching algorithms of your hard drive and OS will have a big influence as well. If the file is small enough to be read in one go, slurping will not add any big speed benefit, but it will increase the memory load.

        And if the file is rather large, it may crowd out other items in your cache and slow down other programs: TANSTAAFL, as Heinlein said.

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Opening file and checking for data
by halley (Prior) on Jul 03, 2003 at 13:58 UTC
    Nobody so far has mentioned the /g modifier on your regular expression check. If you're just looking for a yes/no on each line, then you can drop the /g. If you want to count multiple occurrences in the same line (such as the 'l's in 'hello wally'), then you will need to do something like capture the results of m/.../g in an array and count its elements.
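    A minimal illustration of that per-line counting, using the example string above:

    my @hits = ('hello wally' =~ /l/g);  # /g in list context returns every match
    print scalar @hits;                  # prints 4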

    --
    [ e d @ h a l l e y . c c ]

Re: Opening file and checking for data
by Zaxo (Archbishop) on Jul 03, 2003 at 14:46 UTC

    To count multiple matches in a line, you can use the fact that m//g returns all the matches in list context. That means you can just say $counter += () = /patternmatchinghere/gi; to do the counting.

    I doubt if your error log is appended to your script after __END__ or __DATA__, so you should avoid the *DATA handle.

    My rewrite:

    my $counter = 0;
    open LOG, "< $errlog" or die 'Can not open: ', $!;
    $counter += () = /patternmatchinghere/gi while <LOG>;
    close LOG;
    I've omitted opening $errlog to append, since I don't see what it does for you in this snippet.

    Update: changed for to while to avoid slurping.

    After Compline,
    Zaxo

      Thanks for all the replies.
Re: Opening file and checking for data
by DBX (Pilgrim) on Jul 03, 2003 at 14:07 UTC
    If you are using a pattern in your regular expression that will not change on each iteration of your loop, consider adding the /o modifier, like so:
    if($_ =~ /patternmatchinghere/gio)
    This tells Perl to compile the regular expression only once, instead of recompiling it on every loop iteration. On a large amount of data, this can speed up your code significantly.
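    Note that /o only makes a difference when the pattern interpolates a variable; a literal pattern such as the one above is compiled just once anyway. A minimal sketch of the case where /o matters ($pattern and $fh are illustrative names):

    my $pattern = 'patternmatchinghere';
    while (<$fh>) {
        $counter++ if /$pattern/io;  # /o: interpolate and compile $pattern only once
    }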

      /o is dead, long live qr//!
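
      A minimal sketch of the qr// equivalent, which compiles the pattern once up front ($re and $fh are illustrative names):

      my $re = qr/patternmatchinghere/i;  # compile once, reuse everywhere
      while (<$fh>) {
          $counter++ if /$re/;
      }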

      ----
      I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
      -- Schemer

      Note: All code is untested, unless otherwise stated

        Good point. I honestly forget about qr// most of the time, but this was a good reminder.
Re: Opening file and checking for data
by hmerrill (Friar) on Jul 03, 2003 at 13:40 UTC
    I can't see anything wrong with your code, and tcf22's example looks fine too. Although I've never used it myself, the Benchmark module might be helpful for comparing which of a few different approaches is most efficient. Do
    perldoc Benchmark
    at a command prompt to see how to use it.

    HTH.
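
    A minimal sketch of a Benchmark comparison of the looping and grep approaches (the sample data and label names are made up for illustration):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my @lines = ("match foo\n", "no hit here\n") x 500;

    cmpthese(-2, {    # run each sub for at least 2 CPU seconds
        loop => sub { my $c = 0; for (@lines) { $c++ if /match/ } },
        grep => sub { my $c = grep /match/, @lines },
    });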
      In the Perl Cookbook, recipe 3.9, 'High-Resolution Timers', looks pretty useful for timing code. It uses the Time::HiRes module, which comes standard with Perl 5.8:
      use Time::HiRes qw(gettimeofday);
      $t0 = gettimeofday;
      ### your code here ###
      $t1 = gettimeofday;
      $elapsed = $t1 - $t0;
      # $elapsed is a floating point value, representing the
      # number of seconds between $t0 and $t1
      then do the same thing for the *other* way, and see which one takes the least amount of time.

        If you want to benchmark your code, use Perl's Benchmark module. There's no need to reinvent it. Super Search should find numerous examples of how to use it.

        Update: I see you have already mentioned Benchmark. In future, I must read more carefully before replying. Bad me.

Re: Opening file and checking for data
by msemtd (Scribe) on Jul 04, 2003 at 13:20 UTC
    I'm surprised that nobody has yet mentioned not using Perl at all! What you want to do is possible with GNU grep from the command line...
    grep --count thepattern whateverfile
    Of course, this may not meet your requirements - notably, it returns a count of matching lines per file rather than of individual pattern matches. YMMV
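
    If you do want individual matches rather than matching lines, one way (assuming GNU grep, whose -o flag prints each match on its own line) is to pipe through wc:

    grep -o thepattern whateverfile | wc -l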