A fast way to do this?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: A fast way to do this? by AppleFritter (Vicar) on Jul 13, 2014 at 23:50 UTC
"I was wondering if there is a way, without maybe having to read every line in the file, to quickly see if there exist lines that do not start with # (which would mean that I can process this file further)."

Not really (but may the other monks correct me if I'm wrong). Since you can't know in general where a new line starts without having read all of the preceding one, you cannot be sure that every line starts with a hash mark unless you read them all.

This is for the general case, of course. If the output of your program is further constrained, inferences may be possible; for instance, if you know that every line is precisely 70 characters (including the trailing newline), you could read only the first character of each 70-character record. Whether that'd be much faster in practice is another question, one that only profiling could answer.

Is there any way that the program generating this output can signal to you whether there's data that needs further processing? If you can change that one's implementation, perhaps have it quit with an appropriate exit code, or drop a marker file, or something along those lines.
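The fixed-width idea above can be sketched roughly as follows. This is only an illustration under a strong assumption: every line is exactly `$width` bytes, newline included, so line N starts at byte offset N * `$width`. The helper name `has_data_line_fixed_width` is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper. Assumes every line is exactly $width bytes
# (trailing newline included), so only the first byte of each
# fixed-size record has to be inspected.
sub has_data_line_fixed_width {
    my ($path, $width) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $size = -s $fh;
    for (my $offset = 0; $offset < $size; $offset += $width) {
        seek $fh, $offset, 0 or die "Cannot seek in $path: $!";
        read($fh, my $first, 1) or last;   # stop on EOF or read error
        return 1 if $first ne '#';         # found a line that is not a comment
    }
    return 0;                              # every line starts with '#'
}
```

A call such as `has_data_line_fixed_width('test.in', 70)` would return true as soon as one record does not begin with `#`. Whether the seeks actually beat a plain sequential read depends on the file and the I/O layer, so it is worth measuring.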
Re: A fast way to do this? by davido (Cardinal) on Jul 13, 2014 at 23:52 UTC
How big are the files? How often are you processing such files? How many of them are there? How tight is the time constraint on arriving at a result? How often do they change? In short, what problem are we really solving?

There is no silver bullet when it comes to ascertaining where a given character or pattern is found within a file; one must look at the file's contents to find out. But if we knew more about the problem we're trying to solve, we might be able to come up with sensible and efficient solutions.

Dave
Re: A fast way to do this? by perlfan (Parson) on Jul 14, 2014 at 02:18 UTC
Shell + Perl solution:

    #!/bin/sh
    INFILE=test.in

    # get number of lines not matching "^#" (-v inverts the results)
    non_comment_lines=$(grep -cv "^#" "$INFILE")

    if [ 0 -lt "$non_comment_lines" ]; then
      perl ./process_file.pl < "$INFILE"
    fi

For a Perl-only solution, slurp in the file contents and use a regex:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # slurp mode: read the whole file in one go
    $/ = undef;
    open my $fh, "<", "test.in" or die "Cannot open test.in: $!";
    my $file = <$fh>;

    # /m lets ^ match at the start of every line; [^#] is any character but '#'
    if ($file =~ m/^[^#]/m) {
      print "process me!\n";
    }
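As an alternative to slurping, the file can be read line by line and the scan stopped at the first line that does not start with `#`. The following is a minimal sketch, not the poster's code; the sub name `has_non_comment_line` is invented here, and it takes an already opened filehandle so it can be tried on an in-memory string.

```perl
use strict;
use warnings;

# Returns 1 as soon as a line that does not start with '#' is seen,
# so the rest of the file is never read. Empty lines count as
# non-comment lines, which mirrors what `grep -v '^#'` would report.
sub has_non_comment_line {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        return 1 if $line !~ /^#/;
    }
    return 0;
}

# Small demonstration on an in-memory file
my $sample = "# header\n# more comments\ndata line\n";
open my $demo, '<', \$sample or die "Cannot open in-memory file: $!";
print "process me!\n" if has_non_comment_line($demo);
```

In the worst case (a file consisting only of comments) every line is still read, but memory use stays constant and files with early data lines return almost immediately.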
Re: A fast way to do this? by gurpreetsingh13 (Scribe) on Jul 14, 2014 at 05:08 UTC
Use external grep over the file and use its result. That would be fastest in this case.

    use v5.14;
    my $noOfLines = `grep ^# testFile|wc -l`;
    chomp($noOfLines);
    if ($noOfLines) {
      # <doSomething>
    }
Re^2: A fast way to do this? by Anonymous Monk on Jul 14, 2014 at 07:29 UTC
Thanks a bunch guys for your help!
Re^2: A fast way to do this? by perlfan (Parson) on Jul 14, 2014 at 14:08 UTC
This is sort of the third option, but there is no reason to shell out to grep, let alone to grep and then pipe to wc. Look at grep's "-v" and "-c" options.
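For comparison, the count that `grep -c -v '^#'` produces can also be computed in Perl without starting any external process. This is a sketch under the assumption that a plain line count is all that is needed; `count_non_comment_lines` is a name chosen for the example.

```perl
use strict;
use warnings;

# Counts the lines that do not start with '#', which is the same
# number `grep -c -v '^#'` would print for the same input.
sub count_non_comment_lines {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ unless $line =~ /^#/;
    }
    return $count;
}
```

If only a yes/no answer is required, counting is more work than necessary, since the loop could return at the first non-comment line.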
Re^3: A fast way to do this? by locked_user sundialsvc4 (Abbot) on Jul 14, 2014 at 15:39 UTC
Sometimes a handy approach is to use a command like `grep` to quickly identify files that need to be processed, then to pipe that to `xargs` (possibly using the `-P numberOfProcesses` option) to execute “a very simple command” against each of them. In this case, you are looking for, say, `/^\s*[^#]/` or something like that. (“At start-of-line, zero or more whitespace characters followed by a character that is not a hashmark.”)

The complexities of deciding whether to invoke a command, and against which files, have been pushed out to the shell, which invokes the specified command (with a filename as a parameter) only on those files which match the criteria sought. Of course, the program should not blindly assume that it was invoked under the correct conditions ... it should check ... but even so, this is a very powerful approach that is applicable in a lot of situations.

(After so many years of pretending that a CP/M-era shell was good enough, Microsoft finally came up with PowerShell, which is useful for these things also, although it is not, of course, compatible with anyone else.)
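The selection step can be sketched in Perl as a small filter that prints only the names of files worth processing, one per line, ready to be piped to `xargs`. The sub name `needs_processing` is hypothetical. One detail is an assumption of this sketch rather than something from the post: a pattern of the form "optional whitespace, then one character that is not `#`" can backtrack so that an indented comment still matches, so the final character class here also excludes whitespace.

```perl
use strict;
use warnings;

# True if the file contains at least one line whose first
# non-whitespace character is not '#'. Blank and whitespace-only
# lines are ignored (an assumption of this sketch).
sub needs_processing {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        return 1 if $line =~ /^\s*[^#\s]/;
    }
    return 0;
}

# Print the qualifying file names given on the command line.
print "$_\n" for grep { needs_processing($_) } @ARGV;
```

Saved under a name such as `pick_files.pl` (hypothetical), it could be used along the lines of `perl pick_files.pl *.out | xargs -n 1 -P 4 perl ./process_file.pl`. The same `needs_processing` check could be repeated at the top of the processing script, so that it does not blindly trust how it was invoked.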