I was wondering if there is a way, without having to read every line in the file, to quickly see if there are any lines that do not start with # (which would mean that I can process this file further).
Not really (but may the other monks correct me if I'm wrong). Since you can't know in general where a new line starts without having read all of the preceding ones, you cannot know whether any line starts with a hash mark unless you read them all.
This is for the general case, of course. If the output of your program is further constrained, inferences may be possible; for instance, if you know that every line is precisely 70 characters long (including the trailing newline), you could seek to each multiple of 70 and read only the first character of each line. Whether that would be much faster in practice is another question, one that only profiling could answer.
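To make the fixed-length case concrete, here is a minimal Perl sketch (the file name test.in and the 70-byte record length are assumptions for the example only): it seeks to the start of each record and reads just its first byte, stopping as soon as it sees a non-comment line.
#!/usr/bin/env perl
# Sketch only: assumes every line of test.in is exactly 70 bytes,
# trailing newline included, so each line starts at a multiple of 70.
use strict;
use warnings;

my $record_length = 70;
open my $fh, "<", "test.in" or die "Cannot open test.in: $!";
my $found_data = 0;
my $offset     = 0;
while (1) {
    seek $fh, $offset, 0 or last;            # jump to the start of a line
    read $fh, my $first_char, 1 or last;     # read only its first byte (returns 0 at EOF)
    if ($first_char ne '#') {
        $found_data = 1;
        last;
    }
    $offset += $record_length;
}
close $fh;
print "process me!\n" if $found_data;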
Is there any way that the program generating this output can signal to you whether there's data that needs further processing? If you can change that program's implementation, perhaps have it quit with an appropriate exit code, drop a marker file, or something along those lines.
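As a rough illustration of the exit-code idea (all of the program and file names here are made up), the consuming side might look something like this:
#!/usr/bin/env perl
# Hypothetical consumer: assumes generate_output.pl (a made-up name)
# exits 0 only when it actually wrote non-comment data to test.in.
use strict;
use warnings;

system("./generate_output.pl > test.in");
my $exit_code = $? >> 8;                  # the child's exit status

if ($exit_code == 0) {
    system("perl ./process_file.pl < test.in");
}
else {
    print "nothing to process (generator exited with $exit_code)\n";
}

# The marker-file variant would be just as simple:
# system("perl ./process_file.pl < test.in") if -e "test.in.has_data";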
How big are the files? How often are you processing such files? How many of them are there? How tight is the time constraint on arriving at a result? How often do they change?
In short, what problem are we really solving?
There is no silver bullet when it comes to ascertaining where a given character or pattern is found within a file; one must look at the file's contents to find out. But if we knew more about the problem we're trying to solve, we might be able to come up with sensible and efficient solutions.
#!/bin/sh
INFILE=test.in
# count the lines not matching "^#" (-v inverts the match, -c counts)
non_comment_lines=$(grep -cv "^#" "$INFILE")
if [ "$non_comment_lines" -gt 0 ]; then
    perl ./process_file.pl < "$INFILE"
fi
For a Perl-only solution, slurp in the file contents and use a regex:
#!/usr/bin/env perl
use strict;
use warnings;
$/ = undef;                     # slurp mode: read the whole file at once
open my $fh, "<", "test.in" or die "Cannot open test.in: $!";
my $file = <$fh>;
if ($file =~ m/^[^#]/m) {       # /m: ^ matches at the start of every line
    print "process me!\n";
}
Use an external grep over the file and use its result; that would be fastest in this case.
use v5.14;
my $noOfLines = `grep '^#' testFile | wc -l`;
chomp($noOfLines);
if ($noOfLines) {
    # <doSomething>
}
Thanks a bunch guys for your help!
This is essentially the third option again, but there is no reason to shell out to grep - let alone to run grep and then pipe it to wc. Look at grep's "-v" and "-c" options.
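And if you stay in Perl, no external command is needed at all. A minimal sketch along those lines (the file name simply mirrors the snippet above), which also stops at the first qualifying line rather than counting them all:
#!/usr/bin/env perl
# Pure-Perl check, no external grep or wc; the file name is just an example.
use strict;
use warnings;

open my $fh, "<", "testFile" or die "Cannot open testFile: $!";
my $needs_processing = 0;
while (my $line = <$fh>) {
    if ($line !~ /^#/) {
        $needs_processing = 1;
        last;               # stop at the first non-comment line
    }
}
close $fh;

if ($needs_processing) {
    # <doSomething>
}
The early exit matters for large files: grep -c and wc -l both have to scan the whole file even when the very first line already answers the question.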
Sometimes a handy approach is to use a command like grep to quickly identify files that need to be processed, then to pipe that to xargs (possibly using the -P numberOfProcesses option) to execute "a very simple command" against each of them. In this case, you are looking for, say, /^\s*[^#]/ or something like that. ("At start-of-line, zero or more whitespace characters followed by a character that is not a hash mark.")
The complexities of deciding whether to invoke a command, and against which files, have been pushed out to the shell, which invokes the specified command (with a filename as a parameter) only on those files which match the criteria sought. Of course, the program should not blindly assume that it was invoked under the correct conditions ... it should check ... but even so, this is a very powerful approach that is applicable in a lot of situations.
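As a rough sketch of that check (the script name comes from the snippets above; everything else is illustrative only), the processing program might re-verify its input like this:
#!/usr/bin/env perl
# Hypothetical process_file.pl: even though the grep/xargs layer should
# only hand it suitable files, it re-checks its input before doing real work.
use strict;
use warnings;

my @lines = <>;                                # file names from xargs, or STDIN
my $has_data = grep { /^\s*[^#]/ } @lines;     # Perl's built-in grep, not the external one

die "nothing to process: input is all comments\n" unless $has_data;

for my $line (@lines) {
    next if $line =~ /^\s*(#|$)/;              # skip comment and blank lines
    print "processing: $line";                 # stand-in for the real work
}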
(After so many years of pretending that a CP/M-era shell was good enough, Microsoft finally came up with PowerShell, which is also useful for these things. Although it is not, of course, compatible with anything else.)