Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I have an output file of a program, where I can have 2 possibilities:
If I do not have any results that can be manipulated, all the lines of the output file will start with #; otherwise, there will be some lines that do not start with #.
I was wondering if there is a way, without maybe having to read every line in the file, to quickly see if there exist lines that do not start with # (which would mean that I can process this file further).

Replies are listed 'Best First'.
Re: A fast way to do this?
by AppleFritter (Vicar) on Jul 13, 2014 at 23:50 UTC

    I was wondering if there is a way, without maybe having to read every line in the file, to quickly see if there exist lines that do not start with # (which would mean that I can process this file further).

    Not really (but may the other monks correct me if I'm wrong). Since you can't know in general where a new line starts without having read all of the preceding one, you cannot know whether any line starts with a hash mark unless you read them all.
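In practice the scan can at least stop at the first line that does not start with #, so the whole file is only read when every line is a comment. A minimal sketch of that early exit follows; the sub name `has_data_line` is made up for illustration and does not come from the thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return 1 as soon as one line does not start with '#', else 0.
# Reading stops at that line, so a file with data near the top
# is answered after only a few reads.
sub has_data_line {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        if ($line !~ /^#/) {
            close $fh;
            return 1;
        }
    }
    close $fh;
    return 0;
}

# Usage: perl check.pl test.in   (the script name is hypothetical)
if (@ARGV) {
    print has_data_line($ARGV[0]) ? "process me!\n" : "nothing to process\n";
}
```

Note that a blank line counts as "not starting with #" here, which matches what `grep -v "^#"` would report.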

    This is for the general case, of course. If the output of your program is further constrained, inferences may be possible; for instance, if you know that every line is precisely 70 characters (including the trailing newline), you could read every 70th character only. Whether that'd be much faster in practice is another question, one that only profiling could answer.
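The fixed-width idea above could look roughly like this. It is a sketch under the stated assumption that every line, newline included, is exactly `$reclen` bytes; the sub name `has_data_line_fixed` is hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Assumption: every line (including its newline) is exactly $reclen bytes,
# so line N starts at byte offset N * $reclen.
sub has_data_line_fixed {
    my ($path, $reclen) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $size = -s $fh;
    for (my $pos = 0; $pos < $size; $pos += $reclen) {
        seek $fh, $pos, SEEK_SET or die "Cannot seek in $path: $!";
        my $got = read $fh, my $byte, 1;   # first byte of this line only
        die "Cannot read $path: $!" unless defined $got;
        last unless $got;                  # unexpected end of file
        if ($byte ne '#') {
            close $fh;
            return 1;
        }
    }
    close $fh;
    return 0;
}

# Usage: perl check_fixed.pl test.in 70   (the script name is hypothetical)
if (@ARGV == 2) {
    print has_data_line_fixed(@ARGV) ? "process me!\n" : "nothing to process\n";
}
```

Since the operating system still fetches whole blocks from disk, this may well be no faster than a plain sequential read; as noted above, only profiling can tell.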

    Is there any way that the program generating this output can signal to you whether there's data that needs further processing? If you can change that one's implementation, perhaps have it quit with an appropriate exit code, or drop a marker file, or something along those lines.
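If the generating program can be changed, the exit-code idea might be consumed like this. Everything specific here is an assumption for illustration: the exit code 3 for "no usable results" and the sub name `run_generator` are invented, and the generator command is whatever you pass on the command line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical convention (not from the thread): the generator exits 0
# when it wrote usable results and 3 when every line is a comment.
my $NO_RESULTS = 3;

# Run a command without going through a shell and return its exit code.
sub run_generator {
    my (@cmd) = @_;
    system(@cmd);
    die "Could not run $cmd[0]: $!" if $? == -1;
    die "$cmd[0] died from signal " . ($? & 127) if $? & 127;
    return $? >> 8;
}

# Usage: perl run.pl ./generator args...   (both names are hypothetical)
if (@ARGV) {
    my $code = run_generator(@ARGV);
    if    ($code == 0)           { print "process me!\n" }
    elsif ($code == $NO_RESULTS) { print "nothing to process\n" }
    else                         { die "generator failed with exit code $code\n" }
}
```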

Re: A fast way to do this?
by davido (Cardinal) on Jul 13, 2014 at 23:52 UTC

    How big are the files? How often are you processing such files? How many of them are there? How tight is the time constraint on arriving at a result? How often do they change?

    In short, what problem are we really solving?

    There is no silver bullet when it comes to ascertaining where a given character or pattern is found within a file; one must look at the file's contents to find out. But if we knew more about the problem we're trying to solve, we might be able to come up with sensible and efficient solutions.


    Dave

Re: A fast way to do this?
by perlfan (Parson) on Jul 14, 2014 at 02:18 UTC
    Shell + Perl solution:
        #!/bin/sh
        INFILE=test.in

        # get number of lines not matching "^#" (-v inverts the match)
        non_comment_lines=$(grep -cv "^#" "$INFILE")
        if [ 0 -lt "$non_comment_lines" ]; then
            perl ./process_file.pl < "$INFILE"
        fi
    For a Perl-only solution, slurp in the file contents and use a regex:
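        #!/usr/bin/env perl
        use strict;
        use warnings;

        # slurp the whole file into one string
        local $/ = undef;
        open my $fh, "<", "test.in" or die "Cannot open test.in: $!";
        my $file = <$fh>;
        close $fh;

        # /m lets ^ match at the start of every line; [^#] then requires
        # a first character other than '#' (a blank line counts too)
        if ($file =~ m/^[^#]/m) {
            print "process me!\n";
        }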
Re: A fast way to do this?
by gurpreetsingh13 (Scribe) on Jul 14, 2014 at 05:08 UTC
    Use external grep over the file. That would be fastest in this case and use its result.
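        use v5.14;

        # -v inverts the match, so only lines that do not start with # are counted
        my $noOfLines = `grep -v '^#' testFile | wc -l`;
        chomp($noOfLines);
        # compare numerically: some wc implementations pad the count with spaces
        if ($noOfLines > 0) {
            # <doSomething>
        }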
      Thanks a bunch guys for your help!
      This is sort of the third option, but there is no reason to shell out to grep - let alone grep then pipe to wc. Look at grep's "-v" and "-c" options.
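For reference, `-c` makes grep print a count of selected lines instead of the lines themselves, and `-v` selects the lines that do not match, so the pipe to `wc` is unnecessary. If you do call grep from Perl anyway, a list-form pipe open avoids the shell entirely. This is only a sketch; the sub name `count_non_comment_lines` is made up, and it assumes a `grep` binary on the PATH.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Count the lines of $path that do not start with '#' using grep -c -v.
# The list form of open runs grep directly, with no shell in between.
sub count_non_comment_lines {
    my ($path) = @_;
    open my $grep, '-|', 'grep', '-c', '-v', '^#', $path
        or die "Cannot run grep: $!";
    my $count = <$grep>;
    # grep exits with status 1 when the count is 0, so close may return
    # false even though nothing went wrong; its result is ignored here.
    close $grep;
    die "grep produced no output for $path" unless defined $count;
    chomp $count;
    return $count;
}

# Usage: perl count.pl test.in   (the script name is hypothetical)
if (@ARGV) {
    print "process me!\n" if count_non_comment_lines($ARGV[0]) > 0;
}
```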

        Sometimes a handy approach is to use a command like grep to quickly identify files that need to be processed, then to pipe that to xargs (possibly using the -P numberOfProcesses option) to execute “a very simple command” against each of them. In this case, you are looking for, say, /^\s*[^#\s]/ or something like that. (“At start-of-line, zero or more whitespace characters followed by a non-whitespace character that is not a hashmark.”)

        The complexities of deciding whether to invoke a command, and against which files, have been pushed out to the shell, which invokes the specified command (with a filename as a parameter) only on those files which match the criteria sought. Of course, the program should not blindly assume that it was invoked under the correct conditions ... it should check ... but even so, this is a very powerful approach that is applicable in a lot of situations.
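The same selection step can also be done on the Perl side, which doubles as the "it should check" safeguard when the script is handed files by xargs. A rough sketch, with hypothetical names (`has_data_line`, `files_to_process`, `select_files.pl`):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# True if $path has at least one line that does not start with '#'.
sub has_data_line {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        return 1 if $line !~ /^#/;   # $fh is closed when it goes out of scope
    }
    return 0;
}

# Keep only the files that are worth handing to the real processing step.
sub files_to_process {
    my (@paths) = @_;
    return grep { has_data_line($_) } @paths;
}

# Usage: perl select_files.pl out1.txt out2.txt ...
# Prints the names of the files that contain data, one per line.
print "$_\n" for files_to_process(@ARGV);
```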

        (After so many years of pretending that a CP/M-era shell was good enough, Microsoft finally came up with PowerShell, which is useful for these things also. Although it is not, of course, compatible with anyone else.)