raghu_shekar has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a Perl script and have hit a roadblock along the way. I am relatively new to Perl. I have a command which is a combination of many commands joined by pipes; it basically looks for a particular id in a huge file. When I run the command on the command line it works perfectly fine, but when I put it in the script, it cats the whole file instead of printing just the id that I need. I have used backticks, exec, system... nothing seems to work. Any help will be greatly appreciated; I am just not able to overcome this. Just to give a clear idea, here is the code snippet:

grep 'DataDictionary' $file | awk -F'<pciOFACViolation>' '{print $1}' | awk '{print $3}' | awk -F'>' '{print $1}'.....

$file is the file to be searched.
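For reference, here is a minimal sketch (not the original poster's actual script) of one way such a pipeline is typically embedded with backticks; the argument handling is a placeholder, and the steps elided by "....." are left out. The main catch is that qx// interpolates like double quotes, so $file should be interpolated, but the awk field variables must be escaped as \$1 and \$3; otherwise Perl substitutes its own (usually empty) $1 and $3, awk is left with an empty print expression, and whole lines come back out.

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: the file to search is taken from the command line.
my $file = shift @ARGV or die "usage: $0 file\n";

# Backticks interpolate $file, but \$1 and \$3 must stay escaped so that awk,
# not Perl, sees them.
my $output = qx{grep 'DataDictionary' $file | awk -F'<pciOFACViolation>' '{print \$1}' | awk '{print \$3}' | awk -F'>' '{print \$1}'};
print $output;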

Replies are listed 'Best First'.
Re: commands with multiple pipes in perl
by ELISHEVA (Prior) on Mar 17, 2009 at 07:33 UTC

    If that is your command above, you might want to consider a pure Perl implementation. It appears that you are filtering out lines (grep) and then using a series of calls to awk to split the input into fields and subfields. All of this can be done quite easily in four or five lines of Perl (maybe less) using a regular expression and maybe split. A pure Perl implementation is likely to be much faster, as you will only need a single process rather than the four you are currently using in your pipe.

    The following sample code illustrates how grep and awk can be mapped to Perl constructs. It is a lot more verbose than necessary because I've assigned things to variables to make it clearer exactly what is going on. The real production code could easily be mushed down to no more than four lines inside the while loop (see the condensed sketch after the sample code), and possibly even down to one line (print if regex matches) if the splits are replaced by a capturing regular expression:

    while(my $line = <DATA>) {
        #grep 'DataDictionary'
        next unless $line =~ /DataDictionary/;

        #awk -F'<pciOFACViolation>' {print $1}
        my @aFields = split(/<pciOFACViolation>/, $line);
        my $sFieldICareAbout = $aFields[0];   #$1 in awk

        #awk '{print $3}'
        @aFields = split(/\s/, $sFieldICareAbout);
        $sFieldICareAbout = $aFields[2];      #$3 in awk

        #awk -F'>' '{print $1}'
        @aFields = split(/>/, $aFields[2]);
        $sFieldICareAbout = $aFields[0];      #$1 in awk

        print "$sFieldICareAbout\n";
    }
    __DATA__
    *** *** G1>H>I<pciOFACViolation>DataDictionary
    Whan that aprill with his shoures soote
    The droghte of march hath perced to the roote,
    And bathed every veyne in swich licour
    Of which vertu engendred is the flour;
    *** *** G2>H>I<pciOFACViolation>DataDictionary
    Whan zephirus eek with his sweete breeth
    Inspired hath in every holt and heeth
    Tendre croppes, and the yonge sonne
    Hath in the ram his halve cours yronne,
    And smale foweles maken melodye,
    That slepen al the nyght with open ye
    (so priketh hem nature in hir corages);
    *** *** G3>H>I<pciOFACViolation>DataDictionary
    Thanne longen folk to goon on pilgrimages,
    And palmeres for to seken straunge strondes,
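    As a rough illustration of the "mushed down" version mentioned above (a sketch, not code from the original post), the loop body can be condensed to a handful of lines while keeping exactly the same split logic:

    while (my $line = <DATA>) {
        next unless $line =~ /DataDictionary/;               # grep 'DataDictionary'
        my ($first)  = split /<pciOFACViolation>/, $line;    # awk -F'<pciOFACViolation>' '{print $1}'
        my $third    = (split /\s/, $first)[2];              # awk '{print $3}'
        my ($wanted) = split />/, $third;                    # awk -F'>' '{print $1}'
        print "$wanted\n";
    }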

    The one liner (print if regex) depends heavily on the exact format of each line, particularly the placement of "DataDictionary". To give you a feel for its succinctness, here is the one-line code for the above format of DataDictionary lines.

    while(<DATA>) { print "$1\n" if /^\S+\s+\S+\s+([^>]+).*<pciOFACViolation>.*DataDictionary/; }

    If you are interested in this approach, perhaps you could give us a few sample lines containing "DataDictionary"?

    Best, beth

    Update: Added code illustrating mapping of grep and awk to Perl constructs.

    Update: Added more succinct example using one line (print if regex).

      Hi, but the approach you have mentioned here takes a long time, as it searches extensively. When I run the entire command from the command line it gives me the output in seconds, but when I put it in the script it takes so long that I had to end the script: even close to 5 minutes and no output. I also reduced the file size and tried again; it still takes a long time, making me wonder if the script has hung.

        Curious. It shouldn't be doing any more searching than grep would, assuming you are reading in one line at a time. (If you slurped the file in as one long line, that could slow you down a lot.) How did you adapt the above code for your situation? Perhaps if you posted the code, we might have a better idea of why your program is so slow.
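        To make the line-at-a-time versus slurping distinction concrete, here is a small sketch (the file name is just a placeholder, not code from this thread):

        # Line-at-a-time, as the sample code assumes -- memory stays small and
        # each regex only ever sees one line:
        open my $fh, '<', 'huge_file.txt' or die "open: $!";
        while (my $line = <$fh>) {
            next unless $line =~ /DataDictionary/;
            # ... extract the id from $line here ...
        }
        close $fh;

        # Slurping instead pulls the entire file into one long string; running
        # the same per-line logic against that single string is the kind of
        # thing that can make a script look hung on a huge file:
        # my $whole_file = do { local $/; open my $in, '<', 'huge_file.txt' or die $!; <$in> };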

        When I created a dummy file with the data above repeated 10,000 times (equivalent to a 6.6M file), parsing took only 0.71 seconds (wall-clock time). When I upped the size by repeating the data 1,000,000 times (equivalent to a 660M file, more than half a gigabyte), it took 26 seconds.
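        For anyone wanting to reproduce that kind of measurement, a hypothetical setup -- not the script actually used -- might look like this, with the big input built beforehand by repeating the sample data:

        # Build a large test file by repeating sample.txt N times, e.g.:
        #   perl -0777 -ne 'print $_ x 10_000' sample.txt > big_sample.txt
        use strict;
        use warnings;
        use Time::HiRes qw(time);

        my $start = time;
        open my $in, '<', 'big_sample.txt' or die "open: $!";
        while (my $line = <$in>) {
            next unless $line =~ /DataDictionary/;
            print "$1\n"
                if $line =~ /^\S+\s+\S+\s+([^>]+).*<pciOFACViolation>.*DataDictionary/;
        }
        close $in;
        printf STDERR "wall clock: %.2f seconds\n", time - $start;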

        Best, beth

Re: commands with multiple pipes in perl
by moritz (Cardinal) on Mar 17, 2009 at 07:16 UTC
    Please show us what you've tried, and tell us in which way it doesn't work.

    You have to be careful which variables you interpolate, and which you don't:

    my $file = 'stuff';
    system qq[grep 'DataDictionary' $file | awk -F'<pciOFACViolation>' '{print \$1}'|...];
    #                               ^ interpolate                              ^ don't interpolate
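    Also note that system only runs the pipeline and lets it print to stdout; if the goal is to capture the id in a Perl variable, one option (a sketch under the same quoting assumptions, not code from the thread) is to read the command's output through a piped open:

    my $file = 'stuff';
    my $cmd  = qq[grep 'DataDictionary' $file ]
             . qq[| awk -F'<pciOFACViolation>' '{print \$1}' ]
             . qq[| awk '{print \$3}' ]
             . qq[| awk -F'>' '{print \$1}'];
    open my $pipe, '-|', $cmd or die "cannot start pipeline: $!";
    chomp(my @ids = <$pipe>);       # one id per line of pipeline output
    close $pipe or warn "pipeline exited with status $?";
    print "$_\n" for @ids;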

    Update: see a similar discussion we have at the moment.

      The code that I have shown is the exact one that I am trying. It works perfectly fine when I execute it from the command line, but when I put it in the script it just cats the file and displays the which I am storing in an array.
        The code that I have shown is the exact one that I am trying.

        You can't just execute shell code in Perl as if Perl were the shell; you need something like system. So show us the Perl code you've tried, and try the example I wrote.

        and displays the which I am storing in an array.

        This sentence makes no sense to me.
