in reply to Re^6: Split tab-separated file into separate files, based on column name (open on demand) (updated)
in thread Split tab-separated file into separate files, based on column name

> "Sometimes Perl is not the best tool for the job."

OK, this is a "Jein" (yes and no) situation.

Perl is certainly often not the best tool.

But when it comes to sed and awk, that's hard to believe, because Larry meticulously copied all their features.

I bet I could easily translate the given awk script one-to-one into Perl by encapsulating the open-on-demand logic in a short sub.
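A minimal sketch of what such an open-on-demand sub could look like. The names fh_for and close_all are made up for illustration, and using the given path directly as the output file name is an assumption, not something taken from the awk script.

```perl
use strict;
use warnings;

my %handle_for;    # cache: output path => already opened filehandle

# Return the write handle for $path, opening the file on first use only.
sub fh_for {
    my ($path) = @_;
    return $handle_for{$path} //= do {
        open my $fh, '>', $path or die "Cannot open '$path': $!";
        $fh;
    };
}

# Close every cached handle so buffered output reaches the disk.
sub close_all {
    close $_ or die "close failed: $!" for values %handle_for;
    %handle_for = ();
}
```

A caller would then write something like print { fh_for($column) } "$value\n"; for each field and call close_all() at the end.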

Just look at perlvar, perlrun and perltrap and at all the details they give concerning awk. Now, as for the startup argument for short data, where overhead counts ...

... startup isn't the issue anymore that it was 25 years ago.

To make it matter, we would need to start a script over and over again. The realistic approach in that case is to write a persistent service, which doesn't even need to start up.

We are not talking about heavy apps like perltidy which may need a second to initialize.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Re^8: Split tab-separated file into separate files, based on column name (open on demand)
by haukex (Archbishop) on Aug 28, 2020 at 17:00 UTC
    Perl is certainly often not the best tool. But if it comes to sed and awk it's hard to believe

    Yes, this was my point.

    I bet, I could easily translate this given awk script in a one2one fashion to Perl, by encapsulating the open on demand into a short sub.

    You don't need to, Larry did that already :-) a2p was part of the Perl core until 5.20, now it lives on CPAN.

    $ a2p 11121118.awk
    #!/usr/bin/perl
    eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
        if $running_under_some_shell;
                # this emulates #! processing on NIH machines.
                # (remove #! line above if indigestible)

    eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_0-9]+=)(.*)/ && shift;
                # process any FOO=bar switches

    $FS = ' ';        # set field separator
    $, = ' ';         # set output field separator
    $\ = "\n";        # set output record separator

    $FS = "\t";

    line: while (<>) {
        chomp;    # strip record separator
        @Fld = split($FS, $_, -1);
        if (($.-$FNRbase) == 1) {
            @Fields = split($FS, '', -1);    # clear fields array
            for ($i = 1; $i <= ($#Fld+1); $i++) {
                $Fields[($i)-1] = $Fld[$i];
            }
            next line;
        }
        for ($i = 1; $i <= ($#Fld+1); $i++) {
            &Pick('>', $Fields[($i)-1]) &&
                (print $fh $Fld[$i]);
        }
    }
    continue {
        $FNRbase = $. if eof;
    }

    sub Pick {
        local($mode,$name,$pipe) = @_;
        $fh = $name;
        open($name,$mode.$name.$pipe) unless $opened{$name}++;
    }

    Unfortunately, there's apparently a bug in the translator, and the above script needs a s/\$Fld\[\$i\K\]/-1]/g to fix it.
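To illustrate what that substitution does, here is a small sketch that applies it to a string of generated code. The wrapper name fix_fld_index is made up for this example; only the regex itself comes from the post above.

```perl
use strict;
use warnings;

# Rewrite every "$Fld[$i]" in a piece of a2p output to "$Fld[$i-1]".
sub fix_fld_index {
    my ($code) = @_;
    # \K keeps the matched prefix '$Fld[$i', so only the closing "]" is replaced.
    $code =~ s/\$Fld\[\$i\K\]/-1]/g;
    return $code;
}
```

Applied to the whole generated script, the same substitution could presumably be run in place with perl -pi -e and the regex quoted above.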

      Unfortunately, there's apparently a bug in the translator, and the above script needs a s/\$Fld\[\$i\K\]/-1]/g to fix it.
      Patches are welcome!
        Patches are welcome!

        If I had the time at the moment I would look into it more deeply :-/ Though I think my approach to "fixing" the issue would probably be to revert to the older $[ = 1; behavior using Array::Base, instead of adjusting all the array indices...

      I know and I didn't mention a2p on purpose ;p

      As you can see it's producing Perl 4 code.

      I'd implement Pick() differently and this whole script is twice as long as needed.

      Cheers Rolf

        Did you try the code?

        That's how I discovered the issue... ;-P

        I think the issue is that awk's arrays are not zero based.

        I checked, and older versions of a2p did indeed set $[ = 1; though, as far as I can tell, with other strange bugs in the output (at least on my system, the output is strangely chopped up, e.g. "next linine;" or "}tinue {"). After 5.10, a2p dropped the $[ assignment and instead adjusts array indices in some places but not in others, e.g. for ($i = 1; $i <= $#Fld; $i++) { $Fields[$i] = $Fld[$i]; } became for ($i = 1; $i <= ($#Fld+1); $i++) { $Fields[($i)-1] = $Fld[$i]; }.

        Awk does not have numerically-indexed arrays at all. There is a convention for using digit strings to emulate numeric indexing, and like Perl, Awk will convert numbers to digit strings upon demand, but Awk arrays are Perl hashes.
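A rough sketch of that point in Perl terms: emulating awk's split() by filling a hash keyed by the digit strings "1", "2", ... instead of a zero-based array. The name awk_split and the literal (non-regex) separator are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Split $string on a literal separator and return a hash reference whose
# keys are "1" .. "n", the way an awk array filled by split() is keyed.
sub awk_split {
    my ($string, $sep) = @_;
    my @parts = split /\Q$sep\E/, $string, -1;   # -1 keeps trailing empty fields
    my %array;
    @array{ 1 .. @parts } = @parts;              # keys start at 1, there is no key 0
    return \%array;
}
```

With such a structure $fields->{1} is the first field and $fields->{0} simply does not exist, so no index adjustment is needed.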

Re^8: Split tab-separated file into separate files, based on column name (open on demand)
by jcb (Parson) on Aug 28, 2020 at 23:38 UTC
    Larry meticulously copied all features

    Not quite: (quoting perlvar) "Remember: the value of $/ is a string, not a regex. awk has to be better for something."
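One common way around that limitation, as a hedged sketch: read the whole input and split it on a regex, which roughly mimics a regex record separator as accepted by awk implementations such as gawk. The name read_records is made up, and holding the whole input in memory is an assumption that will not suit large files.

```perl
use strict;
use warnings;

# Return the records of $fh, using the regex $sep_re as the record separator.
sub read_records {
    my ($fh, $sep_re) = @_;
    local $/;                        # undef $/ means "slurp the whole file"
    my $data = <$fh>;
    return () unless defined $data;
    return split $sep_re, $data;     # the separators themselves are discarded
}
```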

    There is also a broader (information-theoretic?) issue where Awk can, in some cases, be more concise because it is less powerful than Perl.

    I could easily translate this given awk script in a one2one fashion to Perl, by encapsulating the open on demand into a short sub.

    You probably could, but the Awk script had one other feature that might be some extra code in Perl: Awk's FNR is reset at the beginning of each input file, so that script will correctly process multiple input files given on the command line, extracting the header from each file.
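For what it's worth, the extra code is small: closing ARGV at the end of each file resets $. so that it behaves like FNR. A minimal sketch, where the helper name headers_per_file and the tab separator are assumptions for illustration:

```perl
use strict;
use warnings;

# Return one array ref per input file: [ file name, header fields... ].
# Call it with at least one file name; with an empty list <> would read STDIN.
sub headers_per_file {
    my @files = @_;
    local @ARGV = @files;            # let <> iterate over exactly these files
    my @headers;
    while (my $line = <>) {
        chomp $line;
        # $. is 1 on the first line of each file because of the close below.
        push @headers, [ $ARGV, split(/\t/, $line, -1) ] if $. == 1;
    }
    continue {
        close ARGV if eof;           # end of current file: resets $. to 0
    }
    return @headers;
}
```

Without the close ARGV if eof line, $. would keep counting across files, like awk's NR rather than FNR.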

    On the other hand, it also accumulates open files, so if you have enough distinct columns across a multi-file input set, you will run out of file descriptors. :-)
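A possible way to avoid that in a Perl version, sketched under assumptions: cap the number of simultaneously open handles, close the oldest one when the cap is reached, and reopen in append mode later. The names capped_fh and close_all and the cap of 64 are arbitrary choices for this example.

```perl
use strict;
use warnings;

my $MAX_OPEN = 64;    # assumed cap, well below typical descriptor limits
my (%fh_for, %seen, @order);

# Return a write handle for $path while keeping at most $MAX_OPEN files open.
sub capped_fh {
    my ($path) = @_;
    return $fh_for{$path} if $fh_for{$path};
    if (@order >= $MAX_OPEN) {                   # evict the oldest-opened file
        my $old    = shift @order;
        my $old_fh = delete $fh_for{$old};
        close $old_fh or die "Cannot close '$old': $!";
    }
    my $mode = $seen{$path}++ ? '>>' : '>';      # truncate once, append on reopen
    open my $fh, $mode, $path or die "Cannot open '$path': $!";
    push @order, $path;
    return $fh_for{$path} = $fh;
}

# Close all handles that are still open.
sub close_all {
    close $_ or die "close failed: $!" for values %fh_for;
    %fh_for = ();
    @order  = ();
}
```

Evicting the oldest-opened handle is the simplest policy; with many columns written round-robin it reopens files often, so a larger cap or a least-recently-used policy may serve better.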

    To make it matter, we would need to start a script over and over again.

    In the case of a one-liner simple enough to be replaced using sed in a shell script, we were talking about running it over and over again. The better answer is usually to rewrite the entire script in Perl, but sometimes a shell script is the right tool for the job, if the job consists almost entirely of running external programs with very little "local" data processing.