space_agent has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow seekers of perl wisdom,

To me, the following awk code is a piece of art. I got it from
http://www.unix.com/shell-programming-scripting/130349-fgrep-grep-awk-help-scanning-delimiters.html
I wonder if it is possible to write an equally fast and easy-to-understand Perl one-liner:

 awk 'NR==FNR {a[$1];next} $1 in a {print RS$0}' file1 RS=">" file2 > output
cat file1
name1
name2
name3

cat file2
>name3
text text text text
>name1
some kind of text

cat output
>name1
text text text text
>name3
some kind of text

The task is to extract from file2 the names that are present in file1, plus all the text that follows each name up to the next ">" character. Speed is important, as the files may well be several hundred MB in size. I would be eager to hear any comments on this.

cheers space_agent

Replies are listed 'Best First'.
Re: awk 2 perl oneliner
by almut (Canon) on Mar 05, 2010 at 22:15 UTC

    Not a oneliner, but maybe slightly more readable...

    #!/usr/bin/perl

    # create names lookup table from first file
    while (<>) {
        $names{$_} = 1;
        last if eof;
    }

    $/ = "\n>";    # set input record separator

    # scan second file
    while (<>) {
        print if /^>?(\w+\n)/ && $names{$1};
    }

    Usage:

    $ ./extract.pl file1 file2 >output

    Update: you could of course also make a oneliner out of it :)

    $ perl -ne'$/eq"\n"?$nm{$_}++:/^>?(\w+\n)/&&$nm{$1}&&print;$/="\n>"if eof' file1 file2 >output
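
    (If the condensed form is hard to follow, it expands to roughly this; my reading, not part of the original post:)

    while (<>) {
        if ($/ eq "\n") {              # still reading file1 line by line
            $nm{$_}++;                 # remember the name (newline included)
        }
        elsif (/^>?(\w+\n)/ && $nm{$1}) {
            print;                     # file2 record starting with a known name
        }
        $/ = "\n>" if eof;             # switch to record mode once file1 ends
    }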

    Or (approaching the realms of golfing and obfu):

    $ perl -ne'1..eof&&($/=">")?${$_}++:/.+\n/&${$&}&&print' file1 file2 >output
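
    And my reading of the golfed one, expanded (again just a sketch, not from the original post):

    while (<>) {
        # flip-flop: true from the first input line until eof of file1,
        # where it also flips the record separator to ">" for file2
        if (1 .. (eof && ($/ = ">"))) {
            ${$_}++;                   # count the name via a symbolic reference
                                       # (the key keeps its trailing newline)
        }
        else {
            # file2 record: print it if its first line is a known name
            print if /.+\n/ && ${$&};
        }
    }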

      Thanks a lot, your script is great. Your solution is also a lot faster than the awk one-liner. I modified it a bit to fit my real data.

      #!/usr/bin/perl

      # create names lookup table from first file
      while (<>) {
          # get rid of newline in name because there might be
          # something like >name1\stext\n in the second file
          chomp;
          $names{$_} = 1;
          last if eof;
      }

      $/ = "\n>";    # set input record separator

      # scan second file
      print ">";     # print first ">" that would be missing
      while (<>) {
          print if /^>?([\w\d]+)/ && $names{$1};
      }
      The only problem I have now is that, when applied to my real data, a ">" sign is missing at the beginning (which I worked around with a print before the second while loop) and there is one too many at the end.

        The only problem I have now is that, when applied to my real data, a ">" sign is missing at the beginning ... and there is one too many at the end.

        The reason for this is the way the records are being split. The first record (and only the first) will have a leading ">", and all records except the last will have a trailing ">". Now, if you happen not to print the first record (because it doesn't match), the initial ">" will be missing from the output file. Similarly, if you happen not to print the last record, there will be a stray ">" left over from the previous record.
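
        To see this concretely, here is a tiny demo (my own example, not from the thread) of how the records come out once $/ is "\n>":

        my $data = ">name1\ntext\n>name2\nmore text\n";
        open my $fh, '<', \$data or die $!;
        $/ = "\n>";
        while (my $rec = <$fh>) {
            print "record: [$rec]\n";
        }
        # prints:
        #   record: [>name1
        #   text
        #   >]
        #   record: [name2
        #   more text
        #   ]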

        The proper way to handle those corner cases would be to always remove any leading or trailing ">"s, and always add a new one for output:

        ...

        # scan second file
        while (<>) {
            s/^>//;    # remove leading '>'
            s/>$//;    # remove trailing '>'
            print ">$_" if /^([\w\d]+)/ && $names{$1};    # add '>'
        }

        (This way, you also no longer need to allow for a leading ">" in the regex, i.e. the >? part.)
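
        For completeness, the whole script with both fixes folded in might look like this (an untested sketch):

        #!/usr/bin/perl

        # create names lookup table from first file
        while (<>) {
            chomp;                  # strip newline so the key matches the capture below
            $names{$_} = 1;
            last if eof;
        }

        $/ = "\n>";                 # set input record separator

        # scan second file
        while (<>) {
            s/^>//;                 # remove leading '>' (first record only)
            s/>$//;                 # remove trailing '>' (all but the last record)
            print ">$_" if /^([\w\d]+)/ && $names{$1};
        }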

Re: awk 2 perl oneliner
by Marshall (Canon) on Mar 06, 2010 at 09:36 UTC
    I am not an awk expert, but from your problem statement, you have file1 with some names, and then another file2 with some data. The output swaps the data associated with >name1 and >name3. Can you show what would happen if there were, say, some >name2 or >name4 data there?

    I don't find: "awk 'NR==FNR {a[$1];next} $1 in a {print RS$0}' file1 RS=">" file2 > output" easy to understand.
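
    My best guess at what it does, spelled out in Perl (untested, with the file names hardcoded just for illustration):

    my %a;
    open my $f1, '<', 'file1' or die $!;   # NR==FNR pass: collect names
    while (<$f1>) {
        my ($name) = split;                # awk's $1
        $a{$name} = 1 if defined $name;    # awk's a[$1]
    }

    open my $f2, '<', 'file2' or die $!;
    $/ = ">";                              # awk's RS=">"
    while (<$f2>) {
        chomp;                             # drop the ">" separator
        my ($name) = split;                # first field of the record
        print ">$_" if defined $name && $a{$name};   # awk's print RS $0
    }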

      The output swaps the data associated with >name1 and >name3.
      Note that the output shown in the OP is incorrect. The actual output will be:
      >name3
      text text text text
      >name1
      some kind of text
      So it doesn't swap anything.
        Right, my mistake. Though the order of output is not important to me in this case.