space_agent has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow seekers of perl wisdom,

To me, the following awk code is a piece of art. I got it from
http://www.unix.com/shell-programming-scripting/130349-fgrep-grep-awk-help-scanning-delimiters.html
I wonder if it is possible to write an equally fast and easy-to-understand Perl one-liner:

 awk 'NR==FNR {a[$1];next} $1 in a {print RS$0}' file1 RS=">" file2 > output
cat file1
name1
name2
name3

cat file2
>name3
text text text text
>name1
some kind of text

cat output
>name1
text text text text
>name3
some kind of text

The task is to extract from file2 the names that are present in file1, plus all the text that follows each name up to the next ">" character. Speed is important, as the files may well be several hundred MB in size. I would be eager to hear any comments on this.

cheers space_agent

Replies are listed 'Best First'.
Re: awk 2 perl oneliner
by almut (Canon) on Mar 05, 2010 at 22:15 UTC

    Not a oneliner, but maybe slightly more readable...

    #!/usr/bin/perl

    # create names lookup table from first file
    while (<>) {
        $names{$_} = 1;
        last if eof;
    }

    $/ = "\n>";    # set input record separator

    # scan second file
    while (<>) {
        print if /^>?(\w+\n)/ && $names{$1};
    }

    Usage:

    $ ./extract.pl file1 file2 >output

    Update: you could of course also make a oneliner out of it :)

    $ perl -ne'$/eq"\n"?$nm{$_}++:/^>?(\w+\n)/&&$nm{$1}&&print;$/="\n>"if eof' file1 file2 >output
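
    (If the condensed form is hard to follow, it expands to roughly this; my reading, not part of the original post:)

    while (<>) {
        if ($/ eq "\n") {              # still reading file1 line by line
            $nm{$_}++;                 # remember the name (newline included)
        }
        elsif (/^>?(\w+\n)/ && $nm{$1}) {
            print;                     # file2 record starting with a known name
        }
        $/ = "\n>" if eof;             # switch to record mode once file1 ends
    }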

    Or (approaching the realms of golfing and obfu):

    $ perl -ne'1..eof&&($/=">")?${$_}++:/.+\n/&${$&}&&print' file1 file2 >output
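
    And my reading of the golfed one, expanded (again just a sketch, not from the original post):

    while (<>) {
        # flip-flop: true from the first input line until eof of file1,
        # where it also flips the record separator to ">" for file2
        if (1 .. (eof && ($/ = ">"))) {
            ${$_}++;                   # count the name via a symbolic reference
                                       # (the key keeps its trailing newline)
        }
        else {
            # file2 record: print it if its first line is a known name
            print if /.+\n/ && ${$&};
        }
    }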

      Thanks a lot, your script is great. Your solution is also a lot faster than the awk one-liner. I modified it a bit to fit my real data.

      #!/usr/bin/perl

      # create names lookup table from first file
      while (<>) {
          # get rid of newline in name because there might be
          # something like >name1\stext\n in the second file
          chomp;
          $names{$_} = 1;
          last if eof;
      }

      $/ = "\n>";    # set input record separator

      # scan second file
      print ">";     # print first ">" that would be missing
      while (<>) {
          print if /^>?([\w\d]+)/ && $names{$1};
      }
      The only problem I have now is that, when applied to my real data, a ">" sign is missing at the beginning (which I worked around with a print before the second while loop) and there is one too many at the end.

        The only problem I have now is that, when applied to my real data, a ">" sign is missing at the beginning ... and there is one too many at the end.

        The reason for this is the way the records are being split. The first record (and only the first) will have a leading ">", and all records except the last will have a trailing ">". Now, if you happen not to print the first record (because it doesn't match), the initial ">" will be missing from the output file. Similarly, if you happen not to print the last record, there will be a stray ">" left over from the previous record.
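
        To see this concretely, here is a tiny demo (my own example, not from the thread) of how the records come out once $/ is "\n>":

        my $data = ">name1\ntext\n>name2\nmore text\n";
        open my $fh, '<', \$data or die $!;
        $/ = "\n>";
        while (my $rec = <$fh>) {
            print "record: [$rec]\n";
        }
        # prints:
        #   record: [>name1
        #   text
        #   >]
        #   record: [name2
        #   more text
        #   ]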

        The proper way to handle those corner cases would be to always remove any leading or trailing ">"s, and always add a new one for output:

        ...

        # scan second file
        while (<>) {
            s/^>//;    # remove leading '>'
            s/>$//;    # remove trailing '>'
            print ">$_" if /^([\w\d]+)/ && $names{$1};    # add '>'
        }

        (This way, you also no longer need to allow for a leading ">" in the regex, i.e. the >? part.)
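
        For completeness, the whole script with both fixes folded in might look like this (an untested sketch):

        #!/usr/bin/perl

        # create names lookup table from first file
        while (<>) {
            chomp;                  # strip newline so the key matches the capture below
            $names{$_} = 1;
            last if eof;
        }

        $/ = "\n>";                 # set input record separator

        # scan second file
        while (<>) {
            s/^>//;                 # remove leading '>' (first record only)
            s/>$//;                 # remove trailing '>' (all but the last record)
            print ">$_" if /^([\w\d]+)/ && $names{$1};
        }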

Re: awk 2 perl oneliner
by Marshall (Canon) on Mar 06, 2010 at 09:36 UTC
    I am not an awk expert, but from your problem statement, you have file1 with some names, and then another file2 with some data. The output swaps the data associated with >name1 and >name3. Can you show what would happen if there were, say, some >name2 or >name4 data there?

    I don't find: "awk 'NR==FNR {a[$1];next} $1 in a {print RS$0}' file1 RS=">" file2 > output" easy to understand.
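
    My best guess at what it does, spelled out in Perl (untested, with the file names hardcoded just for illustration):

    my %a;
    open my $f1, '<', 'file1' or die $!;   # NR==FNR pass: collect names
    while (<$f1>) {
        my ($name) = split;                # awk's $1
        $a{$name} = 1 if defined $name;    # awk's a[$1]
    }

    open my $f2, '<', 'file2' or die $!;
    $/ = ">";                              # awk's RS=">"
    while (<$f2>) {
        chomp;                             # drop the ">" separator
        my ($name) = split;                # first field of the record
        print ">$_" if defined $name && $a{$name};   # awk's print RS $0
    }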

      The output swaps the data associated with >name1 and >name3.
      Note that the output shown in the OP is incorrect. The actual output will be:
      >name3
      text text text text
      >name1
      some kind of text
      So it doesn't swap anything.
        Right, my mistake. Though the order of output is not important to me in this case.