in reply to perl sort versus Unix sort
In Unix I cannot configure the input record separator. So if the data comes with embedded \n characters in the middle, this breaks the sort.
Is it essential that you preserve these "embedded \n" characters? Whatever the answer, if you can get the Unix sort process to create exactly the ordering you want, you could use Perl just to "normalize" the records so that they are all single-line and well-behaved when they go through the Unix sort. (But unless you're using GNU sort, you might still have a problem if some of the records end up being too long; I think some flavors of Unix sort may still have a limit of 1024 bytes per line.) (Update: For some reason, I feared that Solaris might be one such limited flavor, but I was wrong: I could pump lines of >8200 bytes through /usr/bin/sort with no loss of data. Still, if you're not on Linux or Solaris 8 or better, test it first.)
You are already handling record-based input by setting $/ in Perl, so why not try a pipeline like this (I'm not sure if your reference to "164" was a decimal or octal value; best to use hex and not worry about this ambiguity. I'll guess that you meant decimal):
    perl -pe 'BEGIN {$/="\xa4\n"} s/(?<!\xa4)\n/ /g' | sort ...
Or, if you want to preserve these "extra" line-feeds, replace them with some character or string that doesn't naturally occur in the data; then, after the sort is done, do another one-liner to re-convert these back to "\n".