in reply to some bioinformatics

Hi hgraf;

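That's a regular expression, and it's probably the most important bit of your code. The bits in parens are the capture groups; after a successful match, the text each one matched is available in $1, $2, ..., numbered by opening paren from left to right.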

\S+ means 1 or more non-whitespace characters, and \s* means 0 or more whitespace characters. It depends on the format of your data, but repeated (\S+)\s*(\S+)\s* sections, one (\S+) per field, may be enough.
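
To make that concrete, here is a rough sketch of the repeated-group idea. It's written in Python rather than Perl, only because the \S+ and \s* syntax is identical there; Perl's $1, $2, ... correspond to the tuple returned by groups(). The sample line and the four-field layout are made up for illustration, since I don't know what your data actually looks like.

```python
import re

# Hypothetical input line; the real data format is not shown in the thread.
line = "chr1 12345 geneA +"

# Four fields, each captured by a (\S+) group, with optional whitespace between.
pattern = re.compile(r"(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*")

match = pattern.match(line)
# groups() returns the captures in order: the equivalent of $1, $2, $3, $4.
fields = match.groups() if match else ()
print(fields)

# Because \s* also matches zero whitespace, a line with no separators still
# matches by splitting a single token across the groups.
loose = pattern.match("abcd")
print(loose.groups() if loose else None)

# Assumption: if fields are always separated by at least one whitespace
# character, \s+ rejects such lines instead of splitting them.
strict = re.compile(r"(\S+)\s+(\S+)\s+(\S+)\s+(\S+)")
print(strict.match("abcd"))
```

The last two checks are the thing to watch for: with \s*, a line that has fewer fields than you expect can still match, with one token carved up between the groups. If your fields are always whitespace-separated, \s+ between the groups is probably the safer choice.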