What the OP doesn't say is how large the templates are. If the string is relatively short, then making multiple regex passes over it is fine. The longer the template becomes, the more it matters: if he wants it to be fast, he really needs to get the number of passes as low as he can, since each pass rescans the whole string.
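To make the pass-count point concrete, here is a rough sketch in Python comparing one pass per key with a single pass over the template. The `{name}` placeholder syntax and both function names are my own assumptions for illustration, not anything the OP posted.

```python
import re

# Assumed placeholder syntax: {name}. Purely illustrative.
PLACEHOLDER = re.compile(r"\{(\w+)\}")

def render_multi_pass(template, values):
    # One full scan of the template for every key: N keys means N passes.
    for key, value in values.items():
        pattern = r"\{" + re.escape(key) + r"\}"
        # A function replacement avoids backslash handling in the value.
        template = re.sub(pattern, lambda m, v=value: v, template)
    return template

def render_single_pass(template, values):
    # One scan: each placeholder is looked up in the dict as it is found.
    # Unknown placeholders are left as they are.
    return PLACEHOLDER.sub(
        lambda m: values.get(m.group(1), m.group(0)), template
    )

values = {"greeting": "Hello", "name": "World"}
template = "{greeting}, {name}! {unknown}"
multi_result = render_multi_pass(template, values)
single_result = render_single_pass(template, values)
```

One thing to watch: the two are not strictly equivalent. The multi-pass version can re-substitute a placeholder that appears inside a previously inserted value, while the single-pass version never rescans what it has inserted.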
The final question is whether this swapping step is actually the bottleneck of the process. If the OP is running these as CGI processes, then it probably doesn't matter which algorithm he uses, because the startup cost of each CGI process is going to dominate, unless the templates being used are megabytes in size.
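Before tuning anything it is worth measuring the substitution on its own and comparing that against the total time per request. A minimal sketch, again assuming a `{name}` placeholder syntax; `time_substitution` is a hypothetical helper, not part of any library.

```python
import re
import time

# Assumed placeholder syntax: {name}. Purely illustrative.
_FIELD = re.compile(r"\{(\w+)\}")

def time_substitution(template, values, repeats=100):
    """Return the average seconds taken by one single-pass substitution."""
    start = time.perf_counter()
    for _ in range(repeats):
        _FIELD.sub(lambda m: values.get(m.group(1), m.group(0)), template)
    return (time.perf_counter() - start) / repeats

# Build a template of roughly the size you actually expect to serve.
sample = "Dear {name}, your order {order} has shipped. " * 1000
per_call = time_substitution(sample, {"name": "Ann", "order": "42"}, repeats=5)
```

If `per_call` turns out to be a small fraction of what it costs to start the process for each request, the choice of substitution algorithm is not where the time is going.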