Wouldn't using lookahead regular expressions be more efficient than going through the file letter by letter? That's what I was trying to do with the regexp in my post.
My script goes through the file line by line, not letter by letter. It should work even for large files where stuffing everything into memory might be a problem.
If you want to compare different solutions, benchmark them. The Benchmark module can help you.
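For what it's worth, here is a rough sketch of how such a comparison might look with Perl's core Benchmark module. The test data, the pattern `/foo(?= and)/`, and the sub names `by_line` and `whole_string` are all made up for illustration and are not taken from either of our scripts; swap in the real code and a realistic input file before drawing any conclusions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: 10_000 short lines held in memory.
my $data = join '', map { "line $_ with some foo and bar text\n" } 1 .. 10_000;

# Approach 1: read line by line (here from an in-memory filehandle).
sub by_line {
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my $count = 0;
    while ( my $line = <$fh> ) {
        # Count every match of the (hypothetical) lookahead pattern on this line.
        $count++ while $line =~ /foo(?= and)/g;
    }
    close $fh;
    return $count;
}

# Approach 2: run one lookahead regex over the whole string at once.
sub whole_string {
    my $count = () = $data =~ /foo(?= and)/g;
    return $count;
}

# Both approaches must agree before timing them means anything.
die "results differ" unless by_line() == whole_string();

# A negative count means "run each sub for at least that many CPU seconds".
cmpthese( -1, { by_line => \&by_line, whole_string => \&whole_string } );
```

Note that the whole-string version only works if the entire input fits in memory, so the timings alone don't settle the question for very large files.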