Sounds like a double parse is your best bet then, i.e. use something like the above, trap the exceptional lines, and reparse them to split them again.
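
Something along these lines is what I mean by a double parse. It is only a rough sketch: I don't know your real record layout, so the `RECORD` pattern (an integer id, a name, a decimal amount) and the function names `parse_strict`, `reparse` and `double_parse` are made up for illustration. The first pass accepts only clean lines; anything that raises gets a second, looser pass that pulls out each record-shaped piece.

```python
import re

# Hypothetical record layout (an assumption, not your real format):
# an integer id, an alphabetic name, and an amount with two decimals.
RECORD = re.compile(r"(\d+)\s+([A-Za-z]+)\s+(\d+\.\d{2})")


def parse_strict(line):
    """First pass: accept a line only if it is exactly one record."""
    m = RECORD.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"unparseable line: {line!r}")
    return (int(m.group(1)), m.group(2), float(m.group(3)))


def reparse(line):
    """Second pass: extract every record-shaped piece of a bad line."""
    return [(int(a), b, float(c)) for a, b, c in RECORD.findall(line)]


def double_parse(lines):
    """Parse strictly, trap the failures, and reparse them to split them."""
    records, rejects = [], []
    for line in lines:
        try:
            records.append(parse_strict(line))
        except ValueError:
            recovered = reparse(line)
            if recovered:
                records.extend(recovered)
            else:
                # Nothing recoverable: keep the line for manual review.
                rejects.append(line)
    return records, rejects


if __name__ == "__main__":
    sample = ["1 apple 2.50", "2 pear 1.253 plum 0.75", "garbage"]
    records, rejects = double_parse(sample)
    print(records)
    print(rejects)
```

In the sample, the second line is two records run together, so the strict pass rejects it and the second pass splits it back into two. Lines that yield nothing on either pass end up in `rejects` so they are not silently dropped.
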
Corruption is giving you the real headache. Why is the data so corrupt? It sounds like you are using PDFs generated by OCR. Have you looked at
Tesseract? I have used it when I needed to train a system to handle OCR, and improving the quality of your source data is always an option.
UnderMine