Sounds like a double parse is your best bet then: use something like the above for the first pass, then trap the exceptional lines and reparse them to split them further.
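
To make the idea concrete, here is a minimal sketch of a two-pass split. It is in Python only to keep it short and testable; the same shape carries over to Perl. Everything specific in it is an assumption for illustration, not something taken from your data: that entries start with a "[n]" marker at the beginning of a line, that an "exceptional" entry is one where OCR lost a line break and left a second marker embedded mid-entry, and the function names themselves.

```python
import re

# Assumed first-pass rule: an entry starts with "[n]" at the beginning of a line.
ENTRY_START = re.compile(r"^\[\d+\]\s*", re.MULTILINE)
# Assumed sign of a bad split: another "[n]" marker left inside an entry.
EMBEDDED_MARKER = re.compile(r"\s*\[\d+\]\s*")


def first_pass(text):
    """Split on line-leading markers and collapse runs of whitespace."""
    chunks = ENTRY_START.split(text)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def is_exceptional(entry):
    """Trap entries that still contain a marker after the first pass."""
    return EMBEDDED_MARKER.search(entry) is not None


def second_pass(entry):
    """Reparse a trapped entry by splitting on the embedded markers."""
    return [part for part in EMBEDDED_MARKER.split(entry) if part]


def extract_citations(text):
    """Run the first pass, then reparse only the exceptional entries."""
    citations = []
    for entry in first_pass(text):
        if is_exceptional(entry):
            citations.extend(second_pass(entry))
        else:
            citations.append(entry)
    return citations


if __name__ == "__main__":
    sample = (
        "[1] Smith, J. Parsing. 2001.\n"
        "[2] Jones, K. OCR. 2003. [3] Lee, M. PDFs. 2005.\n"
    )
    for citation in extract_citations(sample):
        print(citation)
```

The point of keeping the trap separate from the reparse is that you can tighten either rule as you find new kinds of corruption without touching the first pass.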

Corruption is giving you the real headache. Why is the data so corrupt? It sounds like you are using PDFs generated by OCR. Have you looked at Tesseract? I have used it when I needed to train a system to handle OCR, and improving the quality of your source data is always an option.

UnderMine

Re^4: Extracting Bibliography Citations
by Limbic~Region (Chancellor) on Sep 02, 2008 at 17:01 UTC
    UnderMine,
    Actually, I had given up on PDF::OCR because it wouldn't build on my test platform (Win32) - see Re: Extracting content text from PDFs for details. If there are seriously better solutions out there, I will give them a whirl - thanks. Unfortunately, the PDFs are what I have to work with, not the originals.

    Cheers - L~R