Let me clarify a bit. Since I can read the documents in the browser I know they contain only text so OCR is not an issue.
I think we have a little communication problem: Sure, you can read text displayed in Firefox when it was rendered from something like <html><body><h1>Hello</h1>. But you can also read text displayed in Firefox that was rendered from something like <html><body><img src="http://www.example.com/pics/hello.gif" alt="">, where the visible "Hello" exists only as pixels inside the image. Your computer can't, at least not as easily as you can. To extract the text from the latter, you need OCR.
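To illustrate what your computer "sees" in each case, here is a rough sketch using HTML::TreeBuilder (not the Mechanize-Firefox API, and not from the original thread): the first snippet yields its text straight away, the second yields nothing, because the greeting only exists inside hello.gif.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;   # from the HTML-Tree distribution

    # Two documents that look identical in Firefox:
    my $markup_doc = '<html><body><h1>Hello</h1></body></html>';
    my $image_doc  = '<html><body><img src="http://www.example.com/pics/hello.gif" alt=""></body></html>';

    for my $html ($markup_doc, $image_doc) {
        my $tree = HTML::TreeBuilder->new_from_content($html);
        my $text = $tree->as_text;          # collect only the text nodes
        $tree->delete;
        printf "extracted: '%s'\n", $text;  # "Hello" vs. "" -- the pixels would need OCR
    }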
All the documents follow a similar set of templates but the content changes for each.
Any chance of getting access to the data before the template engine creates the PDF? Perhaps as XML, JSON, CSV, or even HTML?
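If you can get at such a feed, pulling out the fields is trivial compared to scraping rendered output. A minimal sketch, assuming a hypothetical JSON payload (the field names are made up for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use JSON::PP;   # core module since Perl 5.14

    # Hypothetical data the template engine might consume before rendering the PDF.
    my $json = '{"invoice":"2012-0815","customer":"ACME Corp.","total":"1234.56"}';

    my $data = decode_json($json);
    printf "Invoice %s for %s, total %s\n",
        $data->{invoice}, $data->{customer}, $data->{total};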
Alexander