Hi Mark,
It seems like you might have experience with text extraction, based on your reply to my post. I posted another thread on PerlMonks; can you give me any advice on the post below?
I'm looking for someone who might have advice on building a file extraction tool. My company currently uses LAW PreDiscovery to extract text and metadata from files such as .msg, .doc, .pdf, .jpg, .gif, and so on. It is an out-of-the-box tool with many limitations that cause problems for me and the other engineers in our group, so we have been thinking about writing our own. I've looked at different modules on CPAN and found several that might help, but I don't know whether they are any good. Does anyone have experience building tools for extracting text from files? If so, what did you use, how did you do it, how big was your project, did you use modules, and did you use any APIs or DLLs from Microsoft for the Microsoft file formats (.doc, .ppt, .xls, etc.)?
We are looking to build an in house tool to do all our extraction and need advice on where to start.