Most of my experience with text extraction comes from parsing data and metadata out of files produced in scientific experiments. With the exception of .xls documents, these were custom formats, for which I would write compilers to extract and transform the text I wanted.
File formats are little computer languages in disguise, so the general approach of writing a compiler from the format you start with to the format you want will always work in principle. In practice, writing a compiler for each format can be arduous, and incomplete file format specifications (e.g., the .doc format) make it harder still.
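To make the "compiler" idea concrete, here is a minimal sketch in Python. The input format is entirely made up for illustration (a few `key: value` header lines, a `DATA` marker, then whitespace-separated numbers), and the names `parse`, `emit_csv`, and `SAMPLE` are hypothetical. The point is only the shape: a front end that reads the source format into a structure, and a back end that writes the format you want.

```python
import csv
import io


def parse(text):
    """Front end: split a hypothetical instrument file into metadata and rows."""
    meta, rows, in_data = {}, [], False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines anywhere in the file
        if line == "DATA":
            in_data = True  # everything after this marker is numeric data
        elif in_data:
            rows.append([float(x) for x in line.split()])
        else:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, rows


def emit_csv(meta, rows):
    """Back end: write each data row as CSV, tagged with the sample name."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sample", "time", "signal"])
    for t, s in rows:
        writer.writerow([meta.get("sample", ""), t, s])
    return out.getvalue()


# Made-up example input in the assumed format.
SAMPLE = """\
sample: A12
operator: jdoe
DATA
0.0 1.5
0.5 1.75
"""

if __name__ == "__main__":
    print(emit_csv(*parse(SAMPLE)), end="")
```

A real format will need a more careful front end (binary fields, optional sections, error reporting), but keeping the parse and emit stages separate lets you swap the output format without touching the parser.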
In your case, you are ETL'ing standard, albeit very different, formats. If I were you, I would take advantage of the programs that create these formats to do the extraction. Use Microsoft Word to convert .doc files to plain text. Use Excel to convert .xls files to CSV. Both can be scripted easily enough using VBA/Visual Basic/Visual C#, and will extract all the text there is to extract from the documents. From there, it is easy to write Perl programs to transform the resulting text to your custom needs.
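As a rough illustration of that last transform step, here is a minimal sketch of post-processing a CSV file that Excel has already exported. It is written in Python for brevity; a Perl version would have the same shape. The column names (`Sample`, `Notes`, `Reading`) and the function name `transform` are assumptions made up for the example, not anything from your data.

```python
import csv
import io


def transform(csv_text, wanted):
    """Keep only the wanted columns, trimming whitespace and skipping blank rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # Missing or empty cells become "", everything else is trimmed.
        record = {col: (row.get(col) or "").strip() for col in wanted}
        if any(record.values()):  # drop rows where every wanted cell is empty
            records.append(record)
    return records


# Made-up example of what an exported sheet might look like.
EXPORTED = "Sample,Notes,Reading\nA12, first run ,1.5\n,,\nB07,,2.25\n"

if __name__ == "__main__":
    for record in transform(EXPORTED, ["Sample", "Reading"]):
        print(record)
```

Once the export is plain CSV or plain text, this stage is ordinary text munging and no longer depends on Word or Excel at all.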