Greetings to the esteemed monks. I'm back after a decade, writing Perl code again. Woe is me.
I hope this isn't too general a question, but I'm trying to avoid rediscovering fire.
I have about 10,000 files, most of them image-only scans of printed documents in PDF format, and I need to do two things. First, I need to build a helper program that suggests an appropriate file name for each file, based partly or entirely on its content; a user will then either accept the Perl-generated name or change it. Second, I have to put all these files on a password-protected web site and make the whole mess searchable. Here is my best guess at how to do these things.
1. Extract the image from each file using ImageMagick, then run OCR on it with Tesseract to produce a separate, but linked, text file.
2. Now, I can use the text file as input to my renaming assistant, which will use regular expressions to identify keywords.
3. Then, I can store the OCR text and the linked original image in a MySQL database on the web site, and use SQL queries to run string searches on whatever users type into an HTML search box.
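As a rough illustration of step 2, a renaming assistant along these lines might look like the sketch below. The keyword list, the two date formats, and the name `suggest_filename` are all hypothetical placeholders; the real patterns would depend on what the documents actually contain.

```perl
use strict;
use warnings;

# Hypothetical keyword patterns; the real list depends on the documents.
my @KEYWORDS = (
    [ invoice  => qr/\binvoice\b/i ],
    [ contract => qr/\b(?:contract|agreement)\b/i ],
    [ letter   => qr/\bdear\s+\w+/i ],
);

# Build a suggested file name from OCR text:
# the first date found (if any), followed by every keyword label that matches.
sub suggest_filename {
    my ($text) = @_;
    my @parts;

    # Accept YYYY-MM-DD or M/D/YYYY (assumed US order) and normalise to YYYY-MM-DD.
    if ( $text =~ /\b(\d{4})-(\d{2})-(\d{2})\b/ ) {
        push @parts, "$1-$2-$3";
    }
    elsif ( $text =~ m{\b(\d{1,2})/(\d{1,2})/(\d{4})\b} ) {
        push @parts, sprintf '%04d-%02d-%02d', $3, $1, $2;
    }

    for my $kw (@KEYWORDS) {
        my ( $label, $re ) = @$kw;
        push @parts, $label if $text =~ $re;
    }

    # Fall back to a placeholder so the user always has something to edit.
    push @parts, 'unnamed' unless @parts;
    return join( '_', @parts ) . '.pdf';
}

# Prints: 2009-03-14_invoice_letter.pdf
print suggest_filename("INVOICE\nDate: 3/14/2009\nDear Mr Smith,"), "\n";
```

The suggestion is only a starting point: the user would still see it and either accept it or type over it, as described above.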
I can write the Perl code all right, but I'm not sure if this is the best, or right, way to set up the project. Is there a better approach?
Oh, and I have looked at several online custom search vendors, but the need to keep the data secure, and those vendors' inability to search password-protected data, probably rule out that approach, I am sad to discover.