I'm still very new to programming in Perl. I was wondering if there is a way to create metadata from a text file. Basically, this is what I need to do:

Read in a .txt file (I've got this done, simple enough)
Find all common words such as "the", "this", "then", "and", "or", etc.
(See http://esl.about.com/library/vocabulary/bl1000_list1.htm for the list of most common words.)
Then take what is left and start creating the metadata. While pulling the words that are not common, check whether each word has already been pulled before. (I guess a simple loop over an array would do for the check.)
Finally, populate the metadata into a database so that when you do a search you will find that text file.
The text file actually starts as a PDF, and PDFtoTXT converts it to a text file.

So basically my question is: how can I read the file one word at a time, and how can I quickly remove all the common words? (I assume you'd put all the common words in an array of some sort and then check the current word against that array.)
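The steps above could be sketched roughly as follows. This is only a minimal sketch, and it assumes a hash rather than an array for the lookup, since a hash test avoids looping over the whole list for every word. The sub name `extract_keywords` and the short stop-word list are made up for illustration; the real list would come from the page linked above.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated stop-word list; in practice load the full list from a file.
my %is_common = map { $_ => 1 } qw(the this then and or a an of to in is it);

# Return the distinct, non-common words of a string, in order of first appearance.
sub extract_keywords {
    my ($text) = @_;
    my (%seen, @keywords);

    # Match one word at a time: runs of letters, allowing internal apostrophes.
    while ($text =~ /([A-Za-z]+(?:'[A-Za-z]+)*)/g) {
        my $word = lc $1;            # compare case-insensitively
        next if $is_common{$word};   # skip common words
        next if $seen{$word}++;      # skip words already collected
        push @keywords, $word;
    }
    return @keywords;
}

my @keywords = extract_keywords("The cat and the dog chased THE cat.");
print join(' ', @keywords), "\n";    # prints "cat dog chased"
```

For a whole file, the same sub could be called on each line read from the filehandle, with `%seen` moved outside the sub so that duplicates are tracked across lines.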
I know PDF documents MIGHT be very long, but I think the limit for an Oracle VARCHAR2 column is at least 5000 bytes. So if all else fails, I'll just truncate any metadata over 5000 bytes (characters).
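One possible way to do that truncation without cutting a word in half is sketched below. It assumes the database stores text as UTF-8, so the byte count is taken from the encoded string rather than the character count. The sub name `join_within_limit` is hypothetical, and the limit is passed in as a parameter because the actual VARCHAR2 limit should be confirmed against the Oracle documentation.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Join words with single spaces, stopping before the UTF-8 encoded
# result would exceed $max_bytes. Assumes a UTF-8 database character set.
sub join_within_limit {
    my ($max_bytes, @words) = @_;
    my $out = '';
    for my $word (@words) {
        my $candidate = length($out) ? "$out $word" : $word;
        # Measure bytes, not characters, since the column limit is in bytes.
        last if length(encode('UTF-8', $candidate)) > $max_bytes;
        $out = $candidate;
    }
    return $out;
}

print join_within_limit(14, qw(invoice ledger audit)), "\n";   # prints "invoice ledger"
```

Stopping at the first word that does not fit keeps the earliest keywords, which matches the order in which they were pulled from the file.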

In reply to Creating Metadata from Text File by Trihedralguy
