Trihedralguy has asked for the wisdom of the Perl Monks concerning the following question:

I'm still very new to programming in Perl. I was wondering if there is a method for creating metadata from a text file. Basically, this is what I need to do:

Read in a .txt file (I've got this done, simple enough)
Find all common words (like "The", "This", "Then", "And", "Or", etc.).
(See http://esl.about.com/library/vocabulary/bl1000_list1.htm for the list of most common words.)
Then we take what is left and start creating our metadata. But while we are pulling out the uncommon words, we want to check whether we've already pulled that word before (I guess through an easy loop, maybe checking an array).
Finally, populate the metadata into a database so that when you do a search you will find that text file.
The text file actually starts as a PDF; through PDFtoTXT it's converted to a text file.

So basically my question is: how can I go about reading one word at a time, and how can I quickly remove all common words? (I assume you'd put all the common words in an array of some sort and then check that array against the word currently being checked.)
I know PDF documents MIGHT be very long, but I think the limit for an Oracle VARCHAR2 column is at least 5000 bytes. So if all else fails I'll just truncate any metadata over 5000 bytes (characters).

Replies are listed 'Best First'.
Re: Creating Metadata from Text File
by FunkyMonk (Bishop) on Jul 20, 2007 at 16:52 UTC
    I'd build a hash of all the words in the file and then remove the common words. Use each word as a key in the hash; its value doesn't matter.
    open my $IN, "<", "myfile.txt" or die $!;
    my %seen;
    while ( <$IN> ) {
        $seen{$_}++ for split;
    }
    delete $seen{$_} for qw/all my common words/;
    my @metadata = keys %seen;
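    A variation on the same hash-of-words idea, with lowercasing and punctuation stripping added so "The" and "lamb," collapse to single keys. The document text and the stop-word list here are made-up placeholders:

```perl
use strict;
use warnings;

# Placeholder document text; a real run would open the .txt file instead.
my $text = "Mary had a little lamb, The lamb was white.\n";
open my $IN, '<', \$text or die $!;

my %seen;
while (<$IN>) {
    # lowercase and split on non-word characters, so "The" and "lamb,"
    # normalize to "the" and "lamb" before being counted
    $seen{$_}++ for grep { length } split /\W+/, lc;
}

# Placeholder stop-word list.
delete $seen{$_} for qw/a had the was/;

my @metadata = sort keys %seen;
print "@metadata";    # lamb little mary white
```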

Re: Creating Metadata from Text File
by graff (Chancellor) on Jul 21, 2007 at 01:03 UTC
    So you're talking about building an index for a set of documents, and using a list of "stop words" so that only the "useful" words are indexed. Presumably, for each "useful" word, you want to keep track of all the documents that contain that word. As one of the other replies points out, a database server can be a good tool for this sort of thing, but the basic index could start out as just a list of rows containing two fields: "doc_id usable_word", to indicate that a particular useful word was found in a particular document.

    Since you already know where your list of stop words (the non-useful words) comes from, you could start out like this:

    #!/usr/bin/perl
    use strict;

    ( @ARGV == 2 and -f $ARGV[0] and -f $ARGV[1] )
        or die "Usage: $0 stopword.list document.file\n";

    my ( %stopwords, %docwords );
    my ( $stopword_name, $document_name ) = @ARGV;

    open( I, "<", $stopword_name ) or die "$stopword_name: $!";
    while (<I>) {
        my @words = grep /^[a-z]+$/, map { lc() } split /\W+/;
        $stopwords{$_} = undef for ( @words );
    }
    close I;

    open( I, "<", $document_name ) or die "$document_name: $!";
    while (<I>) {
        for ( grep /^[a-z]+$/, map { lc() } split /\W+/ ) {
            $docwords{$_} = undef unless ( exists( $stopwords{$_} ) );
        }
    }
    close I;

    for ( keys %docwords ) {
        print "$document_name\t$_\n";
    }
    If you run that on each document file, and concatenate all the outputs together into a simple two column table, you can then provide a search tool that uses a simple query like:
    SELECT distinct(doc_id) from doc_word_index where doc_word = ?
    When a user wants all docs that contain "foo" or "bar" (or "baz" or ...), just keep adding " or doc_word = ?" clauses to that query. Other boolean queries ("this_word and that_word", etc.) can be set up easily as well.
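    The "keep adding clauses" step can be sketched like this; the table and column names are the hypothetical ones from this reply, and the actual DBI execution is shown only in comments since it needs a live database handle:

```perl
use strict;
use warnings;

# Hypothetical search terms entered by a user.
my @terms = qw(foo bar baz);

# One "doc_word = ?" placeholder per term, joined with " or ".
my $where = join ' or ', ('doc_word = ?') x @terms;
my $sql   = "SELECT distinct(doc_id) from doc_word_index where $where";

print "$sql\n";

# With a real database handle, the bound execution would look like:
#   my $docs = $dbh->selectcol_arrayref($sql, undef, @terms);
```

Using placeholders rather than interpolating the terms directly also protects the query from quoting problems and injection.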

    There are plenty more bells and whistles you can add as you come up with them... things like "stemming" (so a doc that contains only "blooming" or "blooms" or "bloomed" will be found when the search term is "bloom"), "relevance" (sort the returned list based on counting the number of distinct search terms per doc), and so on.

    (update -- forgot to mention: When building a simple table like that, don't forget to tell the database system to create an index on the "doc_word" column, so that the queries can be answered quickly, without having to do a full-table scan every time.)

      I love you, but now my weekend is ruined because I finally understand how to do this project... I'll keep you posted!! :)
        One other thing: you may want to apply the stop-word list to the query terms that someone submits when doing a search. You know these words are not in the index, so why waste time querying for them? (It might even serve as a form of instruction for the user: "based on what you entered, here are the words being used in the search: ...")

        Also, after you load the index table and you know how many docs are indexed (let's say it's 5000), you might want to try a query like:

        SELECT count(doc_id),doc_word from doc_word_index group by doc_word order by count(doc_id) desc limit 20
        If there are words that occur in all 5000 docs, you might as well add those to your stop list. (If the output of that particular query shows all 20 words with "5000", set the limit higher, to see how many words there are that occur in all documents.)

        In fact, if you start out by indexing all words, you can build your own stop list this way, and it might be more effective than just assuming that someone else's list of "most frequent words" is appropriate for your particular set of docs. You might also decide that the threshold for inclusion in the stop list is something like "occurs in 90% of docs", as opposed to "occurs in all docs". (The "document frequency" of words -- how many docs contain a given word -- can be a useful metric for assigning weights to search terms when you get into ranking the "hits" according to "relevance".)
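        A sketch of the roll-your-own stop list idea, using a made-up document-frequency table in place of counts queried from a real index:

```perl
use strict;
use warnings;

# Toy document-frequency counts (word => number of docs containing it);
# a real version would compute these from the doc_word_index table
# with the GROUP BY query shown above.
my %df = (
    the   => 5000,
    pdf   => 4600,
    apple => 12,
);
my $total_docs = 5000;
my $threshold  = 0.90;    # "occurs in 90% of docs"

# Any word whose document frequency meets the threshold goes on the stop list.
my @stoplist = sort grep { $df{$_} / $total_docs >= $threshold } keys %df;
print "@stoplist";    # pdf the
```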

        Note that the "most frequent words" list you cited includes things like "number", "sound", "water", "air", "father", "mother", "country", etc., but these might occur in only some of your docs -- someone might have a valid expectation that they would be useful as search terms, and it would be wrong not to index them.

Re: Creating Metadata from Text File
by EvanK (Chaplain) on Jul 20, 2007 at 16:37 UTC
    You could slurp in said files (read the whole file at once, rather than line by line), then split on whitespace:
    # slurp file contents
    my $contents;
    {
        local $/ = undef;
        open(my $handle, '<', "filename") or die("error: $!");
        $contents = <$handle>;
        close $handle;
    }

    # split into array on consecutive whitespace
    my @words = split /\s+/, $contents;
    As far as removing the common words, you could use indexes from List::MoreUtils to get the indices of the common words and remove them, OR get the indices of the uncommon words and add them to another array. One of the more experienced monks may have a better solution, though.
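    For the filtering itself, a core-only grep against a hash of the common words avoids the index bookkeeping entirely (the word lists here are placeholders):

```perl
use strict;
use warnings;

# Placeholder stop-word hash and word list.
my %common = map { $_ => 1 } qw(the and or a);
my @words  = qw(the piano and the snow);

# Keep only the words that are not in the stop-word hash;
# the hash lookup is constant time per word.
my @keep = grep { !$common{$_} } @words;
print "@keep";    # piano snow
```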

    __________
    Systems development is like banging your head against a wall...
    It's usually very painful, but if you're persistent, you'll get through it.

Re: Creating Metadata from Text File
by poqui (Deacon) on Jul 20, 2007 at 18:00 UTC
    You mention the "limit i think for an oracle varchar2 table is at least 5000 bytes"; I think you mean that a single VARCHAR2 column can hold at most 4000 single-byte characters (fewer if using a multibyte character set).
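    If the truncation fallback is used, clipping to that 4000-byte ceiling is a one-liner; the sample string below is made up:

```perl
use strict;
use warnings;

# Stand-in for a long generated word list (22 chars x 300 = 6600).
my $metadata = 'apple orchard harvest ' x 300;

# A single-byte-charset VARCHAR2 column holds at most 4000 bytes,
# so clip anything longer before the INSERT.
my $MAX = 4000;
$metadata = substr( $metadata, 0, $MAX ) if length $metadata > $MAX;

print length $metadata;    # 4000
```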

    Does that mean you are going to store the entire word list you are generating in a single Oracle column on a table? That could be a very large table... and searching against a freeform column (which I assume is why you are putting this data into an Oracle table) is *extremely* slow.

    What is the format for your metadata, ultimately? Are you attempting to encode RDF or RDFS or something else?
      I'm indexing PDFs for a quick search of all of our PDF documentation. I will be sticking them into one column, but I hope none of the columns will come close to the 5000 max, because of all this "elimination" of common words and duplicate words. I may eventually even limit it to the first x number of words, as I feel that if you are looking for a specific document about, say, apples, the word apples is going to appear within the first couple of paragraphs at least.
      Do you have any other suggestions rather than going this route?
      Ultimately I'm just indexing the PDFs so that I can point back to them later. PDF is a good format for storing massive amounts of documentation; I'm just providing the ability to search all of them at once.
        Yours sounds like an adequate "brute force" method; but if you have the time, you should take a look at RDF (Resource Description Framework), which is the standard for metadata about documents and other things that a library might consider a "resource". It's being extended to encompass other things as well, like code and databases, but it started right where you are now.

        I suggest it because there are tools to search RDF for matching resources, based on subject and meaning, rather than just the appearance of certain words.
Re: Creating Metadata from Text File
by leocharre (Priest) on Jul 20, 2007 at 18:08 UTC

    Some of the things you mention sound a little fuzzy, vague.

    I am assuming you are trying to index content. That's not exactly metadata, but... sure. It *is* data about data.

    My present way of doing what sounds like your task is to actually break the contents up into a 'data' table, recording page, line, and content. Essentially I record *everything*. I have about 30 million or so data rows. The database was at 3 gigs, I think, last time I checked.

    Sounds crazy? Well, it works. I can do searches just fine. Saving all content may be overkill for you.

    At least it prevents me from making a judgement call about what is disposable data and what is not.

    mysql> describe data;
    +-------------+---------+------+-----+---------+-------+
    | Field       | Type    | Null | Key | Default | Extra |
    +-------------+---------+------+-----+---------+-------+
    | id          | int(10) |      | PRI | 0       |       |
    | page_number | int(10) |      | PRI | 0       |       |
    | line_number | int(10) |      | PRI | 0       |       |
    | content     | text    |      | MUL |         |       |
    +-------------+---------+------+-----+---------+-------+
    4 rows in set (0.04 sec)
    

    That's my little table. I'm sure it could be improved further.

      Storing pages and pages of data is not what I want to do. Per my application, if I have the following "document", I will get the following metadata:
      Mary had a little lamb,
      Its fleece was white as snow;
      And everywhere that Mary went,
      The lamb was sure to go.

      He followed her to school one day;
      That was against the rule;
      It made the children laugh and play;
      To see a lamb at school.
      And so the teacher turned it out,
      But still it lingered near,
      And waited patiently about
      Till Mary did appear.
      "Why does the lamb love Mary so?"
      The eager children cry;
      "Why, Mary loves the lamb, you know,"
      The teacher did reply.

      Removing common words, repeated words, and punctuation you get: Mary little lamb fleece white snow followed school day rules made children laugh play teacher turned near waited patiently appear love eager children reply. That's 155 characters with spaces, down from 460 characters with spaces.

        You mentioned yourself that you want to be able to search text. Matching that text against the database, you want to find where that file is on disk.

        Imagine the following example: you have the words to some lyrics in your mind... you remember that some part of the song goes "and the piano has been drinking not me". So you go to Google and do an exact search for that string.

        So, by your example, loosely applied:

        #!/usr/bin/perl -w
        use strict;

        my $song = "The piano has been drinking My necktie's asleep The combo went back to New York, and left me all alone The jukebox has to take a leak Have you noticed that the carpet needs a haircut? And the spotlight looks just like a prison break And the telephone's out of cigarettes As usual the balcony's on the make And the piano has been drinking, heavily The piano has been drinking And he's on the hard stuff tonight The piano has been drinking And you can't find your waitress Even with the Geiger counter And I guarantee you that she will hate you From the bottom of her glass And all of your friends remind you That you just can't get served without her The piano has been drinking The piano has been drinking And the lightman's blind in one eye And he can't see out of the other And the piano-tuner's got a hearing aid And he showed up with his mother And the piano has been drinking Without fear of contradiction I say The piano has been drinking Our Father who art in ? Hallowed by thy glass Thy kindom come, thy will be done On Earth as it is in the lounges Give us this day our daily splash Forgive us our hangovers As we forgive all those who continue to hangover against us And lead us not into temptation But deliver from evil and someone you must all ride home Because the piano has been drinking And he's your friend not mine Because the piano has been drinking And he's not my responsibility The bouncer is this Sumo wrestler Kinda cream puff casper milk toast And the owner is just a mental midget With the I.Q. of a fencepost I'm going down, hang onto me, I'm going down Watch me skate across an acre of linoleum I know I can do it, I'm in total control And the piano has been drinking And he's embarassing me The piano has been drinking, he raided his mini bar The piano has been drinking And the bar stools are all on fire And all the newspapers were just fooling And the ash-trays have retired And I've got a feeling that the piano has been drinking It's just a hunch The piano has been drinking and he's going to lose his lunch And the piano has been drinking Not me, not me, The piano has been drinking not me";

        my $word  = {};
        my $words = [];
        while ( $song =~ /[\W]*(\w+)[\W]*/g ) {
            unless ( exists $word->{$1} ) {
                push @$words, $1;
            }
            $word->{$1}++;
        }

        my $summary = join(' ', @$words);
        printf "original charcount: %s new charcount: %s\n words selected: %s\n\n",
            length $song, length $summary, $summary;

        Even before we take out stop words, etc., we get:

        original charcount: 2129
        new charcount: 1126
        
        words selected:
        The piano has been drinking My necktie s asleep combo
         went back to New York and left me all alone jukebox
         take a leak Have you noticed that the carpet needs 
        haircut And spotlight looks just like prison break 
        telephone out of cigarettes As usual balcony on make
        heavily he hard stuff tonight can t find your waitress
         Even with Geiger counter I guarantee she will hate
         From bottom her glass friends remind That get served
         without lightman blind in one eye see other tuner 
        got hearing aid showed up his mother Without fear 
        contradiction say Our Father who art Hallowed by thy 
        Thy kindom come be done On Earth as it is lounges 
        Give us this day our daily splash Forgive hangovers
         we forgive those continue hangover against lead 
        not into temptation But deliver from evil someone 
        must ride home Because friend mine my responsibility 
        bouncer Sumo wrestler Kinda cream puff casper milk
         toast owner mental midget With Q fencepost m going
         down hang onto Watch skate across an acre linoleum 
        know do total control embarassing raided mini bar 
        stools are fire newspapers were fooling ash trays 
        have retired ve feeling It hunch lose lunch Not
        

        So, as a human being, searching for this song against your database, I would not be able to find it quite so easily.

        What will be interacting with your data? A human being or a computer? I'm not trying to be a smart ass. But if you want a human being to search for things against your database, then you are wrong: pages and pages of text is what you want, and all you did in your example was turn 460 characters with spaces into 155 characters of junk.