Trihedralguy has asked for the wisdom of the Perl Monks concerning the following question:

I'm still very new to programming in Perl. I was wondering if there is a method for creating metadata from a text file. Basically, this is what I need to do:

Read in a .txt file (I've got this done, simple enough)
Find all common words (like "The", "This", "Then", "And", "Or", etc.).
(See http://esl.about.com/library/vocabulary/bl1000_list1.htm for the list of most common words.)
Then we take what is left and start creating our metadata. But while we are pulling out the uncommon words, we want to check whether we've already pulled that word before (I guess through an easy loop, maybe checking an array).
Finally, populate the metadata into a database so that when you do a search you will find that text file.
The text file actually starts as a PDF; through PDFtoTXT it's converted to a text file.

So basically my question is: how can I go about reading one word at a time, and how can I quickly remove all common words? (I assume you'd put all the common words in an array of some sort and then check that array against the word currently being checked.)
I know PDF documents MIGHT be very long, but I think the limit for an Oracle VARCHAR2 column is at least 5000 bytes. So if all else fails I'll just truncate any metadata over 5000 bytes (characters).

Replies are listed 'Best First'.
Re: Creating Metadata from Text File
by FunkyMonk (Bishop) on Jul 20, 2007 at 16:52 UTC
    I'd build a hash of all the words in the file and then remove the common words. Use each word as a key in the hash; its value doesn't matter.
    open my $IN, "<", "myfile.txt" or die $!;
    my %seen;
    while ( <$IN> ) {
        $seen{$_}++ for split;
    }
    delete $seen{$_} for qw/all my common words/;
    my @metadata = keys %seen;
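    A variation on the same hash-of-words idea, with lowercasing and punctuation stripping added so "The" and "lamb," collapse to single keys. The document text and the stop-word list here are made-up placeholders:

```perl
use strict;
use warnings;

# Placeholder document text; a real run would open the .txt file instead.
my $text = "Mary had a little lamb, The lamb was white.\n";
open my $IN, '<', \$text or die $!;

my %seen;
while (<$IN>) {
    # lowercase and split on non-word characters, so "The" and "lamb,"
    # normalize to "the" and "lamb" before being counted
    $seen{$_}++ for grep { length } split /\W+/, lc;
}

# Placeholder stop-word list.
delete $seen{$_} for qw/a had the was/;

my @metadata = sort keys %seen;
print "@metadata";    # lamb little mary white
```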

Re: Creating Metadata from Text File
by graff (Chancellor) on Jul 21, 2007 at 01:03 UTC
    So you're talking about building an index for a set of documents, and using a list of "stop words" so that only the "useful" words are indexed. Presumably, for each "useful" word, you want to keep track of all the documents that contain that word. As one of the other replies points out, a database server can be a good tool for this sort of thing, but the basic index could start out as just a list of rows containing two fields: "doc_id usable_word", to indicate that a particular useful word was found in a particular document.

    Since you already know where your list of stop words (the non-useful words) comes from, you could start out like this:

    #!/usr/bin/perl
    use strict;

    ( @ARGV == 2 and -f $ARGV[0] and -f $ARGV[1] )
        or die "Usage: $0 stopword.list document.file\n";

    my ( %stopwords, %docwords );
    my ( $stopword_name, $document_name ) = @ARGV;

    open( I, "<", $stopword_name ) or die "$stopword_name: $!";
    while (<I>) {
        my @words = grep /^[a-z]+$/, map { lc() } split /\W+/;
        $stopwords{$_} = undef for ( @words );
    }
    close I;

    open( I, "<", $document_name ) or die "$document_name: $!";
    while (<I>) {
        for ( grep /^[a-z]+$/, map { lc() } split /\W+/ ) {
            $docwords{$_} = undef unless ( exists( $stopwords{$_} ) );
        }
    }
    close I;

    for ( keys %docwords ) {
        print "$document_name\t$_\n";
    }
    If you run that on each document file, and concatenate all the outputs together into a simple two column table, you can then provide a search tool that uses a simple query like:
    SELECT distinct(doc_id) from doc_word_index where doc_word = ?
    When a user wants all docs that contain "foo" or "bar" (or "baz" or ...), just keep adding " or doc_word = ?" clauses to that query. Other boolean queries ("this_word and that_word", etc.) can be set up easily as well.
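    The "keep adding clauses" step can be sketched like this; the table and column names are the hypothetical ones from this reply, and the actual DBI execution is shown only in comments since it needs a live database handle:

```perl
use strict;
use warnings;

# Hypothetical search terms entered by a user.
my @terms = qw(foo bar baz);

# One "doc_word = ?" placeholder per term, joined with " or ".
my $where = join ' or ', ('doc_word = ?') x @terms;
my $sql   = "SELECT distinct(doc_id) from doc_word_index where $where";

print "$sql\n";

# With a real database handle, the bound execution would look like:
#   my $docs = $dbh->selectcol_arrayref($sql, undef, @terms);
```

Using placeholders rather than interpolating the terms directly also protects the query from quoting problems and injection.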

    There are plenty more bells and whistles you can add as you come up with them... things like "stemming" (so a doc that contains only "blooming" or "blooms" or "bloomed" will be found when the search term is "bloom"), "relevance" (sort the returned list based on counting the number of distinct search terms per doc), and so on.

    (update -- forgot to mention: When building a simple table like that, don't forget to tell the database system to create an index on the "doc_word" column, so that the queries can be answered quickly, without having to do a full-table scan every time.)

      I love you, but now my weekend is ruined because I finally understand how to do this project... I'll keep you posted!! :)
        One other thing: you may want to apply the stop-word list to the query terms that someone submits when doing a search. You know these words are not in the index, so why waste time querying for them? (It might even serve as a form of instruction for the user: "based on what you entered, here are the words being used in the search: ...")

        Also, after you load the index table and you know how many docs are indexed (let's say it's 5000), you might want to try a query like:

        SELECT count(doc_id),doc_word from doc_word_index group by doc_word order by count(doc_id) desc limit 20
        If there are words that occur in all 5000 docs, you might as well add those to your stop list. (If the output of that particular query shows all 20 words with "5000", set the limit higher, to see how many words there are that occur in all documents.)

        In fact, if you start out by indexing all words, you can build your own stop list this way, and it might be more effective than just assuming that someone else's list of "most frequent words" is appropriate for your particular set of docs. You might also decide that the threshold for inclusion in the stop list is something like "occurs in 90% of docs", as opposed to "occurs in all docs". (The "document frequency" of words -- how many docs contain a given word -- can be a useful metric for assigning weights to search terms when you get into ranking the "hits" according to "relevance".)
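        A sketch of the roll-your-own stop list idea, using a made-up document-frequency table in place of counts queried from a real index:

```perl
use strict;
use warnings;

# Toy document-frequency counts (word => number of docs containing it);
# a real version would compute these from the doc_word_index table
# with the GROUP BY query shown above.
my %df = (
    the   => 5000,
    pdf   => 4600,
    apple => 12,
);
my $total_docs = 5000;
my $threshold  = 0.90;    # "occurs in 90% of docs"

# Any word whose document frequency meets the threshold goes on the stop list.
my @stoplist = sort grep { $df{$_} / $total_docs >= $threshold } keys %df;
print "@stoplist";    # pdf the
```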

        Note that the "most frequent words" list you cited includes things like "number", "sound", "water", "air", "father", "mother", "country", etc., but these might occur in only some of your docs -- someone might have a valid expectation that they would be useful as search terms, and it would be wrong not to index them.

Re: Creating Metadata from Text File
by EvanK (Chaplain) on Jul 20, 2007 at 16:37 UTC
    You could slurp in said files (read the whole file at once, rather than line by line), then split on whitespace:
    # slurp file contents
    my $contents;
    {
        local $/ = undef;
        open(my $handle, '<', "filename") or die("error: $!");
        $contents = <$handle>;
        close $handle;
    }

    # split into array on consecutive whitespace
    my @words = split /\s+/, $contents;
    As far as removing the common words, you could use indexes from List::MoreUtils to get the indices of the common words and remove them, OR get the indices of the uncommon words and add them to another array. One of the more experienced monks may have a better solution, though.
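    For the filtering itself, a core-only grep against a hash of the common words avoids the index bookkeeping entirely (the word lists here are placeholders):

```perl
use strict;
use warnings;

# Placeholder stop-word hash and word list.
my %common = map { $_ => 1 } qw(the and or a);
my @words  = qw(the piano and the snow);

# Keep only the words that are not in the stop-word hash;
# the hash lookup is constant time per word.
my @keep = grep { !$common{$_} } @words;
print "@keep";    # piano snow
```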

    __________
    Systems development is like banging your head against a wall...
    It's usually very painful, but if you're persistent, you'll get through it.

Re: Creating Metadata from Text File
by poqui (Deacon) on Jul 20, 2007 at 18:00 UTC
    You mention the "limit i think for an oracle varchar2 table is at least 5000 bytes"; I think you mean that a single VARCHAR2 column can hold at most 4000 single-byte characters (fewer if using a multibyte character set).
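    If the truncation fallback is used, clipping to that 4000-byte ceiling is a one-liner; the sample string below is made up:

```perl
use strict;
use warnings;

# Stand-in for a long generated word list (22 chars x 300 = 6600).
my $metadata = 'apple orchard harvest ' x 300;

# A single-byte-charset VARCHAR2 column holds at most 4000 bytes,
# so clip anything longer before the INSERT.
my $MAX = 4000;
$metadata = substr( $metadata, 0, $MAX ) if length $metadata > $MAX;

print length $metadata;    # 4000
```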

    Does that mean you are going to store the entire word list you are generating in a single Oracle column on a table? That could be a very large table... and searching against a freeform column (which I assume is why you are putting this data into an Oracle table) is *extremely* slow.

    What is the format for your metadata, ultimately? Are you attempting to encode RDF or RDFS or something else?
      I'm indexing PDFs for a quick search of all of our PDF documentation. I will be sticking them into one column, but I hope none of the columns will come close to the 5000 max, because of all this "elimination" of common words and duplicate words. I may eventually even limit it to the first x number of words, as I feel that if you are looking for a specific document about, say, apples, the word apples is going to appear within the first couple of paragraphs at least.
      Do you have any other suggestions rather than going this route?
      Ultimately I'm just indexing the PDFs so that I can point back to them later. PDF is a good format for storing massive amounts of documentation; I'm just providing the ability to search all of them at once.
        Yours sounds like an adequate "brute force" method; but if you have the time, you should take a look at RDF (Resource Description Framework), which is the standard for metadata about documents and other things that a library might consider a "resource". It's being extended to encompass other things as well, like code and databases, but it started right where you are now.

        I suggest it because there are tools to search RDF for matching resources, based on subject and meaning, rather than just the appearance of certain words.
Re: Creating Metadata from Text File
by leocharre (Priest) on Jul 20, 2007 at 18:08 UTC

    Some of the things you mention sound a little fuzzy, vague.

    I am assuming you are trying to index content. That's not exactly metadata, but... sure. It *is* data about data.

    My present way of doing what sounds like your task is to actually break the contents up into a 'data' table, recording page, line, and content. Essentially I record *everything*. I have about 30 million or so data rows. The database was at 3 gigs, I think, last time I checked.

    Sounds crazy? Well, it works. I can do searches just fine. Saving all content may be overkill for you.

    At least it prevents me from making a judgement call about what is disposable data and what is not.

    mysql> describe data;
    +-------------+---------+------+-----+---------+-------+
    | Field       | Type    | Null | Key | Default | Extra |
    +-------------+---------+------+-----+---------+-------+
    | id          | int(10) |      | PRI | 0       |       |
    | page_number | int(10) |      | PRI | 0       |       |
    | line_number | int(10) |      | PRI | 0       |       |
    | content     | text    |      | MUL |         |       |
    +-------------+---------+------+-----+---------+-------+
    4 rows in set (0.04 sec)
    

    That's my little table. I'm sure it could be improved further.

      Storing pages and pages of data is not what I want to do. Per my application, if I have the following "document", I will get the following metadata:
      Mary had a little lamb,
      Its fleece was white as snow;
      And everywhere that Mary went,
      The lamb was sure to go.

      He followed her to school one day;
      That was against the rule;
      It made the children laugh and play;
      To see a lamb at school.
      And so the teacher turned it out,
      But still it lingered near,
      And waited patiently about
      Till Mary did appear.
      "Why does the lamb love Mary so?"
      The eager children cry;
      "Why, Mary loves the lamb, you know,"
      The teacher did reply.

      Removing common words, repeated words, and punctuation you get: Mary little lamb fleece white snow followed school day rules made children laugh play teacher turned near waited patiently appear love eager children reply. That's 155 characters with spaces, down from 460 characters with spaces.

        You mentioned yourself that you want to be able to search text. Matching that text against the database, you want to find where that file is on disk.

        Imagine the following example: you have the words to some lyrics in your mind... you remember that some part of the song goes "and the piano has been drinking not me". So you go to Google and do an exact search for that string.

        So, by your example, loosely applied:

        #!/usr/bin/perl -w
        use strict;

        my $song = "The piano has been drinking My necktie's asleep The combo went back to New York, and left me all alone The jukebox has to take a leak Have you noticed that the carpet needs a haircut? And the spotlight looks just like a prison break And the telephone's out of cigarettes As usual the balcony's on the make And the piano has been drinking, heavily The piano has been drinking And he's on the hard stuff tonight The piano has been drinking And you can't find your waitress Even with the Geiger counter And I guarantee you that she will hate you From the bottom of her glass And all of your friends remind you That you just can't get served without her The piano has been drinking The piano has been drinking And the lightman's blind in one eye And he can't see out of the other And the piano-tuner's got a hearing aid And he showed up with his mother And the piano has been drinking Without fear of contradiction I say The piano has been drinking Our Father who art in ? Hallowed by thy glass Thy kindom come, thy will be done On Earth as it is in the lounges Give us this day our daily splash Forgive us our hangovers As we forgive all those who continue to hangover against us And lead us not into temptation But deliver from evil and someone you must all ride home Because the piano has been drinking And he's your friend not mine Because the piano has been drinking And he's not my responsibility The bouncer is this Sumo wrestler Kinda cream puff casper milk toast And the owner is just a mental midget With the I.Q. of a fencepost I'm going down, hang onto me, I'm going down Watch me skate across an acre of linoleum I know I can do it, I'm in total control And the piano has been drinking And he's embarassing me The piano has been drinking, he raided his mini bar The piano has been drinking And the bar stools are all on fire And all the newspapers were just fooling And the ash-trays have retired And I've got a feeling that the piano has been drinking It's just a hunch The piano has been drinking and he's going to lose his lunch And the piano has been drinking Not me, not me, The piano has been drinking not me";

        my $word  = {};
        my $words = [];
        while ( $song =~ /[\W]*(\w+)[\W]*/g ) {
            unless ( exists $word->{$1} ) {
                push @$words, $1;
            }
            $word->{$1}++;
        }

        my $summary = join(' ', @$words);
        printf "original charcount: %s new charcount: %s\n words selected: %s\n\n",
            length $song, length $summary, $summary;

        Even before we take out stop words, etc., we get:

        original charcount: 2129
        new charcount: 1126
        
        words selected:
        The piano has been drinking My necktie s asleep combo
         went back to New York and left me all alone jukebox
         take a leak Have you noticed that the carpet needs 
        haircut And spotlight looks just like prison break 
        telephone out of cigarettes As usual balcony on make
        heavily he hard stuff tonight can t find your waitress
         Even with Geiger counter I guarantee she will hate
         From bottom her glass friends remind That get served
         without lightman blind in one eye see other tuner 
        got hearing aid showed up his mother Without fear 
        contradiction say Our Father who art Hallowed by thy 
        Thy kindom come be done On Earth as it is lounges 
        Give us this day our daily splash Forgive hangovers
         we forgive those continue hangover against lead 
        not into temptation But deliver from evil someone 
        must ride home Because friend mine my responsibility 
        bouncer Sumo wrestler Kinda cream puff casper milk
         toast owner mental midget With Q fencepost m going
         down hang onto Watch skate across an acre linoleum 
        know do total control embarassing raided mini bar 
        stools are fire newspapers were fooling ash trays 
        have retired ve feeling It hunch lose lunch Not
        

        So, as a human being, searching for this song against your database, I would not be able to find it quite so easily.

        What will be interacting with your data? A human being or a computer? I'm not trying to be a smart ass. But if you want a human being to search for things against your database, then you are wrong: pages and pages of text is what you want, and all you did in your example was turn 460 characters with spaces into 155 characters of junk.