Re: Creating Metadata from Text File
by FunkyMonk (Bishop) on Jul 20, 2007 at 16:52 UTC
I'd build a hash of all the words in the file and then remove the common words. Use each word as a key of the hash; its value doesn't matter.
open my $IN, "<", "myfile.txt" or die $!;

my %seen;
while ( <$IN> ) {
    $seen{$_}++ for split;
}

delete $seen{$_} for qw/all my common words/;

my @metadata = keys %seen;
Re: Creating Metadata from Text File
by graff (Chancellor) on Jul 21, 2007 at 01:03 UTC
So you're talking about building an index for a set of documents, and using a list of "stop words" so that only the "useful" words are indexed. Presumably, for each "useful" word, you want to keep track of all the documents that contain that word. As one of the other replies points out, a database server can be a good tool for this sort of thing, but the basic index could start out as just a list of rows containing two fields: "doc_id usable_word", to indicate that a particular useful word was found in a particular document.
Since you already know where your list of stop words (the non-useful words) comes from, you could start out like this:
#!/usr/bin/perl
use strict;

( @ARGV == 2 and -f $ARGV[0] and -f $ARGV[1] )
    or die "Usage: $0 stopword.list document.file\n";

my ( %stopwords, %docwords );
my ( $stopword_name, $document_name ) = @ARGV;

open( I, "<", $stopword_name ) or die "$stopword_name: $!";
while (<I>) {
    my @words = grep /^[a-z]+$/, map { lc() } split /\W+/;
    $stopwords{$_} = undef for ( @words );
}
close I;

open( I, "<", $document_name ) or die "$document_name: $!";
while (<I>) {
    for ( grep /^[a-z]+$/, map { lc() } split /\W+/ ) {
        $docwords{$_} = undef unless ( exists( $stopwords{$_} ) );
    }
}
close I;

for ( keys %docwords ) {
    print "$document_name\t$_\n";
}
If you run that on each document file and concatenate all the outputs into a simple two-column table, you can then provide a search tool that uses a simple query like:
SELECT distinct(doc_id) from doc_word_index where doc_word = ?
When a user wants all docs that contain "foo" or "bar" (or "baz" or ...), just keep adding " or doc_word = ?" clauses on that query. Other boolean queries ("this_word and that_word", etc) can be set up easily as well.
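A minimal sketch of assembling that multi-term query with placeholders (the table and column names doc_word_index / doc_id / doc_word are from above; the search terms and the commented-out DBI handle are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical search terms entered by the user
my @terms = qw/foo bar baz/;

# One "doc_word = ?" clause per term, OR'd together
my $sql = "SELECT distinct(doc_id) from doc_word_index where "
        . join( " OR ", ("doc_word = ?") x @terms );

print "$sql\n";

# With a real DBI handle you would then bind one term per placeholder:
# my $doc_ids = $dbh->selectcol_arrayref( $sql, undef, @terms );
```

Using placeholders this way keeps the query safe no matter what the user types, since the terms are never interpolated into the SQL text.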
There are plenty more bells and whistles you can add as you come up with them... things like "stemming" (so a doc that contains only "blooming" or "blooms" or "bloomed" will be found when the search term is "bloom"), "relevance" (sort the returned list based on counting the number of distinct search terms per doc), and so on.
(update -- forgot to mention: When building a simple table like that, don't forget to tell the database system to create an index on the "doc_word" column, so that the queries can be answered quickly, without having to do a full-table scan every time.)
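For example, the DDL might look something like this (a sketch only -- exact types and syntax vary by database, and the index name is made up):

```sql
CREATE TABLE doc_word_index (
    doc_id   VARCHAR2(255),
    doc_word VARCHAR2(255)
);

CREATE INDEX doc_word_idx ON doc_word_index (doc_word);
```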
I love you, but now my weekend is ruined because I finally understand how to do this project... I'll keep you posted!! :)
SELECT count(doc_id),doc_word from doc_word_index group by doc_word
order by count(doc_id) desc limit 20
If there are words that occur in all 5000 docs, you might as well add those to your stop list. (If the output of that particular query shows all 20 words with "5000", set the limit higher, to see how many words there are that occur in all documents.)
In fact, if you start out by indexing all words, you can build your own stop list this way, and it might be more effective than just assuming that someone else's list of "most frequent words" is appropriate for your particular set of docs. You might also decide that the threshold for inclusion in the stop list is something like "occurs in 90% of docs", as opposed to "occurs in all docs". (The "document frequency" of words -- how many docs contain a given word -- can be a useful metric for assigning weights to search terms when you get into ranking the "hits" according to "relevance".)
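A sketch of that "occurs in 90% of docs" cutoff as a query, assuming the 5000-doc collection and the two-column table described above:

```sql
-- words that occur in at least 90% of the 5000 docs
-- (candidates for the stop list)
SELECT doc_word
  FROM doc_word_index
 GROUP BY doc_word
HAVING count(distinct doc_id) >= 4500;
```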
Note that the "most frequent words" list you cited includes things like "number", "sound", "water", "air", "father", "mother", "country", etc, but these might occur in only some of your docs -- someone might have a valid expectation that they would be useful as search terms, and it would be wrong not to index them.
Re: Creating Metadata from Text File
by EvanK (Chaplain) on Jul 20, 2007 at 16:37 UTC
You could slurp in said files (read everything, ignoring newlines/carriage returns), then split on whitespace:
# slurp file contents
my $contents;
{
    local $/ = undef;
    open( my $handle, '<', "filename" ) or die "error: $!";
    $contents = <$handle>;
    close $handle;
}

# split into array on consecutive whitespace
my @words = split /\s+/, $contents;
As far as removing the common words goes, you could use indexes from List::MoreUtils to get the indices of the common words and remove them, OR get the indices of the uncommon words and copy them into another array. One of the more experienced monks may have a better solution, though.
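A small sketch of that filtering step. The stop-word list here is made up; List::MoreUtils's indexes function gives you the index-based variant, but a plain grep on the words themselves is enough for the second approach:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stop-word list
my %stop = map { $_ => 1 } qw/the a and of/;

my @words = qw/the quick brown fox and the lazy dog/;

# keep only the words that are not in the stop list
my @keywords = grep { !$stop{ lc $_ } } @words;

print "@keywords\n";    # quick brown fox lazy dog

# index-based variant with List::MoreUtils (not core, must be installed):
# use List::MoreUtils qw(indexes);
# my @keep_idx = indexes { !$stop{ lc $_ } } @words;
# my @keywords = @words[@keep_idx];
```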
__________
Systems development is like banging your head against a wall...
It's usually very painful, but if you're persistent, you'll get through it.
Re: Creating Metadata from Text File
by poqui (Deacon) on Jul 20, 2007 at 18:00 UTC
You mention the "limit i think for an oracle varchar2 table is at least 5000 bytes"; I think you mean that a single Varchar2 column can hold about 4000 single-byte characters (fewer if using multibyte).
Does that mean you are going to store the entire word list you are generating in a single Oracle column on a table? That could be a very large table... and searching against a freeform column (which is why I assume you are putting this data into an Oracle table) is *extremely* slow.
What is the format for your metadata, ultimately? Are you attempting to encode RDF or RDFS or something else?
I'm indexing PDFs for a quick search of all of our PDF documentation. I will be sticking them into one column, but I hope none of the columns will come close to the max because of all this "elimination" of common words and duplicate words. I may eventually even just limit it to the first x number of words, as I feel that if you are looking for a specific document about, say, apples, the word apples is going to appear within the first couple of paragraphs at least.
Do you have any other suggestions rather than going this route?
Ultimately I'm just indexing the PDFs so that I can point back to them later. PDF is a good format for storing massive amounts of documentation; I'm just providing the ability to search all of them at once.
Yours sounds like an adequate "brute force" method; but if you have the time, you should take a look at RDF (Resource Description Framework), which is the standard for metadata about documents and other things that a library might consider a "Resource". It's being extended to encompass other things as well, like code and databases, but it started right where you are now.
I suggest it because there are tools to search RDF for matching resources, based on subject and meaning, rather than just the appearance of certain words.
Re: Creating Metadata from Text File
by leocharre (Priest) on Jul 20, 2007 at 18:08 UTC
Some of the things you mention sound a little fuzzy, vague.
I am assuming you are trying to index content.
That's not exactly metadata, but... sure. It *is* data about data.
My present way to go about what sounds like your task is to actually break the contents up into a 'data' table, recording page, line, and content. Essentially I record *everything*. I have about 30 million or so data rows. The database was at 3 gigs, I think, last time I checked.
Sounds crazy? Well, it works. I can do searches just fine.
Saving all content may be overkill for you.
At least it prevents me from making a judgement call about what is disposable data and what is not.
mysql> describe data;
+-------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| id | int(10) | | PRI | 0 | |
| page_number | int(10) | | PRI | 0 | |
| line_number | int(10) | | PRI | 0 | |
| content | text | | MUL | | |
+-------------+---------+------+-----+---------+-------+
4 rows in set (0.04 sec)
That's my little table. I'm sure it could be improved further.
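Assuming the MUL key on the content column is a MySQL FULLTEXT index (a guess on my part -- the describe output doesn't say), a search against that table might look something like:

```sql
SELECT id, page_number, line_number
  FROM data
 WHERE MATCH(content) AGAINST ('piano drinking');
```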
Storing pages and pages of data is not what I want to do. Per my application, if I have the following "document", I will get the following metadata:
Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.
He followed her to school one day; That was against the rule;
It made the children laugh and play;
To see a lamb at school.
And so the teacher turned it out,
But still it lingered near,
And waited patiently about
Till Mary did appear.
"Why does the lamb love Mary so?"
The eager children cry;
"Why, Mary loves the lamb, you know,"
The teacher did reply.
Removing common words, repeated words, and punctuation, you get:
Mary little lamb fleece white snow followed school day rules
made children laugh play teacher turned near
waited patiently appear love eager children reply.
155 characters with spaces, down from 460 characters with spaces.
You mentioned yourself that you want to be able to search text. Matching that text against the database, you want to find where that file is on disk.
Imagine the following example: you have the words to some lyrics in your mind... you remember some part of the song goes... "and the piano has been drinking not me". So you go to Google and do an exact search for that string.
So, by your example, loosely applied:
#!/usr/bin/perl -w
use strict;
my $song = "The piano has been drinking
My necktie's asleep
The combo went back to New York, and left me all alone
The jukebox has to take a leak
Have you noticed that the carpet needs a haircut?
And the spotlight looks just like a prison break
And the telephone's out of cigarettes
As usual the balcony's on the make
And the piano has been drinking, heavily
The piano has been drinking
And he's on the hard stuff tonight
The piano has been drinking
And you can't find your waitress
Even with the Geiger counter
And I guarantee you that she will hate you
From the bottom of her glass
And all of your friends remind you
That you just can't get served without her
The piano has been drinking
The piano has been drinking
And the lightman's blind in one eye
And he can't see out of the other
And the piano-tuner's got a hearing aid
And he showed up with his mother
And the piano has been drinking
Without fear of contradiction I say
The piano has been drinking
Our Father who art in ?
Hallowed by thy glass
Thy kindom come, thy will be done
On Earth as it is in the lounges
Give us this day our daily splash
Forgive us our hangovers
As we forgive all those who continue to hangover against us
And lead us not into temptation
But deliver from evil and someone you must all ride home
Because the piano has been drinking
And he's your friend not mine
Because the piano has been drinking
And he's not my responsibility
The bouncer is this Sumo wrestler
Kinda cream puff casper milk toast
And the owner is just a mental midget
With the I.Q. of a fencepost
I'm going down, hang onto me, I'm going down
Watch me skate across an acre of linoleum
I know I can do it, I'm in total control
And the piano has been drinking
And he's embarassing me
The piano has been drinking, he raided his mini bar
The piano has been drinking
And the bar stools are all on fire
And all the newspapers were just fooling
And the ash-trays have retired
And I've got a feeling that the piano has been drinking
It's just a hunch
The piano has been drinking and he's going to lose his lunch
And the piano has been drinking
Not me, not me, The piano has been drinking not me";
my $word  = {};
my $words = [];

while ( $song =~ /[\W]*(\w+)[\W]*/g ) {
    unless ( exists $word->{$1} ) {
        push @$words, $1;
    }
    $word->{$1}++;
}

my $summary = join( ' ', @$words );

printf "original charcount: %s
new charcount: %s\n
words selected:
%s\n\n", length $song, length $summary, $summary;
Even before we take out stop words, etc, we get:
original charcount: 2129
new charcount: 1126
words selected:
The piano has been drinking My necktie s asleep combo
went back to New York and left me all alone jukebox
take a leak Have you noticed that the carpet needs
haircut And spotlight looks just like prison break
telephone out of cigarettes As usual balcony on make
heavily he hard stuff tonight can t find your waitress
Even with Geiger counter I guarantee she will hate
From bottom her glass friends remind That get served
without lightman blind in one eye see other tuner
got hearing aid showed up his mother Without fear
contradiction say Our Father who art Hallowed by thy
Thy kindom come be done On Earth as it is lounges
Give us this day our daily splash Forgive hangovers
we forgive those continue hangover against lead
not into temptation But deliver from evil someone
must ride home Because friend mine my responsibility
bouncer Sumo wrestler Kinda cream puff casper milk
toast owner mental midget With Q fencepost m going
down hang onto Watch skate across an acre linoleum
know do total control embarassing raided mini bar
stools are fire newspapers were fooling ash trays
have retired ve feeling It hunch lose lunch Not
So, as a human being searching for this song against your database, I would not be able to find it quite so easily.
What will be interacting with your data? A human being or a computer? I'm not trying to be a smart ass.
But if you want a human being to search for things against your database, then you are wrong; pages and pages of text is what you want -- and all you did in your example was turn 460 characters with spaces into 155 characters of junk.