
Storing pages and pages of data is not what I want to do. In my application, a "document" like the following produces the metadata shown after it.
Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.

He followed her to school one day;
That was against the rule;
It made the children laugh and play;
To see a lamb at school.
And so the teacher turned it out,
But still it lingered near,
And waited patiently about
Till Mary did appear.
"Why does the lamb love Mary so?"
The eager children cry;
"Why, Mary loves the lamb, you know,"
The teacher did reply.

Removing common words, repeated words, and punctuation, you get: Mary little lamb fleece white snow followed school day rules made children laugh play teacher turned near waited patiently appear love eager children reply. That is 155 characters with spaces, down from 460 characters with spaces.
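A minimal sketch of this kind of reduction in Perl follows. The stop-word list `%stop` and the helper name `reduce_text` are made up for illustration and are not taken from the application described here, so its output will not match the word list above exactly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stop-word list; a real application would use a much longer one.
my %stop = map { $_ => 1 } qw(
    a and as at but did does had he her it its of one so that the to was
);

# Return the unique non-stop words of $text, in order of first appearance.
# Matching on \w+ also discards the punctuation.
sub reduce_text {
    my ($text) = @_;
    my ( %seen, @kept );
    for my $w ( $text =~ /(\w+)/g ) {
        next if $stop{ lc $w };      # drop common words
        next if $seen{ lc $w }++;    # drop repeats (case-insensitive)
        push @kept, $w;
    }
    return join ' ', @kept;
}

my $doc = "Mary had a little lamb, Its fleece was white as snow; "
        . "And everywhere that Mary went, The lamb was sure to go.";
my $summary = reduce_text($doc);

printf "%d characters reduced to %d: %s\n",
    length $doc, length $summary, $summary;
```

With this particular stop-word list, the first verse is reduced to "Mary little lamb fleece white snow everywhere went sure go". A different list would keep or drop different words.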

Re^3: Creating Metadata from Text File
by leocharre (Priest) on Jul 23, 2007 at 13:47 UTC

    You said yourself that you want to be able to search text: match that text against the database and find where the file is on disk.

    Imagine the following example: you have some lyrics in your mind, and you remember that part of the song goes "and the piano has been drinking not me". So you go to Google and do an exact search for that string.

    So, by your example, loosely applied:

    #!/usr/bin/perl -w
    use strict;

    my $song = "The piano has been drinking
    My necktie's asleep
    The combo went back to New York, and left me all alone
    The jukebox has to take a leak
    Have you noticed that the carpet needs a haircut?
    And the spotlight looks just like a prison break
    And the telephone's out of cigarettes
    As usual the balcony's on the make
    And the piano has been drinking, heavily
    The piano has been drinking
    And he's on the hard stuff tonight
    The piano has been drinking
    And you can't find your waitress
    Even with the Geiger counter
    And I guarantee you that she will hate you
    From the bottom of her glass
    And all of your friends remind you
    That you just can't get served without her
    The piano has been drinking
    The piano has been drinking
    And the lightman's blind in one eye
    And he can't see out of the other
    And the piano-tuner's got a hearing aid
    And he showed up with his mother
    And the piano has been drinking
    Without fear of contradiction I say
    The piano has been drinking
    Our Father who art in ?
    Hallowed by thy glass
    Thy kindom come, thy will be done
    On Earth as it is in the lounges
    Give us this day our daily splash
    Forgive us our hangovers
    As we forgive all those who continue to hangover against us
    And lead us not into temptation
    But deliver from evil and someone you must all ride home
    Because the piano has been drinking
    And he's your friend not mine
    Because the piano has been drinking
    And he's not my responsibility
    The bouncer is this Sumo wrestler
    Kinda cream puff casper milk toast
    And the owner is just a mental midget
    With the I.Q. of a fencepost
    I'm going down, hang onto me, I'm going down
    Watch me skate across an acre of linoleum
    I know I can do it, I'm in total control
    And the piano has been drinking
    And he's embarassing me
    The piano has been drinking, he raided his mini bar
    The piano has been drinking
    And the bar stools are all on fire
    And all the newspapers were just fooling
    And the ash-trays have retired
    And I've got a feeling that the piano has been drinking
    It's just a hunch
    The piano has been drinking and he's going to lose his lunch
    And the piano has been drinking
    Not me, not me, The piano has been drinking not me";

    my $word  = {};    # word => number of occurrences
    my $words = [];    # unique words, in order of first appearance

    while ( $song =~ /[\W]*(\w+)[\W]*/g ) {
        unless ( exists $word->{$1} ) {
            push @$words, $1;
        }
        $word->{$1}++;
    }

    my $summary = join( ' ', @$words );

    printf "original charcount: %s new charcount: %s\n words selected: %s\n\n",
        length $song, length $summary, $summary;

    Even before we take out stop words, etc., we get:

    original charcount: 2129
    new charcount: 1126
    
    words selected:
    The piano has been drinking My necktie s asleep combo
     went back to New York and left me all alone jukebox
     take a leak Have you noticed that the carpet needs 
    haircut And spotlight looks just like prison break 
    telephone out of cigarettes As usual balcony on make
    heavily he hard stuff tonight can t find your waitress
     Even with Geiger counter I guarantee she will hate
     From bottom her glass friends remind That get served
     without lightman blind in one eye see other tuner 
    got hearing aid showed up his mother Without fear 
    contradiction say Our Father who art Hallowed by thy 
    Thy kindom come be done On Earth as it is lounges 
    Give us this day our daily splash Forgive hangovers
     we forgive those continue hangover against lead 
    not into temptation But deliver from evil someone 
    must ride home Because friend mine my responsibility 
    bouncer Sumo wrestler Kinda cream puff casper milk
     toast owner mental midget With Q fencepost m going
     down hang onto Watch skate across an acre linoleum 
    know do total control embarassing raided mini bar 
    stools are fire newspapers were fooling ash trays 
    have retired ve feeling It hunch lose lunch Not
    

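    So, as a human being searching your database for this song, I would not be able to find it quite so easily: the summary keeps each word only once, so the phrase "and the piano has been drinking not me" no longer appears anywhere in it.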
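To make that concrete, here is a small sketch that runs the same exact-phrase search against a piece of the full text and against its reduced form. The helper names `unique_words` and `has_phrase` are invented for this illustration, and `$excerpt` is a shortened stand-in for the lyrics (a few lines from the start and end of the song), not the whole thing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the unique words of $text in order of first appearance,
# the same reduction the script above applies to the song.
sub unique_words {
    my ($text) = @_;
    my ( %seen, @kept );
    for my $w ( $text =~ /(\w+)/g ) {
        push @kept, $w unless $seen{$w}++;
    }
    return join ' ', @kept;
}

# Return 1 if $phrase occurs verbatim in $text (ignoring case), else 0.
sub has_phrase {
    my ( $text, $phrase ) = @_;
    return index( lc $text, lc $phrase ) >= 0 ? 1 : 0;
}

# Shortened stand-in for the lyrics.
my $excerpt = "The piano has been drinking My necktie's asleep "
            . "And the piano has been drinking Not me, not me";
my $phrase  = "and the piano has been drinking not me";
my $summary = unique_words($excerpt);

printf "full text: %s\n",
    has_phrase( $excerpt, $phrase ) ? "found" : "not found";
printf "summary:   %s\n",
    has_phrase( $summary, $phrase ) ? "found" : "not found";
```

The phrase is found in the excerpt but not in the summary, because the summary keeps only the first occurrence of each word and so loses the word order a phrase search depends on.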

    What will be interacting with your data? A human being or a computer? I'm not trying to be a smart ass. But if you want a human being to search for things against your database, then you are wrong: pages and pages of text is exactly what you want, and all you did in your example was turn 460 characters with spaces into 155 characters of junk.