You said yourself that you want to be able to search text: match that text against the database, then find where the file is on disk.

Imagine the following example: you have some lyrics stuck in your mind. You remember that some part of the song goes "and the piano has been drinking not me", so you go to Google and do an exact search for that string.

So, by your example, loosely applied:

#!/usr/bin/perl -w use strict; my $song = "The piano has been drinking My necktie's asleep The combo went back to New York, and left me all alone The jukebox has to take a leak Have you noticed that the carpet needs a haircut? And the spotlight looks just like a prison break And the telephone's out of cigarettes As usual the balcony's on the make And the piano has been drinking, heavily The piano has been drinking And he's on the hard stuff tonight The piano has been drinking And you can't find your waitress Even with the Geiger counter And I guarantee you that she will hate you From the bottom of her glass And all of your friends remind you That you just can't get served without her The piano has been drinking The piano has been drinking And the lightman's blind in one eye And he can't see out of the other And the piano-tuner's got a hearing aid And he showed up with his mother And the piano has been drinking Without fear of contradiction I say The piano has been drinking Our Father who art in ? Hallowed by thy glass Thy kindom come, thy will be done On Earth as it is in the lounges Give us this day our daily splash Forgive us our hangovers As we forgive all those who continue to hangover against us And lead us not into temptation But deliver from evil and someone you must all ride home Because the piano has been drinking And he's your friend not mine Because the piano has been drinking And he's not my responsibility The bouncer is this Sumo wrestler Kinda cream puff casper milk toast And the owner is just a mental midget With the I.Q. 
of a fencepost I'm going down, hang onto me, I'm going down Watch me skate across an acre of linoleum I know I can do it, I'm in total control And the piano has been drinking And he's embarassing me The piano has been drinking, he raided his mini bar The piano has been drinking And the bar stools are all on fire And all the newspapers were just fooling And the ash-trays have retired And I've got a feeling that the piano has been drinking It's just a hunch The piano has been drinking and he's going to lose his lunch And the piano has been drinking Not me, not me, The piano has been drinking not me"; my $word={}; my $words=[]; while( $song=~/[\W]*(\w+)[\W]*/g ){ unless(exists $word->{$1}){ push @$words, $1; } $word->{$1}++; } my $summary = join(' ', @$words); printf "original charcount: %s new charcount: %s\n words selected: %s\n\n", length $song, length $summary, $summary;

Even before we take out stop words, etc., we get:

original charcount: 2129
new charcount: 1126

words selected:
The piano has been drinking My necktie s asleep combo
 went back to New York and left me all alone jukebox
 take a leak Have you noticed that the carpet needs 
haircut And spotlight looks just like prison break 
telephone out of cigarettes As usual balcony on make
heavily he hard stuff tonight can t find your waitress
 Even with Geiger counter I guarantee she will hate
 From bottom her glass friends remind That get served
 without lightman blind in one eye see other tuner 
got hearing aid showed up his mother Without fear 
contradiction say Our Father who art Hallowed by thy 
Thy kindom come be done On Earth as it is lounges 
Give us this day our daily splash Forgive hangovers
 we forgive those continue hangover against lead 
not into temptation But deliver from evil someone 
must ride home Because friend mine my responsibility 
bouncer Sumo wrestler Kinda cream puff casper milk
 toast owner mental midget With Q fencepost m going
 down hang onto Watch skate across an acre linoleum 
know do total control embarassing raided mini bar 
stools are fire newspapers were fooling ash trays 
have retired ve feeling It hunch lose lunch Not

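So, as a human being searching your database for this song, I would not be able to find it quite so easily: the phrase I remember is no longer in the summary, because every repeated word was dropped after its first occurrence.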
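To make that concrete, here is a minimal sketch of the same first-occurrence filter followed by an exact-phrase check. The sample text is a short stand-in I made up (not the song), and the names summarize and contains_phrase are hypothetical helpers, not anything from your code. The regex is simplified to /(\w+)/g, which captures the same words as the pattern in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep only the first occurrence of each word (case-sensitive),
# the same idea as the script above.
sub summarize {
    my ($text) = @_;
    my (%seen, @words);
    while ( $text =~ /(\w+)/g ) {
        push @words, $1 unless $seen{$1}++;
    }
    return join ' ', @words;
}

# Case-insensitive exact-phrase test; \Q...\E quotes regex metacharacters.
sub contains_phrase {
    my ($text, $phrase) = @_;
    return $text =~ /\Q$phrase\E/i ? 1 : 0;
}

# Stand-in text: an assumption for illustration only.
my $text    = "the cat sat on the mat and the dog sat on the cat";
my $summary = summarize($text);
my $phrase  = "the dog sat";

print "summary: $summary\n";
printf "found in full text: %s\n", contains_phrase($text, $phrase)    ? "yes" : "no";
printf "found in summary:   %s\n", contains_phrase($summary, $phrase) ? "yes" : "no";
```

The phrase is found in the full text but not in the summary, because "the" and "sat" had already been seen and were dropped from the second half of the sentence.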

What will be interacting with your data: a human being or a computer? I'm not trying to be a smart ass. But if you want a human being to search for things in your database, then you are wrong; pages and pages of text are exactly what you want, and all you did in your example was turn 460 characters (with spaces) into 155 characters of junk.
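As a rough sketch of the alternative, assuming you keep the full text of each file and map it back to its path: the paths, the sample texts, and the find_files helper below are all hypothetical, and a real setup would more likely sit behind a database or a full-text index than an in-memory hash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical index: file path => full, unreduced text of that file.
my %full_text = (
    '/data/songs/a.txt' => "the cat sat on the mat and the dog sat on the cat",
    '/data/songs/b.txt' => "a bird sang in the tree while the dog slept",
);

# Return the sorted list of paths whose full text contains the phrase.
sub find_files {
    my ($index, $phrase) = @_;
    my @hits = grep { $index->{$_} =~ /\Q$phrase\E/i } keys %$index;
    @hits = sort @hits;
    return @hits;
}

print "$_\n" for find_files(\%full_text, "the dog sat");
```

Because the text is stored whole, the phrase a person remembers can be matched as typed, and the hash key gives you the location on disk.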


In reply to Re^3: Creating Metadata from Text File by leocharre
in thread Creating Metadata from Text File by Trihedralguy
