in reply to Creating Metadata from Text File

Some of the things you mention sound a little vague.

I am assuming you are trying to index content. That's not exactly metadata, but sure, it *is* data about data.

My current approach to what sounds like your task is to break the contents up into a 'data' table that records page, line, and content. Essentially I record *everything*. I have about 30 million data rows. The database was at about 3 GB, I think, last time I checked.

Sounds crazy? Well, it works. I can do searches just fine. Saving all content may be overkill for you.

At least it prevents me from making a judgement call about what is disposable data and what is not.

mysql> describe data;
+-------------+---------+------+-----+---------+-------+
| Field       | Type    | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| id          | int(10) |      | PRI | 0       |       |
| page_number | int(10) |      | PRI | 0       |       |
| line_number | int(10) |      | PRI | 0       |       |
| content     | text    |      | MUL |         |       |
+-------------+---------+------+-----+---------+-------+
4 rows in set (0.04 sec)

That's my little table. I'm sure it could be improved further.
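
A minimal sketch of how a document might be broken into rows for a table like this. The form-feed page separator, the sample text, and the name rows_for_document are assumptions for illustration only; the `id` column is taken here to identify the document.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Break one document into [id, page_number, line_number, content] rows,
# mirroring the columns of the `data` table.
# ASSUMPTION: pages are separated by form feeds (\f); adjust for your files.
sub rows_for_document {
    my ($id, $text) = @_;
    my @rows;
    my $page_number = 0;
    for my $page (split /\f/, $text) {
        $page_number++;
        my $line_number = 0;
        for my $line (split /\n/, $page) {
            $line_number++;
            push @rows, [ $id, $page_number, $line_number, $line ];
        }
    }
    return @rows;
}

# Hypothetical two-page document.
my $text = "first line\nsecond line\n\fpage two, line one\n";
for my $row (rows_for_document(1, $text)) {
    print join("\t", @$row), "\n";
}
```

Each row could then be handed to whatever insert routine you use; the database side is left out on purpose.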

Re^2: Creating Metadata from Text File
by Trihedralguy (Pilgrim) on Jul 20, 2007 at 18:59 UTC
    Storing pages and pages of data is not what I want to do. In my application, given the following "document", I would get the metadata shown after it.
    Mary had a little lamb,
    Its fleece was white as snow;
    And everywhere that Mary went,
    The lamb was sure to go.

    He followed her to school one day;
    That was against the rule;
    It made the children laugh and play;
    To see a lamb at school.
    And so the teacher turned it out,
    But still it lingered near,
    And waited patiently about
    Till Mary did appear.
    "Why does the lamb love Mary so?"
    The eager children cry;
    "Why, Mary loves the lamb, you know,"
    The teacher did reply.

    Removing common words, repeated words, and punctuation, you get: Mary little lamb fleece white snow followed school day rules made children laugh play teacher turned near waited patiently appear love eager children reply. That is 155 characters with spaces, down from 460 characters with spaces.
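
A rough sketch of the reduction described above, assuming a tiny hand-picked stop-word list and lowercased output. Both are illustrative choices, as is the name keywords; a real list would be much longer and would change which words survive.

```perl
use strict;
use warnings;

# ASSUMPTION: a tiny hand-picked stop-word list, for illustration only.
my %stop = map { $_ => 1 } qw(
    a and as at but did had he her it its she so that the to was
);

# Lowercase the text, strip punctuation, drop stop words,
# and keep only the first occurrence of each remaining word.
sub keywords {
    my ($text) = @_;
    my (%seen, @keep);
    my @words = map { lc $_ } ($text =~ /([A-Za-z']+)/g);
    for my $word (@words) {
        next if $stop{$word};      # common word
        next if $seen{$word}++;    # already kept once
        push @keep, $word;
    }
    return @keep;
}

my $verse = "Mary had a little lamb,\nIts fleece was white as snow;";
print join(' ', keywords($verse)), "\n";
```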

      You mentioned yourself that you want to be able to search text. Matching that text against the database, you want to find where that file is on disk.

      Imagine the following example: you have some words from a lyric in your mind. You remember that some part of the song goes "and the piano has been drinking not me". So you go to Google and do an exact search for that string.

      So, by your example, loosely applied:

      #!/usr/bin/perl -w use strict; my $song = "The piano has been drinking My necktie's asleep The combo went back to New York, and left me all alone The jukebox has to take a leak Have you noticed that the carpet needs a haircut? And the spotlight looks just like a prison break And the telephone's out of cigarettes As usual the balcony's on the make And the piano has been drinking, heavily The piano has been drinking And he's on the hard stuff tonight The piano has been drinking And you can't find your waitress Even with the Geiger counter And I guarantee you that she will hate you From the bottom of her glass And all of your friends remind you That you just can't get served without her The piano has been drinking The piano has been drinking And the lightman's blind in one eye And he can't see out of the other And the piano-tuner's got a hearing aid And he showed up with his mother And the piano has been drinking Without fear of contradiction I say The piano has been drinking Our Father who art in ? Hallowed by thy glass Thy kindom come, thy will be done On Earth as it is in the lounges Give us this day our daily splash Forgive us our hangovers As we forgive all those who continue to hangover against us And lead us not into temptation But deliver from evil and someone you must all ride home Because the piano has been drinking And he's your friend not mine Because the piano has been drinking And he's not my responsibility The bouncer is this Sumo wrestler Kinda cream puff casper milk toast And the owner is just a mental midget With the I.Q. 
of a fencepost I'm going down, hang onto me, I'm going down Watch me skate across an acre of linoleum I know I can do it, I'm in total control And the piano has been drinking And he's embarassing me The piano has been drinking, he raided his mini bar The piano has been drinking And the bar stools are all on fire And all the newspapers were just fooling And the ash-trays have retired And I've got a feeling that the piano has been drinking It's just a hunch The piano has been drinking and he's going to lose his lunch And the piano has been drinking Not me, not me, The piano has been drinking not me"; my $word={}; my $words=[]; while( $song=~/[\W]*(\w+)[\W]*/g ){ unless(exists $word->{$1}){ push @$words, $1; } $word->{$1}++; } my $summary = join(' ', @$words); printf "original charcount: %s new charcount: %s\n words selected: %s\n\n", length $song, length $summary, $summary;

      Even before we take out stop words, etc, we get:

      original charcount: 2129
      new charcount: 1126
      
      words selected:
      The piano has been drinking My necktie s asleep combo
       went back to New York and left me all alone jukebox
       take a leak Have you noticed that the carpet needs 
      haircut And spotlight looks just like prison break 
      telephone out of cigarettes As usual balcony on make
      heavily he hard stuff tonight can t find your waitress
       Even with Geiger counter I guarantee she will hate
       From bottom her glass friends remind That get served
       without lightman blind in one eye see other tuner 
      got hearing aid showed up his mother Without fear 
      contradiction say Our Father who art Hallowed by thy 
      Thy kindom come be done On Earth as it is lounges 
      Give us this day our daily splash Forgive hangovers
       we forgive those continue hangover against lead 
      not into temptation But deliver from evil someone 
      must ride home Because friend mine my responsibility 
      bouncer Sumo wrestler Kinda cream puff casper milk
       toast owner mental midget With Q fencepost m going
       down hang onto Watch skate across an acre linoleum 
      know do total control embarassing raided mini bar 
      stools are fire newspapers were fooling ash trays 
      have retired ve feeling It hunch lose lunch Not
      

      So, as a human being, searching for this song against your database, I would not be able to find it quite so easily.
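
To make that concrete, here is a small sketch using a short stand-in text rather than the lyric. The text, the phrase, and the name first_occurrences are made up for illustration; the dedup step follows the same first-occurrence idea as the loop above.

```perl
use strict;
use warnings;

# Keep only the first occurrence of each word (case-sensitive).
sub first_occurrences {
    my ($text) = @_;
    my (%seen, @words);
    for my $w ($text =~ /(\w+)/g) {
        push @words, $w unless $seen{$w}++;
    }
    return join ' ', @words;
}

# Hypothetical stand-in text with a repeated phrase, like a song refrain.
my $full    = "the cat sat on the mat and the cat sat on the hat";
my $summary = first_occurrences($full);
my $phrase  = "sat on the hat";

# An exact-phrase search works on the full text but not on the summary,
# because the repeated words were dropped and the word order is broken.
printf "full text: %s\n", index($full,    $phrase) >= 0 ? "found" : "not found";
printf "summary:   %s\n", index($summary, $phrase) >= 0 ? "found" : "not found";
```

The phrase a person remembers is usually the repeated part, which is exactly what the reduction throws away.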

      What will be interacting with your data? A human being or a computer? I'm not trying to be a smart ass. But if you want a human being to search for things against your database, then you are wrong: pages and pages of text is what you want, and all you did in your example was turn 460 characters with spaces into 155 characters of junk.