PerlMonks  

Re: Creating Metadata from Text File

by poqui (Deacon)
on Jul 20, 2007 at 18:00 UTC ( [id://627839]=note )


in reply to Creating Metadata from Text File

You mention that the "limit i think for an oracle varchar2 table is at least 5000 bytes"; I think you mean that a single VARCHAR2 column can hold up to 4000 bytes, which is about 4000 single-byte characters (fewer if you are using a multibyte character set).
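To make the byte-versus-character point concrete, here is a minimal sketch in Python (used only for illustration) that checks whether a string fits once encoded. The 4000-byte figure is the classic VARCHAR2 column limit mentioned above; the function name and the UTF-8 encoding are assumptions, so substitute whatever character set your database actually uses.

```python
# Assumption: the classic 4000-byte limit for a VARCHAR2 column.
VARCHAR2_MAX_BYTES = 4000

def fits_in_varchar2(text, encoding="utf-8", limit=VARCHAR2_MAX_BYTES):
    """Return True if text, once encoded, occupies at most `limit` bytes."""
    return len(text.encode(encoding)) <= limit

ascii_text = "a" * 4000          # 1 byte per character in UTF-8
accented_text = "\u00e9" * 4000  # e-acute: 2 bytes per character in UTF-8

print(fits_in_varchar2(ascii_text))     # 4000 bytes
print(fits_in_varchar2(accented_text))  # 8000 bytes
```

The same 4000 characters fit in one case and not in the other, which is why the limit is better thought of in bytes.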

Does that mean you are going to store the entire word list you are generating in a single column of an Oracle table? That could be a very large table... and searching against a freeform column (which is why I assume you are putting this data into an Oracle table) is *extremely* slow.

What is the format for your metadata, ultimately? Are you attempting to encode RDF or RDFS or something else?

Replies are listed 'Best First'.
Re^2: Creating Metadata from Text File
by Trihedralguy (Pilgrim) on Jul 20, 2007 at 18:47 UTC
    I'm indexing PDFs for a quick search of all of our PDF documentation. I will be sticking them into one column, but I hope none of the columns will come close to the 5000 max because I'm doing all this "elimination" of common words and duplicate words. I may eventually even just limit it to the first x words, as I feel that if you are looking for a specific document about, say, apples, the word "apples" is going to appear within the first couple of paragraphs at least.
    Do you have any other suggestions rather than going this route?
    Ultimately I'm just indexing the PDFs so that I can point back to them later. PDF is a good format for storing massive amounts of documentation; I'm just providing the ability to search all of them at once.
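
The elimination of common and duplicate words described above, with an optional cap on the number of words, could be sketched roughly like this. It is written in Python purely for illustration; the stop-word list, the function name, and the 4000-byte budget are all assumptions, and the text is assumed to have already been extracted from the PDF.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for"}

def extract_keywords(text, max_words=None, max_bytes=4000):
    """Lowercase, drop stop words and duplicates, keep first-seen order."""
    seen = set()
    keywords = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in STOP_WORDS or word in seen:
            continue
        seen.add(word)
        keywords.append(word)
        if max_words is not None and len(keywords) >= max_words:
            break
    # Drop words from the end until the joined list fits the byte budget.
    while keywords and len(" ".join(keywords).encode("utf-8")) > max_bytes:
        keywords.pop()
    return " ".join(keywords)

sample = "The apples are in the orchard and the apples are ripe"
print(extract_keywords(sample))  # apples orchard ripe
```

Keeping first-seen order means a cap on the word count favors words from the start of the document, which matches the "first couple of paragraphs" idea.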
      Yours sounds like an adequate "brute force" method, but if you have the time, you should take a look at RDF (Resource Description Framework), which is the standard for metadata about documents and other things that a library might consider a "Resource". It's being extended to encompass other things as well, like code and databases, but it started right where you are now.

      I suggest it because there are tools to search RDF for matching resources, based on subject and meaning, rather than just the appearance of certain words.
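
As a rough illustration of the idea only (this is not a real RDF library or serialization), metadata can be written as resource/property/value statements and then queried by subject instead of by raw word matches. The sketch is in Python; the property names borrow Dublin Core terms, and the file names and the helper function are made up.

```python
# Each statement is a (resource, property, value) triple, as in RDF.
# Property names borrow Dublin Core terms; the file names are hypothetical.
TRIPLES = [
    ("docs/orchard-guide.pdf", "dc:title", "Orchard Guide"),
    ("docs/orchard-guide.pdf", "dc:subject", "apples"),
    ("docs/orchard-guide.pdf", "dc:subject", "fruit growing"),
    ("docs/cider-manual.pdf", "dc:title", "Cider Manual"),
    ("docs/cider-manual.pdf", "dc:subject", "apples"),
    ("docs/cider-manual.pdf", "dc:subject", "fermentation"),
]

def resources_about(subject, triples=TRIPLES):
    """Return the resources tagged with the given dc:subject, sorted."""
    return sorted({r for r, p, v in triples if p == "dc:subject" and v == subject})

print(resources_about("apples"))
```

A document is found because it is declared to be about a subject, not because the word happens to appear somewhere in its text.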
        While I haven't gone looking quite yet, do you know if these other RDF solutions are Perl driven?
        I'm trying the "brute force" method because we need something quick, easy, and something that can be completely automated. I will have to at least look into this RDF you speak of.
