I gather that the file named "enwikisource-20090621-pages-articles.xml" is the huge 2.4 GB thing. There are two basic problems with the OP code:
Figure out which tag contains both a "title" and the sequence of "paragraphs" associated with that title. For each end-tag of that type, output an index entry recording the title and the byte range of the whole container element.

Searching the index for hits on a given query (and determining whether there are no hits at all) will be much quicker and more efficient than scanning the whole 2.4 GB source file; since the index holds only titles and byte offsets, you can probably load the entire thing into memory at once.

Then, seeking into the big file to a designated start point and reading a given number of bytes will also be very quick, and if the indexing step was done properly, that slice of the file will, by itself, be a well-formed XML string which you can parse in order to present some subset (e.g. just the first paragraph).

Use Super Search to look for nodes that show code using XML::Parser, and read its manual.

(Update: other monks with broader experience in search-engine development can probably recommend modules that do much of the work of building and querying an index; you might want to look at KinoSearch, for example. The OP approach to manipulating the query string seems rather coarse, and you could use some help with that part as well. And what if the query doesn't match any title words, but does match some relevant terms in the paragraph data? Wouldn't a generic Lucene-style index and relevance search be more useful?)

In reply to Re: HUGE file poses risk to testing out code... need professional look-see
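To make the idea concrete, here is a minimal sketch of the index-then-seek approach. It assumes the container element is <page> with the title in a <title> child (as in MediaWiki-style dumps); a production version should use XML::Parser's Start/End handlers with current_byte() rather than this line-oriented scan, which records offsets at line granularity only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build an in-memory index: title => [ start_offset, byte_length ]
# for each <page>...</page> element in the dump file.
# Sketch only: assumes <page>, </page>, and <title> each appear on
# their own lines, so offsets are line-granular.
sub build_index {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    my ( %index, $start, $title );
    my $pos = 0;    # running byte offset; tell($fh) would also work
    while ( my $line = <$fh> ) {
        $start = $pos if $line =~ /<page>/;
        $title = $1   if $line =~ m{<title>([^<]*)</title>};
        if ( $line =~ m{</page>} ) {
            my $end = $pos + length($line);    # end of the </page> line
            $index{$title} = [ $start, $end - $start ] if defined $title;
            $title = undef;
        }
        $pos += length($line);
    }
    close $fh;
    return \%index;
}

# Seek straight to one page's byte range and read it back; the slice
# is itself a well-formed <page>...</page> fragment you can re-parse.
sub fetch_page {
    my ( $path, $index, $title ) = @_;
    my $entry = $index->{$title} or return undef;
    my ( $start, $len ) = @$entry;
    open my $fh, '<', $path or die "Can't open $path: $!";
    seek $fh, $start, 0 or die "seek failed: $!";
    read $fh, my ($buf), $len;
    close $fh;
    return $buf;
}
```

The point of splitting the work this way is that build_index runs once over the 2.4 GB file, while every later query touches only the small in-memory hash plus one seek-and-read.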
by graff