I gather that the file named "enwikisource-20090621-pages-articles.xml" is the huge 2.4 GB thing. There are two basic things the OP code should be doing differently:

  1. You should be using an XML parsing module to read that file; I would recommend the fundamental XML::Parser, which offers plenty of simplicity and flexibility, including a "Stream" style of processing, so that the whole file doesn't need to be memory-resident all at one time.

  2. You should do one pass over the big file to build an index based on the contents of the "title" elements, so that each index entry stores the location (byte offset) and size (byte count) of the content associated with each title. Then, as you read your set of query inputs, search the index for matches to the question text. If there's a match, just seek to the corresponding byte offset, read the specified number of bytes from that offset, and process that content for presentation as the "answer".

Figure out what tag it is that contains both a "title" and the sequence of "paragraphs" associated with a title. For each end-tag of that type, output an index entry that says what the title is, and what the byte range is for the whole container element.
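The indexing pass described above might look roughly like the following. This is only a sketch, written in Python because its expat binding reports byte positions through `CurrentByteIndex`; XML::Parser's Expat object offers a comparable `current_byte` method, so the same shape should carry over to Perl. The element names `page` and `title` and the function name `build_title_index` are assumptions for illustration, not something taken from the OP's data.

```python
import io
import xml.parsers.expat


def build_title_index(stream, container="page", title_tag="title"):
    """Stream-parse `stream` (a binary file object) and return a list of
    (title, byte_offset, byte_count) tuples, one per container element.

    `container` and `title_tag` are hypothetical names; substitute whatever
    tags the real file uses.
    """
    parser = xml.parsers.expat.ParserCreate()
    index = []
    state = {"start": None, "in_title": False, "title": []}

    def on_start(name, attrs):
        if name == container:
            # Byte offset of the "<" that opens the container element.
            state["start"] = parser.CurrentByteIndex
            state["title"] = []
        elif name == title_tag and state["start"] is not None:
            state["in_title"] = True

    def on_chars(data):
        # Character data may arrive in several pieces, so collect them.
        if state["in_title"]:
            state["title"].append(data)

    def on_end(name):
        if name == title_tag:
            state["in_title"] = False
        elif name == container and state["start"] is not None:
            # CurrentByteIndex points at the "<" of the end tag. Assumption:
            # the end tag is written as "</name>" with no inner whitespace,
            # the file is UTF-8, and the container is not self-closing.
            end_tag_len = len(("</%s>" % name).encode("utf-8"))
            stop = parser.CurrentByteIndex + end_tag_len
            title = "".join(state["title"])
            index.append((title, state["start"], stop - state["start"]))
            state["start"] = None

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_chars
    parser.ParseFile(stream)  # reads in chunks, so the file is never fully in memory
    return index
```

Calling `build_title_index(open(path, "rb"))` yields entries that can be written out one per line (title, offset, size) for later use. With a 2.4 GB file the offsets go past 2**31, so it is worth spot-checking a few entries near the end of the file against the actual bytes.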

Searching the index for hits on a given query (and determining whether there are no hits at all) will be a lot quicker and more efficient than scanning the whole big 2.4 GB source file; you can probably load the entire index into memory at one time if you want (since it's only titles and byte offsets).
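As a rough illustration of the in-memory lookup (a Python sketch, not a recommendation of any particular search module), the index can be turned into a word-to-entries table and queried by counting how many query words each title shares. The entry layout `(title, byte_offset, byte_count)` and the names `build_word_lookup` and `find_titles` are assumptions made for this example.

```python
import re
from collections import Counter, defaultdict


def words_of(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def build_word_lookup(index):
    """Map each title word to the index entries whose title contains it.

    `index` is assumed to be an iterable of (title, byte_offset, byte_count).
    """
    lookup = defaultdict(list)
    for entry in index:
        for word in words_of(entry[0]):
            lookup[word].append(entry)
    return lookup


def find_titles(lookup, query):
    """Return matching entries, best match (most shared words) first.

    An empty list means the query hit no title at all.
    """
    scores = Counter()
    for word in words_of(query):
        for entry in lookup.get(word, ()):
            scores[entry] += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [entry for entry, _ in ranked]
```

Counting shared words is a deliberately crude ranking; it only shows that the "no hits" case falls out for free once the titles are in a hash.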

Then, seeking into the big file to a designated start point and reading a given number of bytes will also be very quick, and if your indexing step was done properly, this portion of the file will, by itself, be a well-formed xml string which you can parse in order to present some subset (e.g. just the first paragraph).
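The seek-and-read step can be sketched like this, again in Python and under stated assumptions: the byte offset and count come from an index built beforehand, and the paragraph tag name `p` plus the helper names `read_fragment` and `first_paragraph` are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_fragment(path, offset, size):
    """Read exactly `size` bytes starting at byte `offset` of the file at `path`."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(size)


def first_paragraph(fragment, para_tag="p"):
    """Parse one container element (as bytes) and return the text of its
    first `para_tag` descendant, or None if there is none.

    `para_tag` is a placeholder; use whatever element holds the body text.
    """
    elem = ET.fromstring(fragment)
    para = elem.find(".//" + para_tag)
    if para is None:
        return None
    return "".join(para.itertext()).strip()
```

One caveat: the fragment parses on its own only if it does not rely on namespace prefixes or entities declared elsewhere in the big file. If it does, the parser will reject it, and the missing declarations would have to be supplied around the fragment before parsing.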

Use Super Search to look for nodes that show code using XML::Parser, and read its manual.

(update: other monks with broader experience in search-engine development can probably recommend modules that do a lot of the work for building and querying an index; e.g. you might want to look at KinoSearch. The OP approach to manipulating the query string seems rather coarse, and you could use some help with that part as well. And... what if the query doesn't match any title words, but does match some relevant terms in the paragraph data? Wouldn't a generic Lucene-style index and relevance search be more useful?)


In reply to Re: HUGE file poses risk to testing out code... need professional look-see by graff
in thread HUGE file poses risk to testing out code... need professional look-see by AI Cowboy
