I gather that the file named "enwikisource-20090621-pages-articles.xml" is the huge 2.4 GB thing. There are two basic problems with the OP code:

  1. You should be using an XML parsing module to read that file; I would recommend the fundamental XML::Parser, which offers plenty of simplicity and flexibility, including a "Stream" style of processing, so that the whole file doesn't need to be memory-resident all at one time.

  2. You should do one pass over the big file to build an index based on the contents of the "title" elements, so that each index entry stores the location (byte offset) and size (byte count) of the content associated with that title. Then, as you read your set of query inputs, search the index for matches to the question text. If there's a match, just seek to the corresponding byte offset, read the specified number of bytes from that offset, and process that content for presentation as the "answer".

Figure out what tag it is that contains both a "title" and the sequence of "paragraphs" associated with a title. For each end-tag of that type, output an index entry that says what the title is, and what the byte range is for the whole container element.
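Here is a rough sketch of that indexing pass, not tested against the real dump. It assumes the container element is called "page", which is what MediaWiki export files normally use, so check your own file. It uses XML::Parser's plain Handlers interface rather than the "Stream" style, and the sub name build_index and the tab-separated index layout (offset, size, title) are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Assumption: each article sits in a <page> element that holds one <title>.
my $CONTAINER = 'page';

# Hypothetical helper: writes one "offset<TAB>size<TAB>title" line per container.
sub build_index {
    my ($xml_file, $index_file) = @_;
    my ($start, $title, $in_title);

    open my $out, '>:encoding(UTF-8)', $index_file
        or die "Cannot write $index_file: $!";

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element) = @_;
                if ($element eq $CONTAINER) {
                    # Byte offset of the '<' that opens the container.
                    $start = $expat->current_byte;
                    $title = '';
                }
                elsif ($element eq 'title' && defined $start) {
                    $in_title = 1;
                }
            },
            Char => sub {
                my ($expat, $text) = @_;
                # Text can arrive in several chunks, so append.
                $title .= $text if $in_title;
            },
            End => sub {
                my ($expat, $element) = @_;
                if ($element eq 'title') {
                    $in_title = 0;
                }
                elsif ($element eq $CONTAINER && defined $start) {
                    # current_byte is where the end tag begins, so add the
                    # length of the end tag as it appears in the file.
                    my $end = $expat->current_byte
                            + length($expat->original_string);
                    (my $clean = $title) =~ s/\s+/ /g;
                    $clean =~ s/^ | $//g;
                    print {$out} join("\t", $start, $end - $start, $clean), "\n";
                    undef $start;
                }
            },
        },
    );

    $parser->parsefile($xml_file);
    close $out or die "Cannot close $index_file: $!";
    return;
}

# Usage: perl build_index.pl big.xml titles.idx
build_index(@ARGV) if @ARGV == 2;
```

With a file this size it is worth spot-checking a few entries near the end of the index, to confirm the offsets past the 2 GB mark are correct on your perl and expat build.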

Searching the index for hits on a given query (and determining whether there are no hits at all) will be a lot quicker and more efficient than scanning the whole big 2.4 GB source file; you can probably load the entire index into memory at one time if you want (since it's only titles and byte offsets).
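A minimal sketch of the lookup side, using only core Perl. It assumes the index is a tab-separated file of offset, size and title (my own made-up layout, not anything the OP has), and the names load_index and find_titles are hypothetical. The matching is deliberately crude: a title is a hit if it contains every word of the query as a substring.

```perl
use strict;
use warnings;

# Read the whole index into memory as a list of hashes.
sub load_index {
    my ($index_file) = @_;
    open my $in, '<:encoding(UTF-8)', $index_file
        or die "Cannot read $index_file: $!";
    my @index;
    while (my $line = <$in>) {
        chomp $line;
        my ($offset, $size, $title) = split /\t/, $line, 3;
        next unless defined $title;    # skip malformed lines
        push @index, { offset => $offset, size => $size, title => $title };
    }
    close $in;
    return \@index;
}

# Return every index entry whose title contains all of the query words.
sub find_titles {
    my ($index, $query) = @_;
    my @words = grep { length } split /\W+/, lc $query;
    return () unless @words;
    return grep {
        my $title = lc $_->{title};
        # Keep the entry only if no query word is missing from the title.
        !grep { index($title, $_) < 0 } @words;
    } @$index;
}
```

Calling find_titles(load_index('titles.idx'), $query) in list context gives back the matching entries; an empty list means no hits, and you never had to touch the big file to find that out.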

Then, seeking into the big file to a designated start point and reading a given number of bytes will also be very quick, and if your indexing step was done properly, this portion of the file will, by itself, be a well-formed XML string which you can parse in order to present some subset (e.g. just the first paragraph).
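The seek-and-read step might look something like this, again core Perl only. The sub name read_entry is hypothetical, and the offset and size are assumed to be byte counts taken from your own index.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Return exactly $size bytes of $xml_file, starting at byte $offset.
sub read_entry {
    my ($xml_file, $offset, $size) = @_;

    # Raw mode, so that offsets are counted in bytes, not characters.
    open my $fh, '<:raw', $xml_file
        or die "Cannot read $xml_file: $!";
    seek $fh, $offset, SEEK_SET
        or die "Cannot seek to $offset in $xml_file: $!";

    my $buf = '';
    my $got = read $fh, $buf, $size;
    close $fh;

    die "Short read at offset $offset in $xml_file\n"
        unless defined $got && $got == $size;
    return $buf;
}
```

The string that comes back is raw bytes with no XML declaration in front of it, so a parser will treat it as UTF-8 by default. That should be fine if the dump itself is UTF-8, but it is an assumption to check before handing the string to XML::Parser's parse method.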

Use Super Search to look for nodes that show code using XML::Parser, and read its manual.

(update: other monks with broader experience in search-engine development can probably recommend modules that do a lot of the work for building and querying an index; e.g. you might want to look at KinoSearch. The OP approach to manipulating the query string seems rather coarse, and you could use some help with that part as well. And... what if the query doesn't match any title words, but does match some relevant terms in the paragraph data? Wouldn't a generic Lucene-style index and relevance search be more useful?)


In reply to Re: HUGE file poses risk to testing out code... need professional look-see by graff
in thread HUGE file poses risk to testing out code... need professional look-see by AI Cowboy
