First off, you need to define the problem a little better. What exactly do you want to index? Every word in the document? Every word in headings? Some collection of keywords (chosen on the fly or predetermined in some fashion)?

When you have built your index, what do you want to do with it? Knowing that will dictate, to some extent, the data structures you need to hold the index as you create it.
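To make the data structure point concrete, here is a minimal sketch that assumes the goal is "which documents contain this word?". In that case a hash of hashes (word => { document => count }) is one reasonable shape. The sample %docs data and the docs_for helper are made up for illustration, and the text is assumed to have been extracted from the HTML already.

```perl
use strict;
use warnings;

# Hypothetical sample data: document name => already-extracted plain text.
my %docs = (
    'a.html' => 'Perl saves trees',
    'b.html' => 'Trees need water and perl needs modules',
);

# Inverted index: word => { document => number of occurrences }.
my %index;
for my $doc (keys %docs) {
    for my $word (lc($docs{$doc}) =~ /\w+/g) {
        $index{$word}{$doc}++;
    }
}

# Hypothetical helper: sorted list of documents containing a word.
sub docs_for {
    my ($word) = @_;
    return sort keys %{ $index{ lc($word) } || {} };
}

print join(', ', docs_for('trees')), "\n";    # prints "a.html, b.html"
```

If you wanted phrase searches or ranked results instead, you would likely need to store word positions or weights rather than plain counts, which is why the "what do you want to do with it?" question comes first.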

Once you have sorted out those questions, you can start thinking about coding. At that point I'd take a good look at some of the HTML modules - HTML::TreeBuilder is a good starting point for this sort of task.
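As a rough sketch of what a starting point with HTML::TreeBuilder might look like, assuming you decided to index only the text of headings: the inline $html sample is invented for the example, and in practice you would probably parse a file (for instance with new_from_file) instead.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input; a real script would read this from a file.
my $html = <<'HTML';
<html><body>
<h1>Indexing HTML</h1>
<p>Body text is ignored here.</p>
<h2>Choosing words</h2>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Collect the text of every h1..h6 element found in the tree.
my @headings = map { $_->as_trimmed_text }
               $tree->look_down(_tag => qr/^h[1-6]$/);

$tree->delete;    # release the parse tree once we are done with it

print "$_\n" for @headings;
```

The resulting strings can then be split into words and fed into whatever index structure you settled on.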


Perl is environmentally friendly - it saves trees

In reply to Re: Search on html files by GrandFather
in thread Search on html files by vsailas
