perlaintdead has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, oh wise monks! I downloaded the whole of Wikipedia and have built an index of every entry. I built the index with the following code:

use strict;
use warnings;

open( WIKI,  "<",  "F:/wiki/enwiki-20130102-pages-articles.xml" ) or die "open wiki: $!";
open( INDEX, "+<", "F:/wiki/wiki.index" )                         or die "open index: $!";

my $entry;
my $title;
while (<WIKI>) {
    if ( ( index $_, "<title>" ) > -1 ) {
        $title = $_;
        $title =~ s/.*?<title>//;      # drop everything up to and including <title>
        $title =~ s/<\/title>.*//s;    # drop </title> and the rest of the line
        $entry = $title . "::" . $. . "\n";
        syswrite INDEX, $entry;
        print "line ", $., " : $title done\n";
    }
}
close(INDEX);
close(WIKI);

So each entry begins with the line number the title was found on, then "::", and then the title itself. My question is: how would I "jump" to a specific line without having to rifle through every line of the file? I am familiar with things like binary searches and would also like to implement search functionality (but that's not very relevant to the question).

Any help would be appreciated.

Update: The index just finished building and ended up being almost half a gigabyte.

Update: Turns out I put the variables in the wrong places in the index code. No trouble; Notepad++ has regexes.

Replies are listed 'Best First'.
Re: Read specific line(s)
by MidLifeXis (Monsignor) on Dec 12, 2013 at 13:52 UTC

    Perhaps something using seek and tell would be helpful. You might want to use the file location (from tell) instead of (or in addition to) the line number.
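
    For instance, the index could store the value of tell taken just before each line is read, and a lookup could then seek straight to that byte position. A rough, untested sketch of the idea (the wiki.offsets file name and the tab-separated layout are my own assumptions, not part of the original code):

        use strict;
        use warnings;

        open my $wiki,  '<', 'F:/wiki/enwiki-20130102-pages-articles.xml' or die $!;
        open my $index, '>', 'F:/wiki/wiki.offsets'                       or die $!;

        my $offset = tell $wiki;                  # byte position of the line about to be read
        while ( my $line = <$wiki> ) {
            if ( $line =~ m{<title>(.*?)</title>} ) {
                print {$index} "$1\t$offset\n";   # title, tab, byte offset of its line
            }
            $offset = tell $wiki;                 # position of the next line
        }

        close $index;
        close $wiki;

    A stored offset can then be handed to seek on the big XML file, so only the wanted line is actually read.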

    --MidLifeXis

      Hmmm. Yes, but I will have to make changes to the index as well as my approach. It's probably worth it, though. Thanks.
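
      Something like this is roughly what I have in mind for the lookup side, assuming the index is rewritten as tab-separated "title<TAB>offset" lines as sketched above (untested, and the file names are just placeholders):

          use strict;
          use warnings;

          my $want = $ARGV[0] // die "usage: $0 <title>\n";

          # Scan the (comparatively small) index for the requested title's byte offset.
          open my $index, '<', 'F:/wiki/wiki.offsets' or die $!;
          my $offset;
          while (<$index>) {
              chomp;
              my ( $title, $pos ) = split /\t/, $_, 2;
              if ( $title eq $want ) { $offset = $pos; last }
          }
          close $index;
          die "'$want' not in index\n" unless defined $offset;

          # Jump straight to that position in the big XML file; no rifling through it.
          open my $wiki, '<', 'F:/wiki/enwiki-20130102-pages-articles.xml' or die $!;
          seek $wiki, $offset, 0;    # 0 = SEEK_SET: absolute byte position
          print scalar <$wiki>;      # the <title> line itself
          close $wiki;

      If the index were also kept sorted by title, the linear scan could later be swapped for a binary search.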

Re: Read specific line(s)
by ww (Archbishop) on Dec 12, 2013 at 17:23 UTC

    "...the whole of Wikipedia"?!!!

    And, of course, you read and abided by this part of their Terms of Use:

    • 4. Refraining from Certain Activities
      • Engaging in Disruptive and Illegal Misuse of Facilities
        • Engaging in automated uses of the site that are abusive or disruptive of the services and have not been approved by the Wikimedia community;
        • Disrupting the services by placing an undue burden on a Project website or the networks or servers connected with a Project website;

    Quis custodiet ipsos custodes? Juvenal, Satires

      They provide gzipped files of the entire DB for download in many languages.


        Missed that; my 'oops!'