in reply to Find a specific word in a text file

Try to be more specific and detailed about what you need. "move forward or backward to grab another word nearby analysing character by character" could mean a lot of things, and the implementation will change depending on what it is that actually means.

However, let's assume that you've got the entire webpage in $page:

if ( ( my $position = index $page, "word" ) >= 0 ) { print "Found at $position.\n"; }

Once you've told us what you mean by that second part to the question we can help you to figure out how to "go forward or backward". ...it may be that the whole thing belongs in a regexp anyway.
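
As a rough sketch of what "go forward or backward" could look like with index and substr alone: the helper names find_all and context, and the sample string, are made up for illustration and are not part of the original question. index accepts an optional third argument giving the offset to start searching from, which is what lets the loop walk through every occurrence.

```perl
use strict;
use warnings;

# Return every offset at which $word occurs in $text.
sub find_all {
    my ( $text, $word ) = @_;
    my @positions;
    my $from = 0;
    while ( ( my $pos = index( $text, $word, $from ) ) >= 0 ) {
        push @positions, $pos;
        $from = $pos + 1;    # resume the search just past this hit
    }
    return @positions;
}

# Return up to $width characters before and after a hit of length $len.
sub context {
    my ( $text, $pos, $len, $width ) = @_;
    my $start = $pos - $width;
    $start = 0 if $start < 0;    # do not run off the front of the string
    my $before = substr( $text, $start, $pos - $start );
    my $after  = substr( $text, $pos + $len, $width );
    return ( $before, $after );
}

# Hypothetical stand-in for the fetched page.
my $page = 'body{font-family:arial} p{font-family:serif}';
my $word = 'font-family';

for my $pos ( find_all( $page, $word ) ) {
    my ( $before, $after ) = context( $page, $pos, length($word), 6 );
    print "Found at $pos: [$before] ... [$after]\n";
}
```

Whether this is enough depends on what "nearby" means; once the neighbouring text has any structure to it, a regexp is likely to be shorter.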


Dave

Re^2: Find a specific word in a text file
by algonquin (Sexton) on Sep 09, 2004 at 08:02 UTC
    Thanks for the help Dave. I've tried it
    use LWP::Simple; my $page = get("http://www.google.de"); if ( ( my $position = index $page, "font-family" ) >= 0 ) { print "Found at $position.\n"; }
    but it did not produce any results even though the source from the page contains:
    </title><style><!--body,td,a,p,.h{font-family:arial,sans-serif;}

      Be sure to do as is described in the POD for LWP::Simple: Check your return values for success.

      my $page = get("http://www.whoever.de"); die "Couldn't get the page!\n" unless defined $page;

      It could be that you're not even succeeding in fetching the page. Next, dump $page into a text file where you can examine it later to see if it really contains the text you're looking for (without any line-breaks, etc). If your HTML parsing needs get fairly elaborate, you might want to look at HTML::TokeParser anyway; use a powertool when a powertool is needed.
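
      A minimal sketch of the "dump it to a file" step, assuming $page already holds the fetched text. The helper name dump_to_file and the file name page_dump.html are invented for this example, and a literal string stands in for the result of get() so the snippet runs without a network connection.

```perl
use strict;
use warnings;
use File::Spec;

# Write $content to $path so it can be examined later in an editor.
sub dump_to_file {
    my ( $content, $path ) = @_;
    # Encode as UTF-8 so non-ASCII characters are written without warnings.
    open my $fh, '>:encoding(UTF-8)', $path
        or die "Cannot open $path for writing: $!\n";
    print {$fh} $content;
    close $fh or die "Cannot close $path: $!\n";
    return;
}

# Hypothetical stand-in for: my $page = get("http://www.whoever.de");
my $page = "<style>body,td{font-family:arial,sans-serif;}</style>\n";
die "Couldn't get the page!\n" unless defined $page;

my $dump_path = File::Spec->catfile( File::Spec->tmpdir, 'page_dump.html' );
dump_to_file( $page, $dump_path );
print "Wrote ", length($page), " characters to $dump_path\n";
```

      Opening the dump in an editor shows exactly what index is searching through, including any line breaks that split the text you expected to find.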


      Dave

        Thanks for your help Dave. TokeParser was a great Idea. It took a while but I got it working now. Thx.
Re^2: Find a specific word in a text file
by algonquin (Sexton) on Sep 09, 2004 at 08:39 UTC
    OK Dave now it works: Found at 5477. Now I need to move some blank spaces and 4 letters forward to grab five digits. Any ideas?
      davido has given you good advice. I would go further and say _don't_ parse HTML with a regex, even in apparently straightforward cases.

      There is often (always?) shed loads of arbitrary white space which can easily defeat a regex. The HTML can be 'loose', 'strict' and change every day!

      Once you've used HTML::TokeParser (if I can, anybody can!) you'll be able to reuse the code in any future apps.

      Have a look at this tutorial. There is an example here. Search and Supersearch will find many more.

      It does seem like a lot of trouble if you are in a hurry! But I assure you the effort will pay dividends.

      Best of luck, wfsp

        Thanks. I did just that (I used tokeparser). It took a while but it's worth it. Thx.

      I wish you had just gone ahead and asked the whole question at once as we requested. Breaking a single question into pieces and only feeding us one piece at a time might seem to be a good approach to you, but trust me, we can take bigger bites. We don't want to write your script for you, but if we're going to answer a question, at least let us answer the complete question.

      You really should be using HTML::TokeParser. Nevertheless, the following will use a fragile regexp to find a keyword, and grab the digits that immediately follow it (whitespace optional).

      if( $page =~ m/keyword\s*(\d+)/ ) { print "Found the keyword, and retrieved a value of $1\n"; }

      Now if your HTML has multiple instances of this keyword, you'll have to ask us another question, or read perlretut and perlrequick.
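
      For what it's worth, here is one possible sketch of the multiple-instances case. The helper name values_after and the sample text are invented for illustration. It relies on the /g modifier, which in list context returns every capture rather than only the first, and uses \s* so that whitespace between the keyword and the digits stays optional.

```perl
use strict;
use warnings;

# Return every run of digits that directly follows $keyword in $text.
sub values_after {
    my ( $text, $keyword ) = @_;
    # \Q...\E quotes any regexp metacharacters in the keyword.
    # With /g in list context the match returns all captured groups.
    my @values = $text =~ m/\Q$keyword\E\s*(\d+)/g;
    return @values;
}

# Hypothetical sample text: three of the four keywords are followed by digits.
my $page   = "keyword 10 <b>keyword</b> keyword\n  25 keyword7";
my @values = values_after( $page, 'keyword' );

print "Found ", scalar(@values), " values: @values\n";
```

      This is just as fragile as the single-match version: any markup between the keyword and the number breaks it, which is one more argument for HTML::TokeParser.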

      By the way, if you're screen-scraping Google you are violating their Terms of Service, and exposing yourself to civil liability. From the Google Terms of Service page:

      No Automated Querying
      You may not send automated queries of any sort to Google's system without express permission in advance from Google. Note that "sending automated queries" includes, among other things:

      • using any software which sends queries to Google to determine how a website or webpage "ranks" on Google for various queries;
      • "meta-searching" Google; and
      • performing "offline" searches on Google.

      Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted.


      Dave

      Unless I'm mistaken, it appears that what you're trying to do can be solved with a fairly simple regular expression. Unless you actually need the index, try:
      if($page =~ m/\bword\b\s+.{4}(\d{5})/) { print "The number is $1\n"; } else { print "No match\n"; }
      From left-to-right, the expression states:

      \bword\b - Find the text 'word' as a whole word (\b marks a word boundary)
      \s+ - Followed by one or more whitespace characters
      .{4} - Followed by any four characters
      (\d{5}) - Followed by five digits

      The parentheses around the '\d{5}' instruct perl to capture that part of the match, so the matched digits are stored in the variable $1.
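
      To try the expression out, here is a small sketch that wraps it in a function and runs it against made-up input. The name digits_after and the sample strings are assumptions for illustration only, since the actual page text was never posted.

```perl
use strict;
use warnings;

# Return the five digits found after $word, some whitespace and four
# further characters, or undef when the text does not have that shape.
sub digits_after {
    my ( $text, $word ) = @_;
    return $text =~ m/\b\Q$word\E\b\s+.{4}(\d{5})/ ? $1 : undef;
}

# Hypothetical samples; the first has the expected layout, the second does not.
my @samples = (
    "Postleitzahl   PLZ:80331 Muenchen",
    "Postleitzahl unbekannt",
);

for my $sample (@samples) {
    my $digits = digits_after( $sample, 'Postleitzahl' );
    if ( defined $digits ) {
        print "The number is $digits\n";
    }
    else {
        print "No match\n";
    }
}
```

      Two things to keep in mind: '.' does not match a newline unless the /s modifier is added, and '.{4}' accepts any four characters, spaces included, so the pattern is looser than "4 letters" suggests.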