sellwill has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am pretty new to Perl (and programming). I have a script that is pretty much exactly what I need--the only problem is that it returns only the first match +/-10 chars. How can I modify it to return ALL matches in a document instead of just the first?

    my $string = 'string to search for';
    open (TEXTFILE, 'searchfile.txt');
    $_ = join('',<TEXTFILE>);
    close (TEXTFILE);
    while (/(.{0,10})$string(.{0,10})/gis) {
        print "Found: $1$string$2\n";
    }

Replies are listed 'Best First'.
Re: Return multiple matches in file
by ww (Archbishop) on Jun 14, 2011 at 00:35 UTC
    Yes, you're going about it the wrong way. Read the thread; read the tut; read the docs. Simply copying code won't teach you much (about which, more, below).

    The dot in your regex stands for any character, whatsoever. That's not what you want.

    And, * NOT * just BTW, the quantifier {0,10} makes no sense on several grounds.

    1. In valid (and useful) html, img alt="" src= will never be followed by zero characters.
    2. In most of the html I see, you can't count on getting the address link immediately after the alt description.
    3. Many webmonkeys, me included, prefer to see the source address first (and don't usually leave the alternate description empty).
    4. Since you're apparently looking for the address of the image, your limit of 10 chars is unlikely to capture the whole address.

    I suspect what you've done here is "cargo culted" some code you didn't understand. I understand that that's part of one style for learning, but it's dangerous if you don't study the code and the Perl documentation well enough to be sure you * DO * understand before using it.
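To make the points about the dot and the {0,10} quantifier concrete, here is a minimal sketch that runs both styles of pattern against a shortened version of the sample markup from this thread. The variable names and the trimmed $html string are assumptions made for the example, not part of anyone's posted code.

```perl
use strict;
use warnings;

# Shortened stand-in for the sample markup (an assumption for this sketch).
my $html = '<img alt="" src="http://www.xyz/12345.jpg" title="a">'
         . '<img alt="" src="http://i.xyz.com/mSHi8b.jpg">';

# Escape the search term so it is matched literally.
my $prefix = quotemeta 'img alt="" src="';

# .{0,10} takes at most ten characters of any kind, so the address is cut short.
my @clipped = $html =~ /$prefix(.{0,10})/gs;

# [^"]* takes everything up to (but not including) the next double quote.
my @whole = $html =~ /$prefix([^"]*)/g;

print "clipped: $_\n" for @clipped;
print "whole:   $_\n" for @whole;
```

In list context a /g match returns every capture, so @clipped holds the truncated addresses and @whole holds the complete ones.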

Re: Return multiple matches in file
by davido (Cardinal) on Jun 13, 2011 at 23:11 UTC

    Show us a few lines of the input file, containing at least two places where it should match. And show us the actual string you're trying to search for.

    Where did the code come from?


    Dave

      Hi Dave--thanks for getting back to me!

      I actually got the script off of this site (http://www.perlmonks.org/?node_id=98208)

      I am searching for: "img alt="" src=" and would like the script to then return the 10 characters after that.

      Below is a sample of the text I am going through. For this example, it will only return the first (<img alt="" src="http://www.xyz/12345.jpg"), and not any other matches. Am I going about this the correct way, or would some sort of wildcard search be easier?

      <img alt="" src="http://www.xyz/12345.jpg" original-title="lol&lt;p&gt;&lt;span class='points-vkL2Q'&gt;497&lt;/span&gt;&nbsp;&lt;span class='points-text-vkL2Q'&gt;points&lt;/span&gt; : 456 : 4 days&lt;/p&gt;"></a><div class="hover"><div class="arrows"><div title="like" class="arrow up " data="vkL2Q" type="image"></div><div title="dislike" class="arrow down " data="vkL2Q" type="image"></div></div></div></div><div id="mSHi8" class="post"><a href="/gallery/mSHi8"><img alt="" src="http://i.xyz.com/mSHi8b.jpg"

        At minimum, change your string to $string = quotemeta 'String to search for';. quotemeta ensures that $string is interpreted as a literal string, rather than having the characters you're searching for mistakenly treated as regexp special characters.
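As a rough illustration of what quotemeta changes, the sketch below uses a made-up search term containing regex metacharacters (the strings and variable names are assumptions chosen for the example). It compares an unescaped interpolation with quotemeta and with the equivalent inline \Q...\E form.

```perl
use strict;
use warnings;

# Hypothetical search term and text; "+" is a regex quantifier.
my $raw  = 'a+b=c';
my $text = 'solve a+b=c';

# Unescaped, the pattern means "one or more a, then b=c",
# which does not occur in $text.
my $unescaped = $text =~ /$raw/ ? 1 : 0;

# quotemeta backslash-escapes the non-word characters.
my $safe    = quotemeta $raw;
my $escaped = $text =~ /$safe/ ? 1 : 0;

# \Q...\E applies the same escaping inside the pattern.
my $inline = $text =~ /\Q$raw\E/ ? 1 : 0;

print "unescaped=$unescaped escaped=$escaped inline=$inline\n";
```

Only the escaped forms treat the plus sign as a literal character, so only they find the term in $text.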

        But just a couple days ago I was playing with HTML::LinkExtor, and I have to say, I think it's a much more robust solution for what you're trying to accomplish. It wouldn't exactly fit your specification, but it would probably be such a nice solution that you would reconsider how to solve the greater problem.

        By the way, that code you found in Categorized Questions and Answers is... well, at best outdated. That's why I was asking. Embarrassing that it still sits there in Q&A. I took the liberty of updating the Q&A post with some more up to date code.


        Dave

Re: Return multiple matches in file
by ww (Archbishop) on Jun 14, 2011 at 02:32 UTC

    Here's another approach... also more in keeping with today's norms (be sure to review and understand davido's code in the Q&A section) and NOT reliant on assumptions about the order in which the coder constructed the image link (see Re: Return multiple matches in file).

    my @string = qq 'img src="';
    push @string, qq 'alt="';
    # We're going to do the matching * ONE * step at a time, for reasons cited above.
    # Hence, the array (which could have been created all on one line)
    # which does NOT assume the order (in the source data) of the search terms.

    open (my $INFILE, '<', 'htmFOR909469.html') or die "Can't open htmFOR909469.html ", $!;
    my $text = join('',<$INFILE>);
    close ($INFILE);

    for my $string (@string) {
        while ( $text =~ /$string([^\"]*)/gis ) {
            # Regex matches and captures anything after $string that is NOT a double quote
            print "Found: $1 --- Searchterm was: $string \n";
        }
    }

      Thanks for all the suggestions guys!

      I know that I do need to spend some time understanding Perl better, and not just standing on the shoulders of better programmers and using stuff I don't really understand.

Re: Return multiple matches in file
by jethro (Monsignor) on Jun 14, 2011 at 01:07 UTC

    The interesting thing with this old code is: it should work, i.e. it should find more than one occurrence, but it doesn't. Somehow //g doesn't work when a variable is inserted into the regex, even though that variable isn't changed.

    Anybody have an explanation for that?
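One way to probe this is a small self-contained test. In the sketch below, the input string and variable names are made up for illustration and are not taken from the poster's file. A while loop using //g with an interpolated variable visits both occurrences, and pos() shows where each following attempt resumes.

```perl
use strict;
use warnings;

# Hypothetical input with two occurrences of the search term.
my $text   = 'xx img src="a.jpg" yy img src="b.jpg" zz';
my $string = quotemeta 'img src="';

my @found;
while ( $text =~ /$string([^"]*)/g ) {
    # pos() is the offset at which the next /g attempt will start.
    push @found, [ $1, pos($text) ];
}

printf "%s (search resumes at offset %d)\n", @$_ for @found;
```

In scalar context each successful //g match advances pos($text), so the loop ends only when no further match is found.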

Re: Return multiple matches in file
by planetscape (Chancellor) on Jun 15, 2011 at 06:34 UTC