coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:
For some reason that's not matching with this data...

while($google_results =~ m|<p\sclass=g><a href=(http://www\..+)\sonmousedown|gs) { push (@links_found, $1); }

As you can see, I am going through a lot of junk just to get all the URLs from the search page. Can someone help tweak my regex a bit?

<p class=g><a href=http://www.ets.org/toefl/ onmousedown="return clk(this,'res',1)">Welcome to TOEFL: The <b>Test</b> of English as a Foreign Language</a><br><font size=-1>Information about the TOEFL tests and services are available online. Try the<br> TOEFL practice questions.<br><font color=#008000>www.ets.org/toefl/ - 18k - </font><nobr> <a class=fl href="http://64.233.161.104/search?q=cache:Gq-KV5uuj6YJ:www.ets.org/toefl/+test&hl=en">Cached</a> - <a class=fl href="/search?hl=en&lr=&safe=off&q=related:www.ets.org/toefl/">Similar pages</a></nobr></font> <blockquote class=g><p class=g><a href=http://www.ets.org/testcoll/ onmousedown="return clk(this,'res',2)">The ETS <b>Test</b> Collection includes an extensive library of 20000 <b>...</b></a><br><font size=-1>The ETS <b>Test</b> Collection includes an extensive library of 20000 tests and other<br> measurement devices from the early 1900s to the present.</<br><font color=#008000>www.ets.org/<b>test</b>coll/ - 11k - </font><nobr> <a class=fl href="http://64.233.161.104/search?q=cache:mY1iJUWuYoEJ:www.ets.org/testcoll/+test&hl=en">Cached</a> - <a class=fl href="/search?hl=en&lr=&safe=off&q=related:www.ets.org/testcoll/">Similar pages</a></nobr></font> </blockquote><p class=g><a href=http://www.test.com/ onmousedown="return clk(this,'res',3)">
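
A minimal sketch of one possible tweak, not a definitive fix. A likely culprit is the greedy `.+` combined with the `s` modifier: the dot can then cross newlines, so the first match tends to run from the first `http://www.` all the way to the last `onmousedown` in the page, leaving one oversized capture instead of one URL per result. The sketch below swaps `.+` for the character class `[^\s>]+`, which stops at the first whitespace or `>`. The sample string is an abbreviated copy of the markup above, and the approach assumes the result links keep this exact unquoted `href=` layout.

```perl
use strict;
use warnings;

# Abbreviated copy of the sample results markup from the question.
my $google_results = join '',
    q{<p class=g><a href=http://www.ets.org/toefl/ onmousedown="return clk(this,'res',1)">Welcome to TOEFL</a>},
    q{<blockquote class=g><p class=g><a href=http://www.ets.org/testcoll/ onmousedown="return clk(this,'res',2)">The ETS Test Collection</a></blockquote>},
    q{<p class=g><a href=http://www.test.com/ onmousedown="return clk(this,'res',3)">};

my @links_found;

# [^\s>]+ stops at the first whitespace or '>', so each match captures a
# single URL instead of running on to the last "onmousedown" in the page.
while ($google_results =~ m{<p\sclass=g><a\shref=(http://www\.[^\s>]+)\sonmousedown}g) {
    push @links_found, $1;
}

print "$_\n" for @links_found;
```

On the abbreviated sample this prints the three result URLs, one per line. A pattern like this is tied to the page layout, so it would need adjusting whenever the markup changes.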

Replies are listed 'Best First'.

Re: page parsing regex
by gellyfish (Monsignor) on May 12, 2005 at 08:57 UTC

Re: page parsing regex
by tlm (Prior) on May 12, 2005 at 08:12 UTC
by Anonymous Monk on May 12, 2005 at 08:34 UTC
by tlm (Prior) on May 12, 2005 at 09:08 UTC

Re: page parsing regex
by Animator (Hermit) on May 12, 2005 at 08:28 UTC