Hello Monks, I'm currently trying to strip the markup from a PHP-generated web page. I started out by printing the raw contents of the page I'm dealing with, which gave me the following:
<html> <head> <link rel="stylesheet" href="style.css" type="text/css"> <meta name="generator" content="Bluefish"> <meta http-equiv="content-type" content="text/html;charset=iso +-8859-1"> <title>Who's Registered - Find Companies Registered to ISO 900 +0, ISO 14000 and/or related sector-specific standards</title> <meta name="keywords" content="ISO 9000 ISO 9001:1994 ISO 9001 +:2000 ISO 9002:1994 9003:1994 AS 9000 TL 9000 ISO/TS TE Supplement AS + 9100 EN 46001 ISO 13485 RC 14001 ISO 14001:1996 ISO 14001:2004 OHSAS + 18001 ISO 14001"> <meta name="description" content="WhosRegistered.com Global is + the worlds largest free global listing of certified suppliers to ISO + 9000, ISO/TS 16949, TL 9000, AS9100 and ISO 14001 anywhere in the wo +rld."> </head> <body bgcolor="#ffffff" leftmargin="0" marginwidth="0" topmargin=" +0" marginheight="0" link="yellow" vlink="yellow"> <table border="0" cellpadding="3" cellspacing="0" width="1 +00%" align="center"> <tr height="120"> <!--<td colspan="2" valign="top" align="left" heig +ht="120" class="logoarea"><a href="http://www.whosregistered.com/"> < +img src="images/dartboard.gif" border="0"> </a></td>--> <td colspan="1" valign="top" align="left" height=" +120" class="logoarea"><a href="http://www.whosregistered.com/"> <img +src="images/dartboard.gif" border="0"> </a></td> <td colspan="1" valign="middle" align="center" hei +ght="120" class="logoarea"><a href="http://www.whosregistered.com/plu +gins/phpAdsNew/click.php?bannerID=7"><img src="http://www.whosregiste +red.com/plugins/phpAdsNew/viewbanner.php?bannerID=7" width=468 height +=60 alt="Who's Who in China" border=0></a> </td> <readmore> <td height="120" class="logoarea" align="right"><a + href="http://www.qsuonline.com/cart/DirectoriesSoftware.html#9KRCDca +rt" target="_blank"><img src="images/RCDad.jpg" border="0"></a></td> </tr> <tr height="25"> <td height="25" valign="top" align="left" width="1 +20" class="topthinline"><img src="images/corner.jpg" width="25" heigh +t="25" 
border="0" class="topthinline2"></td> <td width="190" height="25" class="topthinline"></ +td> <td height="25" class="topthinline" align="right"> <!--Number of records in the database: --></td> <!-- </tr> --> <tr> <td width="120" valign="top" align="center" class= +"menu"> <table border="0" cellpadding="6" cellspacing= +"0" width="120" class="smallfont" style="font-family: verdana,arial;" +> <tr> <td align="right" width="20"></td> <td><a href="http://www.whosregistered +.com/">Home</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php" target="_new">Using WhosRegistered.com Global</a> +</td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/press.php" target="_new">News Releases</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.qsuonline.com/ +SubmissionInstructions/MainPage.html" target="_new">Information For R +egistrars/Certification Bodies</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#certs" target="_new">Management System Certificati +on</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#iso9000" target="_new">ISO 9000</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#iso14001" target="_new">ISO 14001</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#sector" target="_new">Sector Programs</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#monitor" target="_new">Supplier Monitoring</a></td +> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered 
+.com/iso/intro.php#feedback" target="_new">Supplier Feedback</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#submit" target="_new">Submitting Data</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#corrections" target="_new">Correcting Data</a></td +> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#accred" target="_new">Accreditation Marks</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#tips" target="_new">Tips for Purchasing Agents</a> +</td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#roi" target="_new">Return on Investment Survey</a> +</td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#registrars" target="_new">Find Registrars</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#journals" target="_new">Professional Journals</a>< +/td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#booksvideosoftware" target="_new">Books, Videos, S +oftware</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.whosregistered +.com/iso/intro.php#minisearch" target="_new">Add WhosRegistered.com G +lobal to Your Website</a></td> </tr> <tr> <td width="20" align="right" valign="t +op"></td> <td><a href="http://www.qsuonline.com/ +BodyPages/Aboutus.html" target="_new">About QSU Publishing</a></td> </tr> <tr height="15"> <td width="20" height="15"></td> <td height="15"></td> </tr> <tr align="right" height="15"> <td align="left" width="20" 
height="15 +"></td> <td align="left" height="15"></td> </tr> <tr align="right" height="15"> <td align="left" width="20" height="15 +"></td> <td align="left" height="15"></td> </tr> <tr align="right" height="15"> <td align="left" width="20" height="15 +"></td> <td align="left" height="15"></td> </tr> <tr align="right"> <td align="left" width="20"></td> <td align="right" valign="bottom"></td +> </tr> <tr align="right"> <td align="left" width="20"></td> <td align="left"></td> </tr> </table> <p></p> </td> <td rowspan="2" colspan="2" valign="top" align="le +ft" class="mainbox"><img src="images/corner2.jpg" width="25" height=" +25" border="0" class="corner2"><br> <div align="center" width="675"> <p><center><h3>Welcome to WhosRegistered.com G +lobal</h3></center></p> <table width="550"> <tr><td> <p class="mainbox">The worlds largest free glo +bal listing of certified suppliers to ISO 9000, ISO/TS 16949, TL 9000 +, AS9100 and ISO 14001 anywhere in the world. Search by company name, + location even products and services listed in the scope of certific +ation. 
WhosRegistered.com takes the hassle out of finding certified c +ompanies.</p> </td></tr> </table> <table cellpadding="0" cellspacing="0" + border="0" bgcolor="#FFFFFF" width="650"> <tr height="30"> <td height="30" colspan=5> </td> </tr> <tr height="30"> <td width="30" height="30"></td> <td width="30" height="30" class="searchbox"> <img src="images/search-topleft.jpg" width="30" height +="30" border="0"> </td> <td height="30" class="searchbox" valign="abstop" align="r +ight"> </td> <td width="30" height="30" valign="abstop" align="right" c +lass="searchbox"> <img src="images/search-topright.jpg" width="30" heigh +t="30" class="searchboximage"> </td> <td width="30" height="30"></td> </tr> <tr> <td width="30"></td> <td width="30" class="searchbox"> </td> <td valign="middle" class="searchbox"> <!-- stage 2 --><br> <div align="center">106558 records found </div><br> <div align="center">You are on page 1 out of 3552 +total pages<br>&nbsp;<a href="./form.php?Company=&city=&sp=&country=U +nited+States&certificate_number=&Scope=&registrar_secret=&begin=0&sta +ge=2"><<</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=-60&stage=2"><</a>&nbsp; <! 
start page 1 end page 11 -->1&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=30&stage=2">2</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=60&stage=2">3</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=90&stage=2">4</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=120&stage=2">5</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=150&stage=2">6</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=180&stage=2">7</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=210&stage=2">8</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=240&stage=2">9</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=270&stage=2">10</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=300&stage=2">11</a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=-30&stage=2">></a>&nbsp; <a href="./form.php?Company=&city=&sp=&country=United+States&certifica +te_number=&Scope=&registrar_secret=&begin=106530&stage=2">>></a>&nbsp +; <table class="searchbox"> <tr> <td width="10"></td> <td width="20"></td> <td width="10"></td> <td width="100"><b>Company</b></td> <td width="10"></td> <td width="50"><b>City</b></td> <td width="10"></td> <td width="50"><b>State or Province</b></td> <td width="10"></td> <td 
width="30"><b>Country</b></td> <td width="10"></td> <td width="30"><b>Certificate Number</b></td> <td width="10"></td>
... and it continues similarly until the end of the page (30 entries).
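I then used HTML::Strip:

    use HTML::Strip;

    my $hs = HTML::Strip->new();
    # $pageCheck is the object holding the fetched page (set up earlier, not shown)
    my $page = $pageCheck->content;
    my $clean_text = $hs->parse( $page );
    print $clean_text;

and this is my output: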
Home Using WhosRegistered.com Global News Releases Information For Registrars/Certification Bodies Management System Certification ISO 9000 ISO 14001 Sector Programs Supplier Monitoring Supplier Feedback Submitting Data Correcting Data Accreditation Marks Tips for Purchasing Agents Return on Investment Survey Find Registrars Professional Journals Books, Videos, Software Add WhosRegistered.com Global to Your Website About QSU Publishing Welcome to WhosRegistered.com Global 106558 records found You are on page 1 out of 3552 total pages
The output stops at the page counter and doesn't contain any of the entries I am actually after. Any ideas on why this may be happening, or a possible solution? Thanks again, monks, for your help!

In reply to Strip PHP page by bauer1sc
