in reply to Re: Strip PHP page
in thread Strip PHP page

The entries appear on the site as a table:
Company City State or Province Country Certificate Number
More ZF Boge Elastmetall Hebron Kentucky United States CERT-03170-2004-AE-HOU-RAB
More YKK AP America, Inc. Dublin Georgia United States CERT-05094-2003-AE-HOU-RAB
More Yamaha Motor Mfg. Corp. of America Newnan Georgia United States CERT-03164-2003-AE-HOU-RAB
More Xycom Automation Saline Michigan United States CERT-07076-2004-AE-HOU-RAB
More Wiltech of Florida Corp., Inc. Kennedy Space Center Florida United States CERT-05042-2003-AE-HOU-RAB
More Weir Floway, Inc. Fresno California United States CERT-07585-2004-AE-HOU-ANABR1
More Weastec, Inc. Greenfield Ohio United States CERT-05010-2004-AE-HOU-ANAB, R1
More Weastec, Inc. Hillsboro Ohio United States CERT-03152-2004-AE-HOU-ANAB, R1
More Weastec, Inc. Seaman Ohio United States CERT-05011-2004-AE-HOU-ANABR1
More Wartsila North America Ft.Lauderdale Florida United States CERT-10118-2005-AE-HOU-ANAB
More Wartsila North America Harvey Louisiana United States CERT-06437-2004-AE-HOU-ANABR2
More Wabash Technologies - Huntington Huntington Indiana United States CERT-04595-2003-AE-HOU-RABR1
More VITRUS Pawtucket Rhode Island United States CERT-07357-2004-AE-HOU-RAB
More UT MD Anderson Bastrop Texas United States CERT-05001-2004-AE-HOU-RAB
More Vishay Siliconix Santa Clara California United States CERT-03259-2004-AE-HOU-RAB
More Veolia Water North America - Cerntral Pontiac Michigan United States CERT-05785-2003-AE-HOU-ANAB
More UT MD Anderson Houston Texas United States CERT-03592-2004-AE-HOU-RAB
More Tyco Electronics M/A-Com, Inc. Lowell Massachusetts United States CERT-04000-2005-AQ-HOU-ANAB
More Trigen/Cinergy Solutions Lansing Michigan United States CERT-04052-2005-AE-HOU-ANAB
More Trefilarbed Arkansas, Inc. Pine Bluff Arkansas United States CERT-10661-2005-AE-HOU-ANAB
More Transition Networks, Inc. Eden Prairie Minnesota United States CERT-02683-2003-AE-HOU-RAB
More Toyota North American Parts Center - KY Hebron Kentucky United States CERT-06550-2004-AE-HOU-RAB
More Trace Die Cast, Inc. Bowling Green Kentucky United States CERT-06234-2003-AE-HOU-RAB
More Toyota Motor Sales, U.S.A., Inc. Ontario California United States CERT-04245-2005-AE-HOU-ANAB
More Toyota Motor Sales, USA West Caldwell New Jersey United States CERT-06180-2003-AE-HOU-ANAB
More Toyota Motor Sales, U.S.A., Inc. Torrance California United States CERT-04246-2005-AE-HOU-ANAB
More Toyota Motor Sales USA, Inc. San Ramon California United States CERT-03294-2004-AE-HOU-RAB
More Toyota Motor Sales U.S.A., Inc. Cincinnati Ohio United States CERT-03419-2004-AE-HOU-ANAB, R1
More Toyota Motor Sales U.S.A., Inc. Mansfield Massachusetts United States CERT-02611-2003-AE-HOU-RAB
More Toyota Motor Sales U.S.A., Inc. Aurora Illinois United States CERT-06876-2004-AE-HOU-RAB

Re^3: Strip PHP page
by marto (Cardinal) on Aug 06, 2007 at 18:13 UTC
    Is the table you speak of in a frame (or an iframe), or is it written out by JavaScript (document.write or the like)? Since you gave us neither the URL nor the complete HTML (in readmore tags), it is difficult to help. I am guessing you are using WWW::Mechanize to get this page, but because you did not tell us exactly what you are doing, you have not made it easy for people to help you. See the PerlMonks FAQ and How do I post a question effectively?

    Martin
      Sorry, Martin, if my question was unclear. I am using WWW::Mechanize to get the contents of the website and then HTML::Strip to get rid of the HTML tags. The URL is
      http://www.whosregistered.com/iso/form.php
      The only option I want to specify is the country, United States. Run the search and the results appear in the table. Thanks again for your help.

        The info you appear to be seeking is all contained between the (nested) <table class="searchbox"> and the next-following </table> (in each page, of which there are many; iteration from page to page is left as an exercise for the student).
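A minimal sketch of capturing just that table, using only core Perl. The `$html` sample and the `extract_searchbox` helper are hypothetical stand-ins, not the site's real markup. The non-greedy match stops at the first `</table>` after the opening tag, so it assumes no further table sits inside the searchbox table. A proper HTML parser would be more robust if that assumption fails.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one fetched results page; the real markup may differ.
my $html = <<'HTML';
<table><tr><td>
<table class="searchbox">
<tr><td></td><td>Company</td><td>City</td></tr>
<tr><td><a href="#">More</a></td><td>Xycom Automation</td><td>Saline</td></tr>
</table>
</td></tr></table>
HTML

# Non-greedy match from the opening tag to the first </table> that follows.
# Assumes no further table is nested inside the searchbox table.
sub extract_searchbox {
    my ($page) = @_;
    return $page =~ m{(<table\s+class="searchbox">.*?</table>)}s ? $1 : undef;
}

my $table = extract_searchbox($html);
print defined $table ? "$table\n" : "no searchbox table found\n";
```

In a real run, `$html` would be the page content returned by WWW::Mechanize, and the captured fragment would then be handed to HTML::Strip.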

        The only PHP in that table is in the first non-empty cell of each row, starting with the second row [1], to wit:

        <td><a href="./form.php?stage=3&search_total=http ... stage%3D2&connector_id=1018">More</a></td>

        [1] OT: Row one really should be in a <thead><tr><th>...</th></tr></thead> construct.

        Your sample output suggests that you don't need the detail behind the PHP links noted above, so there's little reason to do more than capture that table (in each of the roughly 1000 pages, each listing approx 30 US companies with relevant ISO certificates) and strip the HTML from each. You'll have a column with the word 'More' beginning each row, but that's scarcely the end of the world and is simply cured, after the HTML cleanup, with a substitution (something not much more complex than s/^\tMore\t//; perhaps? UNtested, as even the source view of your sample does NOT tell me exactly what's at the start of each row, before and after 'More').
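A hedged sketch of that cleanup step. Since the exact whitespace around 'More' is unknown, this variation uses `\s` rather than literal tabs. The `drop_more` helper and the tab-separated layout of the first sample row are assumptions for illustration, and the row text is borrowed from the sample output earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical rows as they might look after tag stripping; the real
# whitespace around 'More' is unknown, so the pattern tolerates any.
my @rows = (
    "\tMore\tXycom Automation\tSaline\tMichigan\tUnited States\tCERT-07076-2004-AE-HOU-RAB",
    "More Weir Floway, Inc. Fresno California United States CERT-07585-2004-AE-HOU-ANABR1",
);

# Remove a leading 'More' plus the whitespace around it. The trailing \s+
# keeps names that merely start with those letters (e.g. 'Moreland') intact.
sub drop_more {
    my ($line) = @_;
    $line =~ s/^\s*More\s+//;
    return $line;
}

print drop_more($_), "\n" for @rows;
```

If the real rows do turn out to begin with tab, 'More', tab, the stricter s/^\tMore\t// from the post would work just as well.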