The info you appear to be seeking is all contained between the (nested) <table class="searchbox"> and the next-following </table> (in each page, of which there are many; iteration from page to page is left as an exercise for the student).

The only PHP in that table is in the first non-empty cell of each row, starting with the second[1] row, to wit:

<td><a href="./form.php?stage=3&search_total=http ... stage%3D2&connector_id=1018">More</a></td>

[1] OT: Row one really should be in a <thead><tr><th>...</th></tr></thead> construct.

Your sample output suggests that you don't need the detail produced by the PHP links noted above, so there's little reason to do more than capture that one table (in each of the roughly 1000 pages, each listing approx 30 US companies with relevant ISO certificates) and, within each, strip the HTML. You'll have a column with the word 'More' beginning each row, but that's scarcely the end of the world and is simply cured with a substitution after the HTML cleanup (something not much more complex than s/^\tMore\t//; perhaps? UNtested, as even the source view of your sample does NOT tell me exactly what's at the start of your row, before and after 'More').
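
For what it's worth, here is a minimal sketch of that approach for a single page, core Perl only. The sub name searchbox_rows and the sample HTML are made up for illustration, and the sketch assumes the 'More' link sits in the first non-empty cell, as described above. Regexes are fragile against real-world HTML, so a proper parser is likely the safer choice if the markup varies from page to page.

```perl
use strict;
use warnings;

# Hypothetical helper: return the rows of the first <table class="searchbox">
# as tab-separated lines of plain text.
sub searchbox_rows {
    my ($html) = @_;

    # Capture everything up to the next-following </table>.
    my ($table) = $html =~ m{<table\s+class="searchbox"[^>]*>(.*?)</table>}is
        or return;

    my @rows;
    for my $row ( $table =~ m{<tr\b[^>]*>(.*?)</tr>}gis ) {
        my @cells = $row =~ m{<t[dh]\b[^>]*>(.*?)</t[dh]>}gis;
        for (@cells) {
            s/<[^>]+>//g;       # strip remaining tags, e.g. the <a href=...>
            s/^\s+|\s+$//g;     # trim surrounding whitespace
        }
        # Assumption: any empty cells come first, then the 'More' link.
        shift @cells while @cells && $cells[0] eq '';
        shift @cells if @cells && $cells[0] eq 'More';
        push @rows, join "\t", @cells;
    }
    return @rows;
}

# Made-up sample standing in for one downloaded page.
my $sample = <<'HTML';
<table><tr><td>
<table class="searchbox">
<tr><td></td><td>Company</td><td>City</td></tr>
<tr><td><a href="./form.php?stage=3">More</a></td><td>Acme Corp</td><td>Boston</td></tr>
</table>
</td></tr></table>
HTML

print "$_\n" for searchbox_rows($sample);
```

Dropping the cell before the join, rather than running the substitution on the joined line, sidesteps the question of exactly what precedes 'More' in each row.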


In reply to Re^5: Strip PHP page by ww
in thread Strip PHP page by bauer1sc
