The info you appear to be seeking is all contained between the (nested) <table class="searchbox"> and the next-following </table> (in each page, of which there are many; iteration from page to page is left as an exercise for the student).
The only php in that table is in the first non-empty cell of each row, starting with the second[1] row, to wit:
<td><a href="./form.php?stage=3&search_total=http ... stage%3D2&connector_id=1018">More</a></td>
[1] OT: Row one really should be in a <thead><tr><th>....</th></tr></thead> construct.
Your sample output suggests that you don't need the detail produced by the php links noted above, so there's little reason to do more than capture that table (in each of the roughly 1000 pages, each listing approx. 30 US companies with relevant ISO certificates) and, within each, strip the html. You'll have a column with the word 'More' beginning each row, but that's scarcely the end of the world and is simply cured with a substitution after the html cleanup (something not much more complex than s/^\tMore\t//; perhaps? Untested, as even the source view of your sample does NOT tell me exactly what's at the start of your row, before and after 'More').
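For what it's worth, here is a minimal sketch of the capture-strip-substitute approach described above, using only core Perl. The sub name searchbox_rows and the sample markup are hypothetical, invented for illustration; your real pages will differ, and regex-based tag stripping is a crude stand-in for a proper HTML parser. It assumes the searchbox table itself contains no further nested tables, and it does not decode entities such as &amp;.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one tab-separated line per row of the
# searchbox table found in $html (empty list if no such table).
sub searchbox_rows {
    my ($html) = @_;

    # Capture from the searchbox table's opening tag to the next-following
    # </table>; the non-greedy match stops at the first closing tag.
    my ($table) = $html =~ m{<table\s+class="searchbox"[^>]*>(.*?)</table>}is
        or return;

    my @rows;
    while ( $table =~ m{<tr\b[^>]*>(.*?)</tr>}gis ) {
        my $row = $1;
        my @cells;
        while ( $row =~ m{<t[dh]\b[^>]*>(.*?)</t[dh]>}gis ) {
            my $cell = $1;
            $cell =~ s/<[^>]+>//g;      # crude removal of markup inside the cell
            $cell =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
            push @cells, $cell;
        }
        my $line = join "\t", @cells;
        $line =~ s/^\t*More\t//;        # drop the 'More' link text and any empty cells before it
        push @rows, $line;
    }
    return @rows;
}

# Made-up sample page fragment, NOT taken from the real site.
my $sample = <<'HTML';
<table><tr><td>
<table class="searchbox">
<tr><td></td><td>Link</td><td>Company</td><td>State</td></tr>
<tr><td></td><td><a href="./form.php?stage=3">More</a></td><td>Acme Corp</td><td>OH</td></tr>
</table>
</td></tr></table>
HTML

print "$_\n" for searchbox_rows($sample);
```

The \t* in the substitution is a guess to cover empty cells before 'More'; adjust it once you know what really starts each row. Fetching the ~1000 pages and looping over them is still left to you.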
In reply to Re^5: Strip PHP page by ww, in thread Strip PHP page by bauer1sc