Not to repeat too much of the above, there are many answers to this question. One approach that works if the page you are pulling from is reliably valid HTML is to download the page and use XML::Twig to parse it, or only specific nodes within it using twig_roots. I recently completed some work like this. The one snag is that if the page you pull is not valid HTML, the parser can choke; I solved the majority of these errors by running the page through Tidy first.
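As a rough illustration of the twig_roots idea, here is a minimal sketch, assuming the parser is Perl's XML::Twig (the module that provides the `twig_roots` option). The inline `$page` string, the choice of `a` elements, and the `@links` array are made up for the example; in practice `$page` would hold the downloaded page, already cleaned up if necessary.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical stand-in for a downloaded page. It has to be well-formed,
# otherwise the underlying XML parser will reject it.
my $page = <<'HTML';
<html>
  <head><title>Example</title></head>
  <body>
    <p>Intro text that is never built into the tree.</p>
    <div class="item"><a href="/first">First</a></div>
    <div class="item"><a href="/second">Second</a></div>
  </body>
</html>
HTML

my @links;

my $twig = XML::Twig->new(
    # Only <a> elements are built into the tree; everything else is skipped.
    twig_roots => {
        'a' => sub {
            my ($t, $elt) = @_;
            push @links, { href => $elt->att('href'), text => $elt->text };
            $t->purge;    # free the memory used by what has been handled
        },
    },
);

# safe_parse wraps the parse in an eval: it returns false on failure
# and leaves the error message in $@ instead of dying.
if ( $twig->safe_parse($page) ) {
    print "$_->{href}\t$_->{text}\n" for @links;
}
else {
    warn "Could not parse page: $@";
}
```

Using `safe_parse` rather than `parse` means a malformed page produces a warning instead of killing the script, which is the point at which a clean-up pass with Tidy would come in.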