in reply to Re^2: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!)
in thread how to quickly parse 50000 html documents?
Well, the nested tables are awkward and the use of various outdated or deprecated tags is unfortunate; the lack of quotes and the like can certainly be labeled "mistakes." But "appalling" is a pretty strong word. Perhaps "dated" or similar would be better.
"...so bad as to be practically of no use."
Even harsher (and, IMO, excessive), particularly since what we know about the HTML fails to support any inference that the OP bears any responsibility for it.
There is, however, a valuable nugget that saves your post from a quick downvote -- the notion that future changes could break a regex solution. OTOH, any solution we can readily offer today would also be broken were the HTML converted to 100% compliant XML.
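To make that fragility concrete, here is a minimal sketch in Python (chosen only for illustration). The pattern, the function name and both sample strings are made up for this example, since the real pages are not shown here. It assumes the old pages use uppercase tags, unquoted attributes and unclosed cells, and that a cleaned-up version would use lowercase tags and quoted attributes.

```python
import re

# Hypothetical pattern written against the old-style markup:
# uppercase tag name, unquoted attribute value, no closing </TD>.
OLD_STYLE_CELL = re.compile(r"<TD WIDTH=(\d+)>([^<]*)")


def extract_cells(html):
    """Return (width, text) pairs for every cell the pattern recognises."""
    return OLD_STYLE_CELL.findall(html)


# Made-up samples: the same row before and after a markup clean-up.
legacy = "<TR><TD WIDTH=40>alpha<TD WIDTH=60>beta</TR>"
cleaned = '<tr><td width="40">alpha</td><td width="60">beta</td></tr>'

legacy_cells = extract_cells(legacy)    # both cells are found
cleaned_cells = extract_cells(cleaned)  # nothing matches any more

print(legacy_cells)
print(cleaned_cells)
```

The pattern fails silently on the cleaned markup: it raises no error and simply returns an empty list, which is why a change like this can go unnoticed in a 50,000-page batch unless the result counts are checked.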
Re^4: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!)
by aquarium (Curate) on Nov 28, 2010 at 23:18 UTC