in reply to (crazyinsomniac) Re: Extract info from HTML
in thread Extract info from HTML
Re: Re: (crazyinsomniac) Re: Extract info from HTML
by demerphq (Chancellor) on Nov 12, 2001 at 15:01 UTC
Mine will extract all the above information; just change the following lines. Then you can extract whatever you want. Note that the depths are as follows: 9 root node, 12 reply, 13 reply to a reply...

But a thought: you don't want the posts from just a fixed depth in the parse tree. That would, for instance, eliminate you from the list (you don't have a reply to yourself), as well as anyone who explained their name in a reply to another person's explanation. merphq would be an example; however, I believe there are more as well.

Actually, one of the more interesting issues with this thread was accurately picking up all names from all levels. There is an annoying habit of <UL> tags messing up the pattern, and also of the main post being marked up differently. Anyway, I'll revisit this a bit later. :-)
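To illustrate the fixed-depth problem, here is a minimal sketch, written in Python with the standard `html.parser` module rather than the Perl used in this thread. The sample markup, the names `DepthRecorder` and `font_depths`, and the small depth values are all made up for illustration; they are not the real page structure or the depths 9/12/13 quoted above.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class DepthRecorder(HTMLParser):
    """Record (depth, tag) for every opening tag that is seen."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def font_depths(html):
    """Return the nesting depth at which each <font> tag opens."""
    recorder = DepthRecorder()
    recorder.feed(html)
    recorder.close()
    return [depth for depth, tag in recorder.seen if tag == "font"]


# Hypothetical markup: a first-level reply, then a reply wrapped in <ul>.
SAMPLE = (
    "<table><tr><td><font>reply by alice</font></td></tr></table>"
    "<ul><table><tr><td><font>reply to a reply by bob</font></td></tr></table></ul>"
)

depths = font_depths(SAMPLE)
only_fixed_depth = [d for d in depths if d == depths[0]]
print(depths, only_fixed_depth)
```

The two bylines open at different depths, so a filter that keeps only one depth drops the nested reply. An unclosed <UL> would shift every later depth as well, which is the kind of drift described above.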
Yves / DeMerphq
by George_Sherston (Vicar) on Nov 12, 2001 at 15:09 UTC
§ George Sherston
Re: Re: (crazyinsomniac) Re: Extract info from HTML
by crazyinsomniac (Prior) on Nov 12, 2001 at 15:27 UTC
and the output

japhy| tilly| ichimunki| runrig| demerphq| shotgunefx| Masem| synapse0| agent00013| MrNobo1024| Corion| Zaxo| idnopheq| dragonchild| herveus| wine| TheoPetersen| toadi| dga| mexnix| cadfael| buckaduck| ybiC| {NULE}| theorbtwo| Jouke| gregor42| Guildenstern| sifukurt| CubicSpline| jackdied| suaveant| poqui| mikeB| davis| s173451000| PotPieMan| mr_mischief| earthboundmisfit| kwoff| Arguile| chaoticset| BrentDax| Aighearach| basicdez| brianarn| BooK| riffraff| seanbo| Maestro_007| stefan k| dthacker| Hero Zzyzzx| beretboy| Veachian64| giulienk| blakem| Chmrr|

or sorted

Aighearach| Arguile| BooK| BrentDax| Chmrr| Corion| CubicSpline| Guildenstern| Hero Zzyzzx| Jouke| Maestro_007| Masem| MrNobo1024| PotPieMan| TheoPetersen| Veachian64| Zaxo| agent00013| basicdez| beretboy| blakem| brianarn| buckaduck| cadfael| chaoticset| davis| demerphq| dga| dragonchild| dthacker| earthboundmisfit| giulienk| gregor42| herveus| ichimunki| idnopheq| jackdied| japhy| kwoff| mexnix| mikeB| mr_mischief| poqui| riffraff| runrig| s173451000| seanbo| shotgunefx| sifukurt| stefan k| suaveant| synapse0| theorbtwo| tilly| toadi| wine| ybiC| {NULE}|

update: riight, but like I said, i'm deliberately matching only replies of depth 1, which all do conform (only 2nd level replies got the ul bug, and If i was parsing them, I'd just have the improper html in there regardless). I saw what you did ;D
by demerphq (Chancellor) on Nov 12, 2001 at 16:07 UTC
Note the buggy HTML? :-) So what I did was look for the content of the FONT tag. If it matches a 'fingerprint' for one of the following two, then I do a few more checks to make sure it isn't a spurious match; if they pass, I consider it the title/author/date of the node. A bit of extraction of the tags' attributes and presto, we have the home node and post node IDs. (With the exception of the main post, where we can only extract the title, not the ID.) This would be sooooo much easier if there were class attributes in the tags, such as <TD class="post">, but considering the buggy HTML, I suppose class attributes are low on the priority list. (BTW, can't wait to join the PM dev team; I'd like to have a crack at cleaning up some of the HTML, now that I'm getting into parsing it :-)
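The fingerprint idea can be sketched roughly as follows. This is an illustration in Python (standard `html.parser` and `re`), not the Perl code from this post. The byline pattern is modelled on the "by NAME (RANK) on DATE at TIME UTC" lines visible in this thread; the `node_id=` link format, the sample markup, and the names `FontCollector` and `extract_posts` are assumptions. The extra checks against spurious matches are left out.

```python
import re
from html.parser import HTMLParser

# Assumed fingerprint, e.g. "by demerphq (Chancellor) on Nov 12, 2001 at 15:01 UTC".
BYLINE = re.compile(
    r"\bby\s+(?P<author>.+?)\s+\((?P<rank>[^)]+)\)\s+on\s+"
    r"(?P<date>[A-Z][a-z]{2} \d{1,2}, \d{4}) at (?P<time>\d{2}:\d{2}) UTC"
)
NODE_ID = re.compile(r"node_id=(\d+)")  # assumed link format, not confirmed


class FontCollector(HTMLParser):
    """Collect the text and link targets found inside each outermost <font>."""

    def __init__(self):
        super().__init__()
        self.font_level = 0
        self.current = None
        self.fonts = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            if self.font_level == 0:
                self.current = {"text": [], "hrefs": []}
            self.font_level += 1
        elif self.font_level > 0:
            if tag == "br":
                self.current["text"].append(" ")  # treat a line break as a space
            elif tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.current["hrefs"].append(href)

    def handle_endtag(self, tag):
        if tag == "font" and self.font_level > 0:
            self.font_level -= 1
            if self.font_level == 0:
                self.fonts.append(self.current)
                self.current = None

    def handle_data(self, data):
        if self.font_level > 0:
            self.current["text"].append(data)


def extract_posts(html):
    """Return title/author/date/node ids for every <font> matching the byline."""
    collector = FontCollector()
    collector.feed(html)
    collector.close()
    posts = []
    for font in collector.fonts:
        text = " ".join("".join(font["text"]).split())  # normalise whitespace
        match = BYLINE.search(text)
        if not match:
            continue  # not a title/author/date block
        ids = []
        for href in font["hrefs"]:
            found = NODE_ID.search(href)
            if found:
                ids.append(found.group(1))
        posts.append({
            "title": text[:match.start()].strip(),
            "author": match.group("author"),
            "date": match.group("date"),
            "time": match.group("time"),
            "node_ids": ids,
        })
    return posts


# Hypothetical markup; the real page structure may differ.
SAMPLE = (
    '<td><font size="2"><a href="/index.pl?node_id=111">Re: Extract info from HTML</a><br>'
    'by <a href="/index.pl?node_id=222">demerphq</a> (Chancellor) '
    'on Nov 12, 2001 at 15:01 UTC</font></td>'
    '<td><font size="2">Just some small print</font></td>'
)

for post in extract_posts(SAMPLE):
    print(post)
```

Matching on the text content rather than on tree position means a stray <UL> does not matter, at the cost of needing further checks when ordinary post text happens to look like a byline.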
Yves / DeMerphq