Hello fellow Monks!

I need to write a simple notifier to stay up to date with replies to certain threads.
I am aware of WebService::Vichan, but I have already started on an HTML::TokeParser::Simple approach, which I don't want to abandon yet.
I believe that by using simpler tools I will learn more, and the whole code will be more efficient than using Super::Duper::Module->do_everything(\$data);

Given the partial document below:

<div class="post reply body-not-empty" id="reply_8735435">
  (cut out for visibility)
  <p class="body-line ltr ">The first 3 lines were 15% bait power, but then it fell to mere 5% and the last lines are literally 0%, try again in a few days.</p>
</div>
(cut out for visibility)
<div class="post reply body-not-empty" id="reply_8735439">
  (cut out for visibility)
  <div class="body" >
    <p class="body-line ltr ">
      <a onclick="highlightReply('8735417', event);" href="/b/res/8735417.html#8735417">&gt;&gt;8735417</a>
    </p>
    <p class="body-line ltr quote">&gt;Reddit is a great place for discourse and there are many active subreddits where field professionals regularly answer questions on issues of health, science, engineering, etc</p>
    <p class="body-line ltr ">Yeah, as far as content goes, Reddit kicks 8chan's ass. They have some great boards for serious academic discussion.</p>
    <p class="body-line empty ">

I want to iterate over the divs whose id looks like "reply_xxx" and, once one is found, descend into it to rip out the whole div with class "body".
Then proceed to the next reply-like div, and so on until EOF.
Simple? Nope :P

The issue I am running into is the existence of TokeParser's cursor thingy, a state indicator that internally "knows" where in the document the parser currently is.
Using this

use HTML::TokeParser::Simple;

# $data holds the fetched HTML of the thread
my $parser = HTML::TokeParser::Simple->new(\$data);

while (my $div = $parser->get_tag('div', '/div')) {
    my $id = $div->get_attr('id');
    next unless (defined $id and $id =~ /reply/);

    # here the cursor is inside the reply tag,
    # so I iterate deeper
    while (my $inner_div = $parser->get_tag('div', '/div')) {
        my $inner_class = $inner_div->get_attr('class');
        next unless (defined $inner_class and $inner_class eq 'body');

        # print "div.$id > div.$inner_class \n";
        my $text = $parser->get_text;
        print "$id: '$text' \n";
        # print $id ." ";
    }
}

gives a result where only the first ID is matched and the inner while loop iterates over the bodies of all replies until EOF.
Obviously that's not what I am after :)
My first thought was to isolate the rest of the HTML document after matching a "reply" id, run the inner while loop until the first closing div, then feed the outer while loop the not-yet-consumed part of the document, and repeat until the actual EOF.

As you can see, that seems inefficient at first glance.

How would Monks handle this task? By "rewinding" the internal cursor using the unget_token method? TokeParser is not a must; I am open to other solutions, but it's welcome.
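
One alternative I can imagine, as a rough sketch rather than a tested solution: keep a single pass with get_token and count div nesting depth, so that the parser's one cursor is enough and nothing has to be rewound. The sub name reply_bodies and the sample $html are made up for illustration. The sketch assumes that reply ids look like reply_<digits>, that the body div has class exactly 'body', and that div tags in the real page are balanced.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: returns a list of { id => ..., text => ... } hashes,
# one per reply div, collected in a single pass over the document.
sub reply_bodies {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);

    my @replies;
    my $depth = 0;                            # current <div> nesting level
    my ($current, $reply_depth, $body_depth);

    while (my $token = $parser->get_token) {
        if ($token->is_start_tag('div')) {
            $depth++;
            my $id    = $token->get_attr('id')    // '';
            my $class = $token->get_attr('class') // '';

            if (!defined $reply_depth && $id =~ /^reply_\d+$/) {
                # entering a reply div: remember how deep it sits
                $current     = { id => $id, text => '' };
                $reply_depth = $depth;
            }
            elsif (defined $reply_depth && !defined $body_depth && $class eq 'body') {
                # entering the body div of the current reply
                $body_depth = $depth;
            }
        }
        elsif ($token->is_end_tag('div')) {
            # this </div> closes the body div
            undef $body_depth if defined $body_depth && $depth == $body_depth;

            # this </div> closes the reply div itself
            if (defined $reply_depth && $depth == $reply_depth) {
                push @replies, $current;
                undef $current;
                undef $reply_depth;
            }
            $depth-- if $depth > 0;
        }
        elsif ($token->is_text && defined $body_depth) {
            # raw text inside the body; entities such as &gt; are left undecoded
            $current->{text} .= $token->as_is;
        }
    }

    # a truncated document may end before the last reply is closed
    push @replies, $current if defined $current;
    return @replies;
}

# Made-up sample input, much simpler than a real vichan page
my $html = '<div class="post reply" id="reply_1"><div class="body"><p>first</p></div></div>'
         . '<div class="post reply" id="reply_2"><div class="body"><p>second</p></div></div>';

for my $reply (reply_bodies($html)) {
    print "$reply->{id}: '$reply->{text}'\n";
}
```

Tracking the depth at which the reply and body divs were opened tells the loop which closing div ends which element, which is the piece of information the nested get_tag loops above do not have.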


In reply to best approach to parse vichan-style imageboard by lis128
