nysus has asked for the wisdom of the Perl Monks concerning the following question:

Let's say I have the following html that is an outline using header tags:

<h1>blah</h1>
<p>blah<p>
<h2>blah</h2>
<p>blah</p>
<h3>blah</h3>
<p>blah</p>
<h2>blah blah blah</h2>
<p>blah</p>
<h1>blah</h2>
<h2>blah</h2>
<h2>blah</h2>

I want to convert that to this, wrapping each h2 with a div:

<h1>blah</h1>
<p>blah<p>
<div>
<h2>blah</h2>
</div>
<div>
<p>blah</p>
<h3>blah</h3>
<p>blah</p>
</div>
<div>
<h2>blah blah blah</h2>
<p>blah</p>
</div>
<h1>blah</h1>
<div>
<h2>blah</h2>
</div>
<div>
<h2>blah</h2>
</div>
In other words, the algorithm must search for all h2 tags, then find the next sibling tag that is not an h1 tag or EOF, and wrap that section in a div tag. HTML::Element and HTML::TreeBuilder seem like the right tools for the job, but this problem seems so common that I'm wondering whether there is something that works out of the box for HTML that is set up like an outline.
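
One possible reading of the rule above, sketched in core Perl only: a section starts at an h2 and runs until the next h1 or h2, or the end of input. The `@elements` list of `[tag, html]` pairs is a hypothetical stand-in for whatever a real HTML parser would produce; it is not part of the original question, and the sample content is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical pre-tokenized input: one [tag, html] pair per top-level element.
my @elements = (
    [ h1 => '<h1>a</h1>' ], [ p => '<p>b</p>' ],
    [ h2 => '<h2>c</h2>' ], [ p => '<p>d</p>' ],
    [ h3 => '<h3>e</h3>' ], [ p => '<p>f</p>' ],
    [ h2 => '<h2>g</h2>' ], [ p => '<p>h</p>' ],
    [ h1 => '<h1>i</h1>' ],
    [ h2 => '<h2>j</h2>' ], [ h2 => '<h2>k</h2>' ],
);

my @out;
my $in_section = 0;    # true while a <div> opened by an <h2> is still open

for my $el (@elements) {
    my ($tag, $html) = @$el;

    # An h1 or h2 ends the section that is currently open, if any.
    if ($in_section && ($tag eq 'h1' || $tag eq 'h2')) {
        push @out, '</div>';
        $in_section = 0;
    }

    # Every h2 opens a new section.
    if ($tag eq 'h2') {
        push @out, '<div>';
        $in_section = 1;
    }

    push @out, $html;
}

# End of input closes a section that is still open.
push @out, '</div>' if $in_section;

my $result = join "\n", @out;
print "$result\n";
```

Under this reading the h3 and both paragraphs after the first h2 end up in the same div as that h2, which differs from the expected output shown above.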

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
$nysus = $PM . ' ' . $MCF;
Click here if you love Perl Monks

Replies are listed 'Best First'.
Re: Wrapping HTML "sections" with a div
by haukex (Archbishop) on Oct 25, 2018 at 21:21 UTC

    Could you provide some input HTML that's valid, and explain by what rules the <h3> got wrapped up in a <div> along with the two <p>'s, instead of the <p> preceding the <h3> being wrapped along with the preceding <h2>?

    Anyway, I agree that Mojo::DOM is nice. Maybe this is something to get started with:

    use warnings;
    use strict;
    use Mojo::DOM;

    my $html = "<h1>a</h1>\n<p>b</p>\n<h2>c</h2>\n<p>d</p>\n"
        ."<h3>e</h3>\n<p>f</p>\n<h2>g</h2>\n<p>h</p>\n<h2>i</h2>\n"
        ."<h1>j</h1>\n<h2>k</h2>\n";

    my $dom = Mojo::DOM->new($html);
    $dom->find('h2')->each(sub {
        my $next = $_->next;
        my $new = $_->wrap('<div></div>');
        if ( $next && !$next->matches('h1') ) {
            $new->append($next);
            $next->remove;
        }
    });
    print "$dom";
Re: Wrapping HTML "sections" with a div
by marto (Cardinal) on Oct 25, 2018 at 18:19 UTC

    Mojo::DOM makes things like this pretty simple.

    Update: fixed broken link, thanks Lotus1.
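
As an illustration of the Mojo::DOM route, here is a minimal sketch that wraps each h2 together with everything that follows it up to the next h1 or h2. It assumes that reading of the requirement and valid input; the sample HTML is made up for the example. It relies on `following_nodes`, `type`, `tag`, `remove`, and `replace` from Mojo::DOM.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up, valid sample input (an assumption, not the original poster's data).
my $html = join "\n",
    '<h1>a</h1>', '<p>b</p>',
    '<h2>c</h2>', '<p>d</p>', '<h3>e</h3>', '<p>f</p>',
    '<h2>g</h2>', '<p>h</p>',
    '<h1>i</h1>',
    '<h2>j</h2>', '<h2>k</h2>', '';

my $dom = Mojo::DOM->new($html);

for my $h2 ($dom->find('h2')->each) {

    # Collect sibling nodes (text included) up to the next h1 or h2.
    my @section;
    for my $node ($h2->following_nodes->each) {
        last if $node->type eq 'tag' && $node->tag =~ /\A h[12] \z/x;
        push @section, $node;
    }

    # Mojo::DOM objects stringify to their markup.
    my $inner = join '', $h2, @section;

    # Drop the collected nodes, then swap the h2 for the wrapped section.
    $_->remove for @section;
    $h2->replace("<div>$inner</div>");
}

my $out = $dom->to_string;
print $out;
```

Building the replacement as a string keeps the sketch short; the whitespace between elements is moved into the div along with them.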

Re: Wrapping HTML "sections" with a div
by choroba (Cardinal) on Oct 26, 2018 at 11:15 UTC
    Had you provided valid input and output matching the algorithm, I could have given you an xsh answer, something like
    open :F html input.html ;
    for my $h2 in //h2 {
        my $h1 = $h2/following-sibling::h1[1] ;
        my $chunk ;
        if $h1 {
            $chunk = $h2/following-sibling::text()[following-sibling::h1[1]=$h1][preceding-sibling::h2[1]=$h2]
                   | $h2/following-sibling::*[not(self::h2)][following-sibling::h1[1]=$h1][preceding-sibling::h2[1]=$h2] ;
        } else {
            $chunk = $h2/following-sibling::text()[preceding-sibling::h2[1]=$h2]
                   | $h2/following-sibling::*[not(self::h2)][preceding-sibling::h2[1]=$h2] ;
        }
        if $chunk {
            my $div := insert element div after $h2 ;
            xmove $chunk into $div ;
        }
    }
    my $divs := wrap div //h2 ;
    save :F html :f output.html ;
    It gives the following output, which is different to the expected one, but I'm not sure all the differences are wrong:
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Wrapping HTML "sections" with a div
by tangent (Parson) on Oct 26, 2018 at 15:38 UTC
    Here's a way to do it by subclassing HTML::Parser. Even though it is not well documented, I like the fine-grained control this technique allows.
    use strict;
    use warnings;

    my $html = q|
    <h1>blah</h1>
    <p>blah<p>
    <h2>blah</h2>
    <p>blah</p>
    <h3>blah</h3>
    <p>blah</p>
    <h2>blah blah blah</h2>
    <p>blah</p>
    <h1>blah</h1>
    <h2>blah</h2>
    <h2>blah</h2>|;

    my $parser = Markdent_Parser->new();
    $parser->parse($html);
    $parser->eof;
    print $parser->out;

    package Markdent_Parser;
    use parent qw(HTML::Parser);

    sub start {
        my ($self,$tag,$attr,$attrseq,$text) = @_;
        if ($tag eq 'h1' and $self->{'in_h2'}) {
            $self->{'out'} .= "</div>\n\n";
            $self->{'in_h2'} = 0;
        }
        elsif ($tag eq 'h2') {
            if ($self->{'in_h2'}) {
                $self->{'out'} .= "</div>\n";
            }
            $self->{'out'} .= "\n<div>\n";
            $self->{'in_h2'} = 1;
        }
        $self->{'out'} .= $text;
    }
    sub text {
        my ($self,$text) = @_;
        $self->{'out'} .= $text;
    }
    sub end {
        my ($self,$tag,$text) = @_;
        $self->{'out'} .= $text;
    }
    sub out {
        my ($self) = @_;
        if ($self->{'in_h2'}) {
            $self->{'out'} .= "\n</div>";
        }
        return $self->{'out'};
    }
    1;
    The output is as described by your algorithm rather than as shown by your example (I fixed the typo in your input also).
    <h1>blah</h1>
    <p>blah<p>
    <div>
    <h2>blah</h2>
    <p>blah</p>
    <h3>blah</h3>
    <p>blah</p>
    </div>
    <div>
    <h2>blah blah blah</h2>
    <p>blah</p>
    </div>
    <h1>blah</h1>
    <div>
    <h2>blah</h2>
    </div>
    <div>
    <h2>blah</h2>
    </div>
Re: Wrapping HTML "sections" with a div
by LanX (Saint) on Oct 25, 2018 at 17:56 UTC
    What did you try?

    If you don't want specialized modules, you can split on empty lines using $/ and regex-match an <h2> at the start.

    You should know how to do it.

    update

    Nope, the example input was misleading.

    My suggestion is: use an HTML module.

    Cheers Rolf
    (addicted to the Perl Programming Language :)

      I got this far before deciding I should see if I'm re-inventing the wheel:

      sub wrap {
          my $tree = HTML::TreeBuilder->new_from_content(shift);
          foreach my $h2 ( $tree->find_by_tag_name('h2') ) {
              my $parent = HTML::Element->new('div', 'data-role' => 'collapsible');
              $h2->replace_with($parent);
              $parent->push_content($h2);
          }
          return $tree->as_HTML;
      }

      The above only wraps the <h2> tag, not its associated content. The HTML is generated from Markdown using Markdent, so I think the proper thing to do is to figure out how that works. It seems like it can do the job.
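
For the HTML::TreeBuilder route, one way the attempt above might be extended to take the associated content along is sketched here. It assumes a section ends at the next h1 or h2 sibling. The name `wrap_sections` and the sample input are hypothetical; `right`, `pindex`, `splice_content`, and `push_content` are HTML::Element methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # subclasses HTML::Element, so that is loaded too

# Hypothetical helper: wrap each h2 plus its following siblings in a div.
sub wrap_sections {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    for my $h2 ( $tree->find_by_tag_name('h2') ) {

        # Right-hand siblings up to the next h1/h2; plain text segments
        # are unblessed strings, hence the ref() check.
        my @section;
        for my $sib ( $h2->right ) {
            last if ref $sib && $sib->tag =~ /\A h[12] \z/x;
            push @section, $sib;
        }

        # Replace the h2 and its section with the div, then refill the div.
        my $div = HTML::Element->new( 'div', 'data-role' => 'collapsible' );
        $h2->parent->splice_content( $h2->pindex, 1 + @section, $div );
        $div->push_content( $h2, @section );
    }

    # The empty hashref asks as_HTML to keep optional end tags such as </p>.
    my $out = $tree->as_HTML( undef, undef, {} );
    $tree->delete;
    return $out;
}

print wrap_sections(
    '<h1>a</h1><p>b</p><h2>c</h2><p>d</p><h3>e</h3><p>f</p>'
  . '<h2>g</h2><p>h</p><h1>i</h1><h2>j</h2><h2>k</h2>'
), "\n";
```

Note that TreeBuilder returns a complete document, so the result is wrapped in html, head, and body elements.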
