Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Parsing string with tags

by Dirk80 (Pilgrim)
on May 31, 2022 at 10:42 UTC ( [id://11144309]=perlquestion: print w/replies, xml ) Need Help??

Dirk80 has asked for the wisdom of the Perl Monks concerning the following question:

I have a string which can contain several tags and I have an input list of tag names. Content which is not between tags shall always be kept. Content within a tag shall only be kept if the tag is active, i.e. is in the given tag list. Attention: tags can be nested.

my $str = 'word1 <tag0> word2 <tag1>word3 word4</tag1> word5 </tag0> w +ord6 <tag2>word7 word8</tag2> word9 <tag3>word10</tag3> word11'; ## Examples: # @tags = ('tag0', 'tag1'): 'word1 word2 word3 word4 word5 word6 word9 + word11' # @tags = ('tag3'): 'word1 word6 word9 word10 word11' # @tags = ('tag1', 'tag2', 'tag3'): 'word1 word6 word7 word8 word9 wor +d10 word11'

Dependent on the given tags you can see the desired results in the commented examples above.

Now I ask you because I'm confused how to solve it. It is neither XML nor is it HTML. It's a string with tags. Would you recommend regular expressions or any module from CPAN? It would be very kind of you if you could give me some advice. Thank you very much!

And it would be cool if the string would be denied if the tagging is invalid, e.g. if a start tag has no end tag or something like that. But this would be bonus.

Replies are listed 'Best First'.
Re: Parsing string with tags
by haukex (Archbishop) on May 31, 2022 at 11:47 UTC

    You said "It is neither XML nor is it HTML", but at least the sample you showed is a valid XML document if parsed as a fragment, so that's what I did below. If your data is ever going to not be valid XML, you'd have to show an example of that. Mojo::DOM may still be able to parse it in that case while XML::LibXML is generally not very forgiving.

    use warnings; use strict; # ##### Version with Mojo::DOM ##### # nicer API but less strict on parsing (useful for HTML) use Mojo::DOM; sub strip_tags_mojo { my ($str, $activetags) = @_; my %active = map {($_=>1)} @$activetags; my $dom = Mojo::DOM->new->xml(1)->parse($str); $dom->find('*')->grep(sub{ not $active{$_->tag} })->map('remove'); my $text = $dom->all_text; $text =~ s/\s+/ /g; return $text; } # ##### Version with XML::LibXML ##### # generally a more verbose API but strict XML-compliant parsing # (in the following I've tried to shorten the code quite a bit) use XML::LibXML; sub strip_tags_xml { my ($str, $activetags) = @_; my %active = map {($_=>1)} @$activetags; my $dom = XML::LibXML->new->parse_balanced_chunk($str); $active{$_->nodeName} or $_->parentNode->removeChild($_) for $dom->findnodes('*'); my $text = $dom->textContent; $text =~ s/\s+/ /g; return $text; } # ##### Tests ##### use Test::More; sub exception (&) { eval { shift->(); 1 } ? undef : ($@ || die) } # po +or man's Test::Fatal my $input = 'word1 <tag0> word2 <tag1>word3 word4</tag1> word5 </tag0> + word6 <tag2>word7 word8</tag2> word9 <tag3>word10</tag3> word11'; my @tests = ( { in=>['tag0', 'tag1'], exp=>'word1 word2 word3 word4 word5 word6 +word9 word11' }, { in=>['tag3'], exp=>'word1 word6 word9 word10 word11' }, { in=>['tag1', 'tag2', 'tag3'], exp=>'word1 word6 word7 word8 word +9 word10 word11' }, ); plan tests => (@tests+1)*2; for my $t (@tests) { is strip_tags_mojo($input, $t->{in}), $t->{exp}; is strip_tags_xml($input, $t->{in}), $t->{exp}; } ok not exception { strip_tags_mojo('foo <tag0> bar </tag1>') }; ok exception { strip_tags_xml('foo <tag0> bar') };

    Update 2: Removed "Update" tags from the first paragraph for readability.

      Great answer!! It solved my problem perfectly! I'll look deeper into it to be able to understand it well.

      Neither nor don’t know what the OP was lurking for. Or was it xor? I abandon all hope. Holy shit.

      «The Crux of the Biscuit is the Apostrophe»

Re: Parsing string with tags
by Discipulus (Canon) on May 31, 2022 at 11:02 UTC
    Hello Dirk80,

    might Text::Balanced be what yuo are looking for?

    PS

    use strict; use warnings; use Text::Balanced qw(extract_tagged); my $str = '<tag0> word2 <tag1>word3 word4</tag1> word5 </tag0> word6 < +tag2>word7 word8</tag2> word9 <tag3>word10</tag3>'; # this leads to errors # my $str = 'BEFORE <tag0> word2 <tag1>word3 word4</tag1> word5 </tag0 +> word6 <tag2>word7 word8</tag2> word9 <tag3>word10</tag3> AFTER'; my @res = extract_tagged( $str , '<tag0>', '</tag0>'); foreach (qw(extracted remainder prefix opening included closing)){ my $res = shift @res // 'NA'; print "$_:\t$res\n"; } __END__ extracted: <tag0> word2 <tag1>word3 word4</tag1> word5 </tag0> remainder: word6 <tag2>word7 word8</tag2> word9 <tag3>word10</ta +g3> prefix: opening: <tag0> included: word2 <tag1>word3 word4</tag1> word5 closing: </tag0>

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Parsing string with tags
by tangent (Parson) on May 31, 2022 at 15:00 UTC
    This is fairly verbose but I like to use HTML::Parser for parsing non-standard tagged text as it is very flexible and can be fine-tuned. Also good as a learning exercise.
    use HTML::Parser; use Data::Dumper; my $tag_string = 'word1 <tag0> word2 <tag1>word3 word4</tag1> word5 </ +tag0> word6 <tag2>word7 word8</tag2> word9 <tag3>word10</tag3> word11 +'; my $tags = parse_string( $tag_string ); print Dumper( $tags ); sub parse_string { my $string = shift; my %tags; my %check_tags; my $in_tag = 0; my $current_tag = 'no_tag'; my $prev_tag; # process start tag event my $start = sub { my ($tag) = @_; $check_tags{$tag}++; $in_tag++; $prev_tag = $current_tag; $current_tag = $tag; }; # process text event my $text = sub { my ($text) = @_; $text =~ s/^\s+//; $text =~ s/\s+$//; return if not length $text; my @words = split(m/\s+/, $text); push( @{ $tags{$current_tag} }, @words); }; # process end tag event my $end = sub { my ($tag) = @_; $check_tags{$tag}++; $in_tag--; $current_tag = $in_tag ? $prev_tag : 'no_tag'; }; my $parser = HTML::Parser->new( api_version => 3, start_h => [$start, "tagname"], text_h => [$text, "text"], end_h => [$end, "tagname"], default_h => [$text, "text"], ); $parser->parse($string); $parser->eof; # check each tag has an end tag for my $tag (keys %check_tags) { if ($check_tags{$tag} % 2) { print "<$tag> is not valid\n"; } } return \%tags; }
    Output:
    $VAR1 = { 'no_tag' => [ 'word1', 'word6', 'word9', 'word11' ], 'tag0' => [ 'word2', 'word5' ], 'tag1' => [ 'word3', 'word4' ], 'tag2' => [ 'word7', 'word8' ], 'tag3' => [ 'word10' ] };
Re: Parsing string with tags
by tybalt89 (Monsignor) on May 31, 2022 at 20:41 UTC

    Just for fun :)
    It only takes one s/// (with a little extra state)

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11144309 use warnings; my $str = 'word1 <tag0> word2 <tag1>word3 word4</tag1> word5 </tag0> w +ord6 <tag2>word7 word8</tag2>word9 <tag3>word10</tag3> word11'; print parsestringwithtags( $str, 'tag0', 'tag1' ), "\n"; print parsestringwithtags( $str, 'tag3' ), "\n"; print parsestringwithtags( $str, 'tag1', 'tag2', 'tag3' ), "\n"; sub parsestringwithtags { local $_ = shift; my %tags = map { $_, 1 } @_; my $active = 1; my @state = ['', 1]; s{ <(/?)(\w+)> | ([^<>]+) | ([<>]) }{ $4 ? die "rogue angle bracket $4 at $-[4]" : $2 # tag ? $1 # it is an end tag ? do { $state[-1][0] eq $2 # matches current tag ? do { $active = (pop @state)->[1]; '' } : die "mismatched tags $state[-1][0] vs $2" } : do { push @state, [$2, $active]; $tags{$2} or $active = 0; + '' } : $3 x $active # non-tag only if active }gex; @state > 1 and die "missing close tag for $state[-1][0]"; return $_; }

    Outputs:

    word1 word2 word3 word4 word5 word6 word9 word11 word1 word6 word9 word10 word11 word1 word6 word7 word8word9 word10 word11
Re: Parsing string with tags
by perlfan (Vicar) on Jun 01, 2022 at 17:28 UTC
    Some moons ago I wanted to do a similar thing that I ended up not doing, and that was parsing tags that including spaces. I didn't get past what syntax I'd use but with all the tagging a lot of sites do now I am sure this is a solved problem. Additionally, I was looking for the flexibility to associate an "attribute" with a tag or set of tags. The application I was writing would have benefitted greatly from this sort of scheme.

    For example, "tags"

    • word1
    • "word2Tag1 word3Tag1 word4Tag3"
    • ATTRIBUTE:word5
    • ATTRIBUTE:"word6Tag2 word7Tag2 word8Tag2"
    Inline, a real example would be something like:
    humans "good picture for thumbnail preview" boysEyes:brown fathersHair:blonde description:"a father and son are playing catch"

    This is obviously not a solution but I wanted to expand a bit on the scope of the _problem_ space. Maybe some wisdom will enlighten how to approach this since I am still interested in figuring this out since it'd have applications for labeling things like images for the borg. At this point there might be a well established way to do this that is not SGML/tag-based.

Re: Parsing string with tags
by Anonymous Monk on May 31, 2022 at 12:08 UTC

    It is neither XML nor is it HTML.

    Why not? Make it xml or html? Its literally what youre describing

    xsh2, xsh2

    And it would be cool if the string would be denied if the tagging is invalid

    yeah dtd can specify such a denial :D or xsd

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11144309]
Approved by 1nickt
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (3)
As of 2024-03-29 05:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found