rsriram has asked for the wisdom of the Perl Monks concerning the following question:

Hi. In a text file, I want to check that every tag used comes from a given set of allowed elements. Below is the code I use for this.

    my $file = $_;
    my @tags = (qw(kt bold ital ch([^>]*)));
    my %allowed_tag = map { $_ => 1 } @tags;
    while (defined (my $line = <F1>)) {
        $line =~ /<([^>].+)>/;
        my $tag = $1;
        die "Invalid element $tag in $.. Cannot proceed due to the above error\n"
            unless exists $allowed_tag{$tag};
    }

This works well when the tag has no attributes or variable part, i.e., for the first three elements in the array: kt, bold, ital. But I also need to allow elements that have partly variable content. For example, the tag for a chapter will be <ch1>, where the trailing number is the chapter's actual number. I also have tags like <fig n="1">. If I put a regular expression in the array (as I have done above), the script does not accept the tag and reports an error. Can anyone help me out? The markup is not XML, so I cannot use any XML modules.

Replies are listed 'Best First'.
Re: Matching elements in a array
by shmem (Chancellor) on Aug 23, 2006 at 07:31 UTC
    You are doing it the wrong way round. If $line contains <ch1>, $1 will contain ch1 after the pattern match, not ch([^>]*), which is the key you have in your %allowed_tag hash.

    So, match as little as possible in your regexp, e.g.:

    # \d* rather than \d+? so that tags without a number, such as <kt>, still match
    $line =~ /<([^>]+?)\d*\s*>/;
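
    To make the point concrete, here is a minimal sketch of the lookup once the trailing number is stripped. It assumes the hash holds the bare name "ch" instead of a pattern, and it uses \d* so that tags without a number also match; base_tag is a hypothetical helper name, not something from the original script.

    ```perl
    use strict;
    use warnings;

    # Allowed base names; "ch" stands for ch1, ch2, ... (an assumption of this sketch).
    my %allowed_tag = map { $_ => 1 } qw(kt bold ital ch);

    # Hypothetical helper: return the tag name with any trailing number removed,
    # or undef when the line contains no tag.
    sub base_tag {
        my ($line) = @_;
        return $line =~ /<([^>]+?)\d*\s*>/ ? $1 : undef;
    }

    for my $line ('<kt>', '<ch1>', '<ch12 >', '<foo>') {
        my $tag     = base_tag($line);
        my $verdict = exists $allowed_tag{$tag} ? 'allowed' : 'not allowed';
        print "${line} -> ${tag} ($verdict)\n";
    }
    ```

    One side effect of this approach is that a number after any allowed name is accepted too, so <bold2> would pass as "bold". If that matters, a list of anchored patterns is the safer route.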

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Matching elements in a array
by duckyd (Hermit) on Aug 23, 2006 at 07:45 UTC
    I'm not sure I understand. It looks like you want the "ch([^>]*)" in @tags to be treated as a regex, but you're not actually doing that. Since you can't use a regex as a hash key to do what you want, you might consider using an array of regexes. Also, when matching the tag, you probably want to match up to the first space or >:

    my @tag_regexes = (
        qr/^kt$/,
        qr/^bold$/,
        qr/^ital$/,
        qr/^ch\d$/,
    );
    while (defined (my $line = <DATA>)) {
        $line =~ /<([^ >]+)/;
        my $tag = $1;
        die "Invalid element $tag in $.. Cannot proceed due to the above error\n"
            unless grep { $tag =~ $_ } @tag_regexes;
        print "tag: $tag\n";
    }
    __DATA__
    <kt>
    <bold>
    <ital>
    <ch1>
    <kt someattribute="somevalue">
    <bold someattribute="somevalue">
    <ital someattribute="somevalue">
    <ch1 someattribute="somevalue">

      Actually, I just realized that if you do go this route you can avoid needing to match twice:

      my @tag_regexes = map { qr/<($_) ?[^>]*>/ } qw/kt bold ital ch\d/;
      while (defined (my $line = <DATA>)) {
          die "Invalid element in line #$. ($line). Cannot proceed due to the above error\n"
              unless grep { $line =~ $_ } @tag_regexes;
      }
      __DATA__
      <kt>
      <bold>
      <ital>
      <ch1>
      <kt someattribute="somevalue">
      <bold someattribute="somevalue">
      <ital someattribute="somevalue">
      <ch1 someattribute="somevalue">
      <foo>
Re: Matching elements in a array
by cdarke (Prior) on Aug 23, 2006 at 07:55 UTC
    I'm puzzled by the ch([^>]*), like everyone else.
    Your creation of the hash is a bit laboured; it is easier to use a hash slice:

    my %allowed_tag;
    @allowed_tag{@tags} = ();

    The values are now undef, but that's OK because you are checking with exists.
    In general with these kinds of REs you need to define a list of ALL possible patterns (tags) that you need to check for, and proceed from there.
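
    As a small illustration of the hash-slice idiom, the sketch below builds the lookup hash and shows that exists succeeds even though the stored values are undef. The is_allowed helper is a hypothetical name used only for this example.

    ```perl
    use strict;
    use warnings;

    my @tags = qw(kt bold ital);

    # Hash slice: creates one key per tag, each with an undef value.
    my %allowed_tag;
    @allowed_tag{@tags} = ();

    # Hypothetical helper: 1 if the tag is a key of %allowed_tag, else 0.
    sub is_allowed { return exists $allowed_tag{ $_[0] } ? 1 : 0 }

    print "bold allowed: ", is_allowed('bold'), "\n";
    print "foo allowed: ",  is_allowed('foo'),  "\n";
    print "bold value defined: ", (defined $allowed_tag{bold} ? 'yes' : 'no'), "\n";
    ```

    Testing with defined or plain truth instead of exists would reject every tag here, which is why the slice and the exists check belong together.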
Re: Matching elements in a array
by rodion (Chaplain) on Aug 23, 2006 at 11:19 UTC
    duckyd's solution checks that all lines have at least one valid tag in them. I think you were looking for a check that each tag, when present, is a valid tag, skipping over any lines that don't have tags.

    Also, if there's more than one tag on a line, you probably want to check all of them. You can get all the tags easily with the "g" option to the m// match operator, if you use it in list context. Note that the match has to be non-greedy, using "+?" instead of "+", otherwise it will lump tags together.

    Try this, which catches the second tag in the last line. (tested)

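    my $allowed = qr{kt$|bold$|ital$|ch};
    while (defined (my $line = <DATA>)) {
        # Collect every tag on the line and keep those that are not allowed.
        # Reporting $bad[0] instead of $1 names the offending tag, not merely the last one matched.
        my @bad = grep { $_ !~ /^$allowed/ } $line =~ /<([^>].+?)>/g;
        die "Element $bad[0] invalid, line #$.\n" if @bad;
    }
    __DATA__
    <kt>
    <bold>
    <ital>
    <ch1>
    <ch_named>
    Line of content
    this line has an unspecified <ch> tag
    <ch><Bold>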
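
    The same list-context idea can be extended to tags with attributes, such as the <fig n="1"> from the question. The following is only a sketch under assumptions: the allow-list (including "fig" and "ch" followed by digits) and the invalid_tags helper are invented for illustration, it is stricter than the code above in that a bare <ch> is rejected, and closing tags are not handled.

    ```perl
    use strict;
    use warnings;

    # Assumed allow-list for this sketch: three fixed names, "fig", and "ch" plus a number.
    my $allowed = qr/^(?:kt|bold|ital|fig|ch\d+)$/;

    # Hypothetical helper: return the names of all tags on a line that are not allowed.
    sub invalid_tags {
        my ($line) = @_;
        # In list context, m//g returns the captured name of every tag on the line.
        # The name ends at the first whitespace or ">", so attributes are skipped.
        my @names = $line =~ /<([^\s>]+)[^>]*>/g;
        return grep { $_ !~ $allowed } @names;
    }

    for my $line ('<ch1><bold>', 'plain text', '<fig n="1"> and <Bold>', '<ch>') {
        my @bad = invalid_tags($line);
        if (@bad) { print "${line}: invalid @bad\n" }
        else      { print "${line}: ok\n" }
    }
    ```

    Because the name pattern excludes ">", it cannot run across two adjacent tags, so no non-greedy quantifier is needed in this variant.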
Re: Matching elements in a array
by mantra2006 (Hermit) on Aug 23, 2006 at 12:29 UTC
    Hey,

    You can use array union, intersection and difference; I hope this helps. Here is some sample code:

    @union = @intersection = @difference = ();
    %count = ();
    # Count how often each element occurs across both arrays
    foreach $element (@Array, @lArray) { $count{$element}++ }
    foreach $element (keys %count) {
        push @union, $element;
        # Seen more than once: in both arrays; seen once: in only one of them
        # (assuming neither array contains duplicates)
        push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
    }
    foreach $int (@difference) {
        if ($int ne ";") {
            @fname = split(/;/, $int);
            print "\n difference elements --> $fname[0]\n";
        }
    }



    Sridhar