It might be harder than it seems.
  1. XML can contain (among other things) attributes, comments, and processing instructions. Are you sure you want to include their contents in the statistics?
  2. ISO entities are not part of XML itself. There is probably some DTD that defines them, but as they are not standard in XML, it might be hard to process them properly (and the DTD might define them in a non-standard way). Moreover, the section mark can also appear in XML as &#167; (or &#xA7;, or a literal §), and Art can be represented as &#65;rt, for example.

See this example (using PRE instead of CODE to include the section mark):


#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use experimental qw( signatures );
use utf8;

use XML::LibXML;
use Encode qw{ encode };

sub create_xml($xml) {
    open my $out, '>:encoding(UTF-8)', $xml or die $!;
    print {$out} <<~'__XML__';
    <?xml version="1.0"?>
    <!DOCTYPE root [
        <!ENTITY sect "§">
    ]>
    <root link="Art.VV">
        A &sect; 1 A
        B Art.XVI B
        C §  9 C
        D &#xa7; 7 D
        E &#167; 6 E
        <!-- Should comments be included in statistics? Art.XXX -->
        <?print "Should processing instructions be included?" Art.2 ?>
    </root>
    __XML__
}

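# Despite the name, this only parses the file (a well-formedness check) and
# prints it back, which shows how the parser normalizes the section marks.
sub validate_xml($xml) {
    my $dom = 'XML::LibXML'->load_xml(location => $xml);
    print $dom;
}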

sub generate_statistics($xml) {
    my @regexes = (qr/§\s*[0-9]/, qr/Art\.\s*[0-9IVX]/);

    open my $in, '<:encoding(UTF-8)', $xml or die $!;
    my $string = do { local $/; <$in> };
    my @tally = (0) x @regexes;  # start at 0 so a pattern with no matches prints 0, not undef
    for my $i (0 .. $#regexes) {
        my $regex = $regexes[$i];
        ++$tally[$i] while $string =~ /$regex/g;
    }
    for my $i (0 .. $#regexes) {
        say encode('UTF-8', "$regexes[$i]:\t$tally[$i]");
    }
}

my $xml = '1.xml';
create_xml($xml);
validate_xml($xml);
generate_statistics($xml);
unlink $xml;

The output:
<?xml version="1.0"?>
<!DOCTYPE root [
<!ENTITY sect "§">
]>
<root link="Art.VV">
    A &#xA7; 1 A
    B Art.XVI B
    C &#xA7;  9 C
    D &#xA7; 7 D
    E &#xA7; 6 E
    <!-- Should comments be included in statistics? Art.XXX -->
    <?print "Should processing instructions be included?" Art.2 ?>
</root>
(?^u:§\s*[0-9]):	1
(?^:Art\.\s*[0-9IVX]):	4

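You see? Only the literal § (line C) was counted; the entity reference and both numeric character references (lines A, D, and E) were missed. Meanwhile, the Art. occurrences in the attribute, the comment, and the processing instruction were all counted. Probably not what you want.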
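One possible way around both problems is to let the parser resolve the entities first and then match only against text nodes. The following is just a sketch under some assumptions: count_in_text is a hypothetical helper, not something from the original script, and I am assuming that attributes, comments, and processing instructions should all be excluded, which may not be what you need. The DTD here defines sect with a character reference so that the entity declaration does not depend on the file encoding.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use utf8;

use XML::LibXML;
use Encode qw{ encode };

# Hypothetical helper (not part of the original script): count regex matches
# in the text nodes of an XML document, given as a string of UTF-8 bytes.
sub count_in_text {
    my ($xml_bytes, @regexes) = @_;
    # expand_entities makes the parser replace &sect; by its definition.
    my $dom = 'XML::LibXML'->load_xml(string => $xml_bytes, expand_entities => 1);
    # //text() selects text nodes only; attributes, comments, and processing
    # instructions are different node types and are skipped.
    my $text = join ' ', map { $_->nodeValue } $dom->findnodes('//text()');
    # The "() =" idiom counts all matches of a /g regex.
    return map { my $count = () = $text =~ /$_/g; $count } @regexes;
}

my $xml = <<'__XML__';
<?xml version="1.0"?>
<!DOCTYPE root [
    <!ENTITY sect "&#xA7;">
]>
<root link="Art.VV">
    A &sect; 1 A
    B Art.XVI B
    C §  9 C
    D &#xa7; 7 D
    E &#167; 6 E
    <!-- Art.XXX -->
    <?print Art.2 ?>
</root>
__XML__

my @regexes = (qr/§\s*[0-9]/, qr/Art\.\s*[0-9IVX]/);
# No encoding is declared in the XML, so the parser expects UTF-8 bytes.
my @tally = count_in_text(encode('UTF-8', $xml), @regexes);

binmode STDOUT, ':encoding(UTF-8)';
say "$regexes[$_]:\t$tally[$_]" for 0 .. $#regexes;
```

With this input, the sketch should report 4 for the section mark (lines A, C, D, and E) and 1 for Art. (line B only). If attributes should be counted too, something like //@* could be added to the XPath expression, but that is a decision for whoever needs the statistics.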

Update: Included &#xa7;.

Update 2: Print the XML to show how some representations of the section mark are equivalent.

Update 3: Added an attribute.
