in reply to Entity statistics

Hello LexPl and welcome to the Monastery and to Perl in general.

One simple approach to your task would be to construct an array of your regex patterns, read your data file into a scalar as a string and then loop over the array and count the matches of each one in the string. To count matches in a string you can use this form:

my $count =()= $string =~ /regex/gs;
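The approach described above could be sketched roughly as follows. This is only a minimal illustration: the sample text in `$string` and the two patterns are made-up stand-ins for the real data file and the real list of regexes.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of the data file.
my $string = "See Art. 1 and Art. IV,\nthen Artikel 3 and Art.7.";

# The patterns to count, compiled once with qr//.
my @patterns = (qr/Art\.\s*[0-9IVX]/, qr/Artikel\s*[0-9IVX]/);

my @counts;
for my $pattern (@patterns) {
    # Assigning the match to an empty list forces list context, so all
    # matches are found; assigning that list assignment to a scalar
    # yields the number of matches.
    my $count = () = $string =~ /$pattern/g;
    push @counts, $count;
}

printf "%-25s %d\n", $patterns[$_], $counts[$_] for 0 .. $#patterns;
```

With this sample text the first pattern is found three times and the second once.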
These patterns often also contain ISO entities.

It isn't entirely clear to me precisely what you mean by this. Could you elaborate? It may or may not have any bearing on the task.

(Updated: typo fix - thanks, choroba.)


🦛

Re^2: Entity statistics
by choroba (Cardinal) on Nov 08, 2024 at 15:08 UTC
    It might be harder than it seems.
    1. XML can contain (among other things) attributes, comments, and processing instructions. Are you sure you want to include their contents into the statistics?
    2. ISO entities are not part of XML itself. There is probably some DTD that defines them, but as they are not standard (in XML), it might be hard to process them properly (and the DTD might define them in a non-standard way). Moreover, the section mark can also be included in XML as &#167; (or &#xA7;, or the literal §), and Art can be represented as &#65;rt, for example.

    See this example (using PRE instead of CODE to include the section mark):


    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    use experimental qw( signatures );
    use utf8;
    
    use XML::LibXML;
    use Encode qw{ encode };
    
    sub create_xml($xml) {
        open my $out, '>:encoding(UTF-8)', $xml or die $!;
        print {$out} <<~'__XML__';
        <?xml version="1.0"?>
        <!DOCTYPE root [
            <!ENTITY sect "§">
        ]>
        <root link="Art.VV">
            A &sect; 1 A
            B Art.XVI B
            C §  9 C
            D &#xa7; 7 D
            E &#167; 6 E
            <!-- Should comments be included in statistics? Art.XXX -->
            <?print "Should processing instructions be included?" Art.2 ?>
        </root>
        __XML__
    }
    
    sub validate_xml($xml) {
        my $dom = 'XML::LibXML'->load_xml(location => $xml);
        print $dom;
    }
    
    sub generate_statistics($xml) {
        my @regexes = (qr/§\s*[0-9]/, qr/Art\.\s*[0-9IVX]/);
    
        open my $in, '<:encoding(UTF-8)', $xml or die $!;
        my $string = do { local $/; <$in> };
        my @tally;
        for my $i (0 .. $#regexes) {
            my $regex = $regexes[$i];
            ++$tally[$i] while $string =~ /$regex/g;
        }
        for my $i (0 .. $#regexes) {
            say encode('UTF-8', "$regexes[$i]:\t$tally[$i]");
        }
    }
    
    my $xml = '1.xml';
    create_xml($xml);
    validate_xml($xml);
    generate_statistics($xml);
    unlink $xml;
    

    The output:
    <?xml version="1.0"?>
    <!DOCTYPE root [
    <!ENTITY sect "§">
    ]>
    <root link="Art.VV">
        A &#xA7; 1 A
        B Art.XVI B
        C &#xA7;  9 C
        D &#xA7; 7 D
        E &#xA7; 6 E
        <!-- Should comments be included in statistics? Art.XXX -->
        <?print "Should processing instructions be included?" Art.2 ?>
    </root>
    (?^u:§\s*[0-9]):	1
    (?^:Art\.\s*[0-9IVX]):	4
    

    You see? Only the literal section mark was counted (the entity and the numeric character references were not), while the attribute, comment, and processing instruction were. Probably not what you want.
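One possible way around both problems, sketched here under assumptions rather than offered as a definitive solution, is to let XML::LibXML parse the document and to run the regexes only over the text nodes. The sample document below is hypothetical and modelled on the one above; `expand_entities` is passed explicitly so that `&sect;` is replaced by the character it stands for. Whether attributes, comments, and processing instructions really should be excluded is still a decision for the person who needs the statistics.

```perl
use strict;
use warnings;
use utf8;                       # the source contains a literal § in a regex
use XML::LibXML;

# Hypothetical sample document; only ASCII is used inside the XML itself.
my $xml = <<'__XML__';
<?xml version="1.0"?>
<!DOCTYPE root [
    <!ENTITY sect "&#xA7;">
]>
<root link="Art.VV">
    A &sect; 1 A
    B Art.XVI B
    D &#xa7; 7 D
    <!-- Art.XXX in a comment -->
    <?print Art.2 ?>
</root>
__XML__

my $dom = XML::LibXML->load_xml(string => $xml, expand_entities => 1);

# Concatenate text nodes only: attributes, comments and processing
# instructions are not text nodes, so they are left out.
my $text = join '', map { $_->data } $dom->findnodes('//text()');

my @labels  = ('section mark', 'Art.');
my @regexes = (qr/§\s*[0-9]/, qr/Art\.\s*[0-9IVX]/);

my @tally;
for my $regex (@regexes) {
    my $count = () = $text =~ /$regex/g;
    push @tally, $count;
}

print "$labels[$_]:\t$tally[$_]\n" for 0 .. $#regexes;
```

In this sketch the entity and the character reference both end up as a plain § in `$text`, so one regex covers all spellings.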

    Update: Included &#xa7;.

    Update 2: Print the XML to show how some representations of the section mark are equivalent.

    Update 3: Added an attribute.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re^2: Entity statistics
by LexPl (Beadle) on Nov 08, 2024 at 14:13 UTC
    Hi,

    Thanks for your kind welcome!

    Let me try to paraphrase what you told me to check that I understand it correctly. I would put my regexes in an array, e.g.

    my @regexes = (&sect;\s*[0-9], Art\.\s*[0-9IVX, ...)

    Is this what you meant?

    Then how do I read "a data file into a scalar as a string"? Is it just my $file = 'fname.xml'?

    Normally I use a file handle like this

    my $infile = $ARGV[0];
    open(IN, '<' . $infile) or die $!;

    Which kind of loop construct do you think of?

    With regard to the ISO entities, &sect;, which stands for the "§" symbol, is an example of what I meant.

      my @regexes = (&sect;\s*[0-9], Art\.\s*[0-9IVX, ...)

      Like that, except that each regex needs to be delimited in some way, otherwise perl will try to parse it as Perl code. You can either enclose them in quotes or mark them as regexes with the qr// operator like this:

      my @regexes = (qr/&sect;\s*[0-9]/, qr/Art\.\s*[0-9IVX]/, ...)
      Then how do I read "a data file into a scalar as a string"?

      Mostly the way you said you normally do it, but making sure either to concatenate the lines or to read them all at once. There are modules which can help with this, such as Path::Tiny and File::Slurper. See lots more about this in the Illumination How do I read an entire file into a string?

      my $infile = $ARGV[0];
      open my $inh, '<', $infile or die "Cannot open $infile for reading: $!";
      my $xml;
      {
          local $/ = undef;
          $xml = <$inh>;
      }
      close $inh;
      Which kind of loop construct do you think of?

      I was thinking of a for loop, as that is the trivial way to iterate over an array unless there is a good reason to use something else (which does not appear to be the case here).

      Thanks for clarifying about the entities. Those should be fine as they are just data. You may need to escape any characters which have special meaning to the regular expression engine but otherwise they should not cause any problems. Try it and see how you get along.
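To illustrate the point about escaping: when a piece of literal text is interpolated into a pattern, `\Q...\E` (or `quotemeta`) takes away the special meaning of characters such as the dot. The strings below are invented purely for demonstration.

```perl
use strict;
use warnings;

my $literal = 'Art.';              # the "." is special in a regex
my $text    = 'Art. 1, Arts 2';    # hypothetical sample text

my $loose  = qr/$literal/;         # "." matches any character
my $strict = qr/\Q$literal\E/;     # "." matches only a literal dot

my $loose_count  = () = $text =~ /$loose/g;
my $strict_count = () = $text =~ /$strict/g;

print "unescaped: $loose_count, escaped: $strict_count\n";
```

The unescaped pattern also matches "Arts", so it reports two matches where the escaped one reports one.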


      🦛

        First of all, many thanks for the helpful assistance and good advice from @choroba and @hippo!

        I have taken up your input and built the following script:

        #!/usr/bin/perl
        use warnings;
        use strict;
        use diagnostics;

        my $infile = $ARGV[0];
        my @regexes = (qr/&sect;\s*[0-9]/, qr/Art\.\s*[0-9IVX]/, qr/Artikel\s*[0-9IVX]/, qr/Artikels\s*[0-9IVX]/, qr/Artikeln\s*[0-9IVX]/);

        open my $in, '<', $infile or die "Cannot open $infile for reading: $!";
        my $xml;
        {
            local $/ = undef;
            $xml = <$in>;
        }

        my $tally;
        for my $i (0 .. $#regexes) {
            my $regex = $regexes[$i];
            ++$tally[$i] while $xml =~ /$regex/g;
        }

        for my $i (0 .. $#regexes) {
            print "$regexes[$i]:\t$tally[$i]\n";
        }

        close $in;

        With use strict; I get the following error message:

        Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 24.
        Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 28.
        Execution of monk2.pl aborted due to compilation errors (#1)
            (F) You've said "use strict" or "use strict vars", which indicates
            that all variables must either be lexically scoped (using "my" or "state"),
            declared beforehand using "our", or explicitly qualified to say
            which package the global variable is in (using "::").

        Uncaught exception from user code:
            Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 24.
            Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 28.
            Execution of monk2.pl aborted due to compilation errors.

        As the variable $tally is defined beforehand and preceded by the keyword "my", I don't understand what is wrong. How could I fix this?

        If I run the same script without use strict;, the output looks like this:

        (?^:&sect;\s*[0-9]):	3
        (?^:Art\.\s*[0-9IVX]):	2
        (?^:Artikel\s*[0-9IVX]):	2
        (?^:Artikels\s*[0-9IVX]):	2
        (?^:Artikeln\s*[0-9IVX]):	2

        How could I get rid of "(?^:" and ")"? Would it be possible to save this output to a file?

        Have a nice afternoon!
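For what it is worth, the three points raised in this last post could be addressed along the following lines. This is a hedged sketch, not a reply from the thread: the sample string and the output file name `stats.txt` are assumptions made for illustration. The strict error arises because `my $tally` declares a scalar, whereas `$tally[$i]` is an element of the array `@tally`, which is a different variable and was never declared. The `(?^:` and `)` come from stringifying a `qr//` object; `regexp_pattern` from the core `re` module returns the bare pattern instead.

```perl
use strict;
use warnings;
use re qw(regexp_pattern);

# Hypothetical stand-in for the slurped file contents.
my $xml = 'Art. 1 &sect; 2 Artikel 3 &sect;4';

my @regexes = (qr/&sect;\s*[0-9]/, qr/Art\.\s*[0-9IVX]/, qr/Artikel\s*[0-9IVX]/);

# Declare the ARRAY @tally (not the scalar $tally), one zero per regex,
# so that patterns without any match print 0 rather than undef.
my @tally = (0) x @regexes;

for my $i (0 .. $#regexes) {
    my $regex = $regexes[$i];
    ++$tally[$i] while $xml =~ /$regex/g;
}

# Hypothetical output file name.
my $outfile = 'stats.txt';
open my $out, '>', $outfile or die "Cannot open $outfile for writing: $!";
for my $i (0 .. $#regexes) {
    # In list context regexp_pattern returns (pattern, modifiers);
    # only the pattern is kept here.
    my ($pattern) = regexp_pattern($regexes[$i]);
    print {$out} "$pattern:\t$tally[$i]\n";
}
close $out or die "Cannot close $outfile: $!";
```

Alternatively, the script could keep printing to STDOUT and the output could be redirected on the command line, e.g. `perl monk2.pl fname.xml > stats.txt`.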