Re: Entity statistics

Re^2: Entity statistics by choroba (Cardinal) on Nov 08, 2024 at 15:08 UTC
It might be harder than it seems. XML can contain (among other things) attributes, comments, and processing instructions. Are you sure you want to include their contents in the statistics?

ISO entities are not part of XML. There is probably some DTD that defines them, but as they are not standard (in XML), it might be hard to process them properly (and the DTD might define them in a non-standard way).

Moreover, the section mark can also be included in XML as `&sect;` (or `&#xA7;`, or `&#167;`), and `Art` can be represented with character references in place of its letters, for example. See this example (using PRE instead of CODE to include the section mark):

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    use experimental qw( signatures );
    use utf8;

    use XML::LibXML;
    use Encode qw{ encode };

    sub create_xml($xml) {
        open my $out, '>:encoding(UTF-8)', $xml or die $!;
        print {$out} <<~'__XML__';
            <?xml version="1.0"?>
            <!DOCTYPE root [
            <!ENTITY sect "§">
            ]>
            <root link="Art.VV">
            A § 1 A
            B Art.XVI B
            C § 9 C
            D § 7 D
            E § 6 E
            <!-- Should comments be included in statistics? Art.XXX -->
            <?print "Should processing instructions be included?" Art.2 ?>
            </root>
            __XML__
    }

    sub validate_xml($xml) {
        my $dom = 'XML::LibXML'->load_xml(location => $xml);
        print $dom;
    }

    sub generate_statistics($xml) {
        my @regexes = (qr/§\s[0-9]/, qr/Art\.\s[0-9IVX]/);
        open my $in, '<:encoding(UTF-8)', $xml or die $!;
        my $string = do { local $/; <$in> };
        my @tally;
        for my $i (0 .. $#regexes) {
            my $regex = $regexes[$i];
            ++$tally[$i] while $string =~ /$regex/g;
        }
        for my $i (0 .. $#regexes) {
            say encode('UTF-8', "$regexes[$i]:\t$tally[$i]");
        }
    }

    my $xml = '1.xml';
    create_xml($xml);
    validate_xml($xml);
    generate_statistics($xml);
    unlink $xml;

The output:

    <?xml version="1.0"?>
    <!DOCTYPE root [
    <!ENTITY sect "§">
    ]>
    <root link="Art.VV">
    A § 1 A
    B Art.XVI B
    C § 9 C
    D § 7 D
    E § 6 E
    <!-- Should comments be included in statistics? Art.XXX -->
    <?print "Should processing instructions be included?" Art.2 ?>
    </root>
    (?^u:§\s[0-9]):	1
    (?^:Art\.\s[0-9IVX]):	4

You see? The section mark was not counted, but the attribute, comment, and processing instruction were. Probably not what you want.

Update: Included `§`.
Update 2: Print the XML to show how some representations of the section mark are equivalent.
Update 3: Added an attribute.
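
One way to sidestep the problem described above is to let the parser resolve entities and character references first, and then count only in the text content of the document. The following is a minimal sketch, not a definitive implementation. It assumes XML::LibXML is installed and that attributes, comments, and processing instructions should be left out of the statistics. The sample document and the sub name `count_in_text` are made up for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

use XML::LibXML;
use Encode qw{ encode };

# Hypothetical helper: count regex matches in the text content only.
# Attribute values, comments, and processing instructions are not part
# of textContent, so they do not contribute to the count.
sub count_in_text {
    my ($xml_string, $regex) = @_;
    # Hand the parser UTF-8 bytes rather than a character string.
    my $dom  = 'XML::LibXML'->load_xml(string => encode('UTF-8', $xml_string));
    my $text = $dom->documentElement->textContent;
    my $count = () = $text =~ /$regex/g;
    return $count;
}

# Illustrative sample: the section mark (U+00A7) appears literally, as a
# DTD-defined entity, and as hexadecimal and decimal character references.
# It also appears in an attribute, a comment, and a processing instruction.
my $sample = <<"__XML__";
<?xml version="1.0"?>
<!DOCTYPE root [ <!ENTITY sect "&#xA7;"> ]>
<root link="\x{A7} 5">
A \x{A7} 1 A
B &sect; 2 B
C &#xA7; 3 C
D &#167; 4 D
<!-- \x{A7} 6 -->
<?print \x{A7} 7 ?>
</root>
__XML__

print count_in_text($sample, qr/\x{A7}\s[0-9]/), "\n";
```

Whether the other node types should be counted is a decision about the statistics, not about the parser. If they should, they would need to be collected separately (for example via XPath) rather than by matching against the raw file.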
Re^2: Entity statistics by LexPl (Beadle) on Nov 08, 2024 at 14:13 UTC
Hi,

Thanks for your kind welcome! Let me try to paraphrase what you told me so that I understand it correctly.

I would put my regexes in an array, e.g.

    my @regexes = (§\s[0-9], Art\.\s[0-9IVX, ...)

Is this what you meant?

Then how do I read "a data file into a scalar as a string"? Is it just `my $file = 'fname.xml'`? Normally I use a file handle like this:

    my $infile = $ARGV[0];
    open(IN, '<' . $infile) or die $!;

Which kind of loop construct do you have in mind?

With regard to the ISO entities, `&sect;`, which stands for the "§" symbol, is an example of what I meant.
Re^3: Entity statistics by hippo (Archbishop) on Nov 08, 2024 at 14:52 UTC
> `my @regexes = (§\s[0-9], Art\.\s[0-9IVX, ...)`

Like that, except that each regex needs to be contained in some way, otherwise it will look like Perl code. You can either enclose them in quotes or mark them as regexes by using the `qr//` operator like this:

    my @regexes = (qr/§\s[0-9]/, qr/Art\.\s[0-9IVX]/, ...)

> Then how do I read "a data file into a scalar as a string"?

Mostly as you have said you do it normally, but being sure to concatenate each line or to read them all at once. There are modules which can help with this, such as Path::Tiny, File::Slurper and so on. See lots more about this in the Illumination "How do I read an entire file into a string?"

    my $infile = $ARGV[0];
    open my $inh, '<', $infile or die "Cannot open $infile for reading: $!";
    my $xml;
    {
        local $/ = undef;
        $xml = <$inh>;
    }
    close $inh;

> Which kind of loop construct do you think of?

I was thinking of a `for` loop, as that is the trivial way to iterate over an array unless there is a good reason to use something else (which does not appear to be the case here).

Thanks for clarifying about the entities. Those should be fine as they are just data. You may need to escape any characters which have special meaning to the regular expression engine, but otherwise they should not cause any problems. Try it and see how you get along.

🦛
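
The remark about escaping can be illustrated with a small sketch. It assumes the search terms are literal strings and not patterns. The sample text and the sub name `count_literal` are invented for this example. The core function `quotemeta` (or `\Q...\E` inside a pattern) escapes characters such as `.` so that they match only themselves.

```perl
use warnings;
use strict;

# Hypothetical helper: count occurrences of a literal string.
sub count_literal {
    my ($text, $literal) = @_;
    my $regex = quotemeta $literal;    # "Art." becomes "Art\."
    my $count = () = $text =~ /$regex/g;
    return $count;
}

my $text = 'Art. 1, Art. 2, Arts 3';

# Unescaped, the dot matches any character, so "Arts" is counted as well.
my $unescaped = () = $text =~ /Art./g;

print "escaped:   ", count_literal($text, 'Art.'), "\n";
print "unescaped: $unescaped\n";
```

Patterns that are meant to be regexes, such as the `qr//` ones above, already carry their own escapes (`Art\.`) and should not be passed through `quotemeta`.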
Re^4: Entity statistics by LexPl (Beadle) on Nov 12, 2024 at 13:15 UTC
First of all, many thanks for the helpful assistance and good advice from @choroba and @hippo!

I have taken up your input and built the following script:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use diagnostics;

    my $infile = $ARGV[0];
    my @regexes = (qr/§\s[0-9]/, qr/Art\.\s[0-9IVX]/, qr/Artikel\s*[0-9IVX]/, qr/Artikels\s[0-9IVX]/, qr/Artikeln\s[0-9IVX]/);

    open my $in, '<', $infile or die "Cannot open $infile for reading: $!";

    my $xml;
    {
        local $/ = undef;
        $xml = <$in>;
    }

    my $tally;
    for my $i (0 .. $#regexes) {
        my $regex = $regexes[$i];
        ++$tally[$i] while $xml =~ /$regex/g;
    }

    for my $i (0 .. $#regexes) {
        print "$regexes[$i]:\t$tally[$i]\n";
    }
    close $in;

With `use strict;` I get the following error message:

    Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 24.
    Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 28.
    Execution of monk2.pl aborted due to compilation errors (#1)
        (F) You've said "use strict" or "use strict vars", which indicates
        that all variables must either be lexically scoped (using "my" or "state"),
        declared beforehand using "our", or explicitly qualified to say
        which package the global variable is in (using "::").

    Uncaught exception from user code:
        Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 24.
        Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 28.
        Execution of monk2.pl aborted due to compilation errors.

As the variable $tally is defined beforehand and preceded by the keyword "my", I don't understand what is wrong. How could I fix this?

If I run the same script without `use strict;`, the output looks like this:

    (?^:§\s[0-9]):	3
    (?^:Art\.\s[0-9IVX]):	2
    (?^:Artikel\s[0-9IVX]):	2
    (?^:Artikels\s[0-9IVX]):	2
    (?^:Artikeln\s*[0-9IVX]):	2

How could I get rid of "(?^:" and ")"? Would it be possible to save this output to a file?

Have a nice afternoon!
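
The error message itself points at the cause: `my $tally;` declares a scalar, while `$tally[$i]` is an element of the array `@tally`, which was never declared. What follows is a minimal sketch of one possible fix, not the answer given in the replies. It assumes the counts should go to a file whose name (`stats.txt`) is made up here, and it uses `regexp_pattern` from the core `re` module to recover each pattern without the `(?^:` and `)` wrapper. The sub names `tally_matches` and `write_report` and the sample text are illustrative only.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use re qw{ regexp_pattern };

# Count the matches of each regex in $text; returns one count per regex.
sub tally_matches {
    my ($text, @regexes) = @_;
    my @tally = (0) x @regexes;    # an array, declared with "my @tally"
    for my $i (0 .. $#regexes) {
        my $regex = $regexes[$i];
        ++$tally[$i] while $text =~ /$regex/g;
    }
    return @tally;
}

# Write "pattern<TAB>count" lines to $outfile (the file name is an assumption).
sub write_report {
    my ($outfile, $regexes, $tally) = @_;
    open my $out, '>', $outfile
        or die "Cannot open $outfile for writing: $!";
    for my $i (0 .. $#$regexes) {
        # In list context regexp_pattern returns (pattern, modifiers).
        my ($pattern) = regexp_pattern($regexes->[$i]);
        print {$out} "$pattern:\t$tally->[$i]\n";
    }
    close $out or die "Cannot close $outfile: $!";
}

my @regexes = (qr/Art\.\s[0-9IVX]/, qr/Artikel\s*[0-9IVX]/);
my $text    = 'Art. 1 und Art. IV sowie Artikel 7';
my @tally   = tally_matches($text, @regexes);
write_report('stats.txt', \@regexes, \@tally);
```

Initialising every counter to 0 also avoids "uninitialized value" warnings for patterns that never match. Redirecting the script's standard output on the command line would be another way to get the results into a file.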
Re^5: Entity statistics by choroba (Cardinal) on Nov 12, 2024 at 13:23 UTC
Re^6: Entity statistics by LexPl (Beadle) on Nov 12, 2024 at 16:50 UTC
Re^5: Entity statistics by hippo (Archbishop) on Nov 12, 2024 at 13:34 UTC