in reply to Re^2: Entity statistics
in thread Entity statistics

my @regexes = (§\s*[0-9], Art\.\s*[0-9IVX], ...)

Like that, except that each regex needs to be delimited in some way; otherwise Perl will try to parse it as ordinary code. You can either enclose them in quotes or mark them as regexes by using the qr// operator like this:

my @regexes = (qr/§\s*[0-9]/, qr/Art\.\s*[0-9IVX]/, ...)
Then how do I read "a data file into a scalar as a string"?

Mostly the way you said you normally do it, but be sure either to concatenate each line or to read them all at once. There are modules which can help with this, such as Path::Tiny, File::Slurper and so on. See lots more about this in the Illumination How do I read an entire file into a string?

my $infile = $ARGV[0];
open my $inh, '<', $infile or die "Cannot open $infile for reading: $!";
my $xml;
{
    local $/ = undef;
    $xml = <$inh>;
}
close $inh;
Which kind of loop construct do you think of?

I was thinking of a for loop, as that is the trivial way to iterate over an array unless there is a good reason to use something else (which does not appear to be the case here).

Thanks for clarifying about the entities. Those should be fine as they are just data. You may need to escape any characters which have special meaning to the regular expression engine but otherwise they should not cause any problems. Try it and see how you get along.
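
As a minimal sketch of that escaping (the sample strings are made up for illustration, not taken from the original data): quotemeta, or equivalently \Q...\E inside a pattern, backslash-escapes every character that the regex engine would otherwise treat specially.

```perl
use strict;
use warnings;

# Hypothetical sample data, for illustration only.
my $text    = 'Price: 3.50 EUR (approx.) and 3x50 EUR';
my $literal = '3.50';

# Unescaped, the dot matches any character, so "3x50" is counted too.
my $loose = () = $text =~ /$literal/g;

# quotemeta escapes the dot, so only the literal "3.50" matches.
my $escaped = quotemeta $literal;
my $strict  = () = $text =~ /$escaped/g;

# \Q...\E does the same thing inline.
my $inline = () = $text =~ /\Q$literal\E/g;

print "loose=$loose strict=$strict inline=$inline\n";
```

With this sample text the unescaped pattern finds two matches and both escaped variants find one. Escaping is only wanted for text that should match literally; the patterns in @regexes are meant to be regexes and should stay unescaped.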


🦛

Re^4: Entity statistics
by LexPl (Beadle) on Nov 12, 2024 at 13:15 UTC

    First of all, many thanks for the helpful assistance and good advice from @choroba and @hippo!

    I have taken up your input and built the following script:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use diagnostics;

    my $infile = $ARGV[0];

    my @regexes = (
        qr/&sect;\s*[0-9]/,
        qr/Art\.\s*[0-9IVX]/,
        qr/Artikel\s*[0-9IVX]/,
        qr/Artikels\s*[0-9IVX]/,
        qr/Artikeln\s*[0-9IVX]/,
    );

    open my $in, '<', $infile or die "Cannot open $infile for reading: $!";

    my $xml;
    {
        local $/ = undef;
        $xml = <$in>;
    }

    my $tally;

    for my $i (0 .. $#regexes) {
        my $regex = $regexes[$i];
        ++$tally[$i] while $xml =~ /$regex/g;
    }

    for my $i (0 .. $#regexes) {
        print "$regexes[$i]:\t$tally[$i]\n";
    }

    close $in;

    With use strict; I get the following error message:

    Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 24.
    Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 28.
    Execution of monk2.pl aborted due to compilation errors (#1)
        (F) You've said "use strict" or "use strict vars", which indicates
        that all variables must either be lexically scoped (using "my" or "state"),
        declared beforehand using "our", or explicitly qualified to say
        which package the global variable is in (using "::").

    Uncaught exception from user code:
        Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 24.
        Global symbol "@tally" requires explicit package name (did you forget to declare "my @tally"?) at monk2.pl line 28.
        Execution of monk2.pl aborted due to compilation errors.

    As the variable $tally is defined beforehand and preceded by the keyword "my", I don't understand what is wrong. How could I fix this?

    If I run the same script without use strict;, the output looks like this:

    (?^:&sect;\s*[0-9]):	3
    (?^:Art\.\s*[0-9IVX]):	2
    (?^:Artikel\s*[0-9IVX]):	2
    (?^:Artikels\s*[0-9IVX]):	2
    (?^:Artikeln\s*[0-9IVX]):	2

    How could I get rid of "(?^:" and ")"? Would it be possible to save this output to a file?

    Have a nice afternoon!

      > As the variable $tally is defined beforehand and preceded by the keyword "my", I don't understand what is wrong. How could I fix this?

      The scalar variable $tally is a different variable from the array @tally. Single elements of the array are accessed with a dollar sign and an index in square brackets (for example $tally[0]), but they are still elements of the array @tally. So, you need to declare the array:

      my @tally;
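
      A small sketch of that distinction, with made-up values: $tally and @tally can coexist in the same scope, and $tally[0] always refers to the array.

      ```perl
      use strict;
      use warnings;

      my $tally = 'I am a scalar';   # the scalar $tally
      my @tally = (0, 0);            # the array @tally, a separate variable

      ++$tally[0];                   # element 0 of @tally; $tally is untouched
      $tally[1] += 2;

      print "$tally\n";              # I am a scalar
      print "@tally\n";              # 1 2
      ```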

      > How could I get rid of "(?^:" and ")"?

      One possibility is to use a regex:

      for my $i (0 .. $#regexes) {
          my $regex = $regexes[$i];
          $regex =~ s/^\(\?\^://;
          $regex =~ s/\)$//;
          print "$regex:\t$tally[$i]\n";
      }
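
      A possible alternative, not taken from the replies in this thread: the core re module exports regexp_pattern, which in list context returns the pattern source and the modifiers separately, so no substitution is needed. A minimal sketch with a shortened pattern list:

      ```perl
      use strict;
      use warnings;
      use re 'regexp_pattern';

      my @regexes = (qr/&sect;\s*[0-9]/, qr/Art\.\s*[0-9IVX]/);

      for my $regex (@regexes) {
          # List context: (pattern source, modifiers such as "i" or "x").
          my ($pattern, $modifiers) = regexp_pattern($regex);
          print "$pattern\n";
      }
      ```

      This also keeps working if a pattern is later compiled with flags such as /i, where the stringified prefix is no longer exactly "(?^:".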

      > Would it be possible to save this output to a file?

      The easiest way is to use redirection in your shell; it should work even on MSWin.

      perl script.pl > output.txt

      If you want to write to a file from within Perl, open a file for writing and print to it:

      open my $out, '>', 'output.txt' or die $!;
      for my $i (0 .. $#regexes) {
          my $regex = $regexes[$i];
          $regex =~ s/^\(\?\^://;
          $regex =~ s/\)$//;
          print {$out} "$regex:\t$tally[$i]\n";
      }
      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

        Thanks to your kind assistance I could get a working statistics tool :)

        But when I apply the script listed below to another file, I get the following error which really puzzles me:

        Use of uninitialized value in concatenation (.) or string at whitespace-stat.pl line 47, <$in> line 1 (#1)
        (W uninitialized) An undefined value was used as if it were already defined. It was interpreted as a "" or a 0, but maybe it was a mistake. To suppress this warning assign a defined value to your variables.

        To help you figure out what was undefined, perl will try to tell you the name of the variable (if any) that was undefined. In some cases it cannot do this, so it also tells you what operation you used the undefined value in. Note, however, that perl optimizes your program and the operation displayed in the warning may not necessarily appear literally in your program. For example, "that $foo" is usually optimized into "that " . $foo, and the warning will refer to the concatenation (.) operator, even though there is no . in your program.

        #!/usr/bin/perl
        use warnings;
        use strict;
        use diagnostics;

        #my personal data left out!

        print "Generate statistics: Whitespace in context\n";

        my $infile = $ARGV[0];

        #define regexes as search target (in the array @regexes)
        my @regexes = (
            qr/&sect;\s*[0-9]/,
            qr/Art\.\s*[0-9IVX]/,
            qr/Artikel\s*[0-9IVX]/,
            qr/Artikels\s*[0-9IVX]/,
            qr/Artikeln\s*[0-9IVX]/,
        );

        open my $in, '<', $infile or die "Cannot open $infile for reading: $!";

        #read input file in variable $xml
        my $xml;
        {
            local $/ = undef;
            $xml = <$in>;
        }

        #define array for frequency values
        my @tally;

        #count routine for each regex
        for my $i (0 .. $#regexes) {
            my $regex = $regexes[$i];
            ++$tally[$i] while $xml =~ /$regex/g;
        }

        #define output file
        open my $out, '>', 'stats.txt' or die $!;

        #output statistics
        print {$out} "Statistics: Whitespace in context\n\ninput file: ";
        print {$out} "$infile";
        print {$out} "\n========================================================================\n\n";

        for my $i (0 .. $#regexes) {
            my $regex = $regexes[$i];
            $regex =~ s/^\(\?\^://;
            $regex =~ s/\)$//;
            print {$out} "$regex:\t\t$tally[$i]\n";
        }

        close $in;
        close $out;
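
        One plausible cause, assuming the second input file contains no match at all for at least one of the patterns: ++$tally[$i] never runs for that pattern, so $tally[$i] stays undef, and interpolating it in the final print triggers the warning. A sketch of two common remedies, using made-up input text: pre-fill the array with zeros, or fall back to 0 with the defined-or operator // (Perl 5.10 or later) when printing.

        ```perl
        use strict;
        use warnings;

        # Made-up input: it contains no "Art." reference.
        my $xml     = 'siehe &sect; 5 und &sect;12';
        my @regexes = (qr/&sect;\s*[0-9]/, qr/Art\.\s*[0-9IVX]/);

        # Remedy 1: start every counter at 0 instead of undef.
        my @tally = (0) x @regexes;

        for my $i (0 .. $#regexes) {
            my $regex = $regexes[$i];
            ++$tally[$i] while $xml =~ /$regex/g;
        }

        for my $i (0 .. $#regexes) {
            # Remedy 2 (redundant after remedy 1, shown for comparison):
            # default to 0 on output if the counter is still undef.
            my $count = $tally[$i] // 0;
            print "pattern $i:\t$count\n";
        }
        ```

        Either remedy alone is enough; with this sample text the first pattern is counted twice and the second reports 0 instead of warning.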

      As the variable $tally is defined beforehand and preceded by the keyword "my", I don't understand what is wrong. How could I fix this?

      You have declared $tally, which is a scalar, but the errors are telling you about @tally, which is an array. Since your loops refer to the array and not the scalar, the array is what you need to declare instead. See "the basic datatypes, three" for more about the basic data types in Perl and how the sigils relate to them.

      How could I get rid of "(?^:" and ")"?

      You could post-process the string which you output to achieve this, but in this particular case you can avoid that by using quotes to delimit each regex in the first place instead of using the qr// operator. You can use single quotes 'foo' or q/foo/ for non-interpolated strings, i.e.:

      my @regexes = (
          q/&sect;\s*[0-9]/,
          q/Art\.\s*[0-9IVX]/,
          q/Artikel\s*[0-9IVX]/,
          q/Artikels\s*[0-9IVX]/,
          q/Artikeln\s*[0-9IVX]/,
      );

      Bear in mind that these are now just simple strings, so you need to take care to use them explicitly in a regex context. But as that is what the rest of your code does anyway, there is no further change required here.
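
      To illustrate, here is a minimal sketch with made-up input text and a shortened pattern list: the strings are only compiled as patterns inside the match operator, and printing them shows the bare pattern text.

      ```perl
      use strict;
      use warnings;

      # Made-up input text, for illustration only.
      my $xml = 'Art. 5, Art.IV und &sect; 3';

      # Plain strings: q// does not interpolate, so the backslashes survive as-is.
      my @regexes = (q/&sect;\s*[0-9]/, q/Art\.\s*[0-9IVX]/);

      for my $regex (@regexes) {
          # The string is compiled as a pattern only here, inside the match.
          my $count = () = $xml =~ /$regex/g;
          print "$regex:\t$count\n";     # prints the bare pattern, no (?^:...)
      }
      ```

      One trade-off of this design: a plain string is recompiled whenever the pattern used in the match changes, whereas qr// compiles once up front. For a handful of patterns that difference is unlikely to matter.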

      Would it be possible to save this output to a file?

      Of course. See e.g. Re: How do I write to a file?

      Do have a browse through the Tutorials section here and the Getting Started with Perl section in particular. These should help you achieve some of these simple tasks while you become more familiar with the language.


      🦛