Hello Monks,
I came here for your criticism, feedback and proposals for improvement. I have developed a simple script for grepping paragraphs (blocks of text lines delimited by a specific separator; blank lines by default).
The common use case is parsing Java log entries that can extend over multiple lines:
paragrep -Pp '^\d+/\d+/\d+ \d+:\d+:\d+' PATTERN FILENAME
Another use case is filtering the sections of ini files that match particular strings:
paragrep -Pp '^\[' PATTERN FILENAME
Next I am going to improve pattern searching and add support for -a/--and and -o/--or options to control matches. With this message I ask you to test the script and point out possible weaknesses in performance and efficiency.
The original and current code is hosted on GitHub: https://github.com/ildar-shaimordanov/perl-utils
Here is the latest version of the script (as of the time of writing):
#!/usr/bin/env perl
=head1 NAME
paragrep - grep-like filter for searching matches in paragraphs
=head1 SYNOPSIS
paragrep --help
paragrep OPTIONS
=head1 DESCRIPTION
paragrep assumes the input consists of paragraphs and prints the
paragraphs matching a pattern. A paragraph is a block of text
delimited by empty or blank lines.
=head1 OPTIONS
=head2 Generic Program Information
=over 4
=item B<-h>, B<--help>
Print this help message and exit.
=item B<--version>
Print the program version and exit.
=item B<--debug>
Print debug information to STDERR.
=back
=head2 Paragraph Matching Control
=over 4
=item B<-p> I<PATTERN>, B<--break-of-paragraph=>I<PATTERN>
Use I<PATTERN> as the pattern to identify the break of paragraphs. By
default, this value is C<^\s*$>. The break of paragraphs is considered as
a separator and excluded from the output.
=item B<-P>, B<--begin-of-paragraph>
If this option is specified on the command line, the meaning of the option
B<-p> is modified to identify the first line of a paragraph, which is
considered part of the paragraph.
=back
=head2 Matching Control
=over 4
=item B<-e> I<PATTERN>, B<--regexp=>I<PATTERN>
Use I<PATTERN> as the pattern. This can be used to specify multiple search
patterns, or to protect a pattern beginning with a hyphen (I<->).
This option can be specified multiple times. It can also be omitted for
brevity, in which case the first non-option argument is taken as the pattern.
=item B<-i>, B<--ignore-case>
Ignore case distinctions in both the I<PATTERN> and the input files.
=item B<-v>, B<--invert-match>
Invert the sense of matching, to select non-matching paragraphs.
=item B<-w>, B<--word-regexp>
Select only those paragraphs containing matches that form whole words. The
test is that the matching substring must either be at the beginning of a
line of the paragraph, or be preceded by a non-word constituent character.
Similarly, it must be either at the end of a line of the paragraph or be
followed by a non-word constituent character. Word-constituent characters
are letters, digits, and the underscore.
=back
=head1 EXAMPLES
The following example demonstrates a customized paragraph definition for
parsing log files. Usually, applications producing log files write one log
entry per line. Sometimes applications (especially those written in Java)
produce multiline log entries. Each log entry begins with a timestamp in
the generalized form C<date time>, which can be matched by the pattern
C<\d+/\d+/\d+ \d+:\d+:\d+> regardless of which date format has
been used to output dates:
paragrep -Pp '^\d+/\d+/\d+ \d+:\d+:\d+' PATTERN FILENAME
=head1 SEE ALSO
grep(1)
perlre(1)
=head1 COPYRIGHT
Copyright 2017 Ildar Shaimordanov E<lt>F<ildar.shaimordanov@gmail.com>E<gt>
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
=cut
# =========================================================================
use strict;
use warnings;
no warnings "utf8";
use open qw( :std :utf8 );
use Pod::Usage;
use Getopt::Long qw( :config no_ignore_case bundling auto_version );
our $VERSION = "0.2";
my $debug = 0;
my $verbose = 0;
my $break_of_para = '^\\s*$';
my $begin_of_para = 0;
my $ignore_case = 0;
my $invert_match = 0;
my $word_regexp = 0;
my @patterns = ();
my $match_pattern;
my @globs = ();
my @files = ();
# =========================================================================
pod2usage unless GetOptions(
    "h|help" => sub {
        pod2usage({
            -verbose => 2,
            -noperldoc => 0,
        });
    },
    "debug" => \$debug,
    "p|break-of-paragraph=s" => \$break_of_para,
    "P|begin-of-paragraph" => \$begin_of_para,
    "e|regexp=s" => \@patterns,
    "i|ignore-case" => \$ignore_case,
    "v|invert-match" => \$invert_match,
    "w|word-regexp" => \$word_regexp,
    "<>" => sub {
        push @globs, $_[0];
    },
);
# =========================================================================
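# Compile the pattern $v (named $k in error messages) and die if it is invalid
sub validate_re {
    my ( $v, $k, $ignore_case, $word_regexp ) = @_;
    $k ||= "<anon>";
    $v = "\\b($v)\\b" if $word_regexp;
    # eval returns undef if the pattern fails to compile
    my $re = eval { $ignore_case ? qr/$v/im : qr/$v/m };
    die "Bad regexp: $k = $v\n" unless defined $re;
    return $re;
}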
# If no patterns, assume the first item of the list is the pattern
push @patterns, shift @globs if ! @patterns && @globs;
# Validate all the patterns before combining into the single one
pod2usage unless @patterns;
validate_re( $_, "pattern", $ignore_case ) for @patterns;
# Combine all patterns into the single pattern
$match_pattern = validate_re join("|", @patterns), "", $ignore_case, $word_regexp;
# Expand filename patterns
@files = map { glob } @globs;
# If the list of files is empty, assume reading from STDIN
push @files, "-" unless @files;
# Validate and setup the pattern identifying paragraphs
$break_of_para = validate_re $break_of_para, "break-of-paragraph";
# =========================================================================
warn <<DATA if $debug;
PARAGRAPH MATCHING CONTROL
break-of-paragraph = $break_of_para
begin-of-paragraph = $begin_of_para
MATCHING CONTROL
match-pattern = $match_pattern
invert-match = $invert_match
FILES
@files
DATA
# =========================================================================
my $para;
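# Print the collected paragraph if it matches (or, with -v, if it does not),
# then reset the buffer
sub print_para {
    print $para if defined $para && ( $para =~ m/$match_pattern/ xor $invert_match );
    $para = "";
}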
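sub grep_file {
    my $file = shift;
    my $fh;
    if ( $file eq "-" ) {
        $fh = \*STDIN;
    } else {
        if ( -d $file ) {
            warn "Not a file: $file\n";
            return;
        }
        # Three-argument open with a lexical handle, so that a file name
        # starting with "<", ">" or "|" is not taken as an open mode
        open $fh, "<", $file or do {
            warn "Unable to read file: $file: $!\n";
            return;
        };
    }
    while ( <$fh> ) {
        if ( m/$break_of_para/ ) {
            print_para();
            next unless $begin_of_para;
        }
        $para .= $_;
    }
    # Flush the last paragraph; this also resets the buffer for the next file
    print_para();
    close $fh unless $file eq "-";
}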
# =========================================================================
grep_file $_ foreach ( @files );
# =========================================================================
# EOF
Thank you
Re: Paragraph grep: request for testing, comments and feedbacks
by hippo (Bishop) on Oct 05, 2017 at 10:39 UTC
The original and actual code is hosted on github (It's not permitted to post external links but you can search for ildar-shaimordanov/perl-utils)
Actually, posting external links is fine in general. What is frowned upon is refusing to post code here and instead saying "You can see my code at http://www.geocities.com/..." because over time the externally linked code may degrade or vanish and the resultant thread is then rather moot. Since you've posted your script here as it stands I fail to see how anyone could also object to linking to your github repo, especially when the README there links back to the monastery.
Your script looks in pretty good shape to me from a cursory inspection. I'll be pleased to test it out when I have some time.
|
A couple of minutes ago I was able to update my post, adding the link to the GitHub repository just before the source code. I think it is better if someone looking for similar functionality finds the link to the current version next to the initial code published here.
Re: Paragraph grep
by Anonymous Monk on Oct 04, 2017 at 18:50 UTC
$ cat input.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean quis
elit tempus, hendrerit sem a, maximus urna. Aenean vitae est at risus
fringilla egestas vitae in lacus.

In a metus vel elit varius rhoncus. Morbi at sem euismod, tincidunt
nunc quis, maximus quam. Sed maximus nibh vel suscipit ullamcorper.
Mauris sed ex ut nulla accumsan feugiat.

Donec sit amet sapien laoreet mauris sodales scelerisque. Aliquam
varius diam sit amet mollis iaculis. Quisque vel neque auctor, feugiat
velit eleifend, ultrices nunc. Vivamus condimentum metus quis nunc
tincidunt lobortis. Fusce a dolor sed tellus condimentum vulputate.

Proin ac tortor ut metus mattis gravida. Ut quis orci ornare, aliquet
dolor id, commodo justo.
$ perl -ln00e '/sed/i and print' input.txt
In a metus vel elit varius rhoncus. Morbi at sem euismod, tincidunt
nunc quis, maximus quam. Sed maximus nibh vel suscipit ullamcorper.
Mauris sed ex ut nulla accumsan feugiat.
Donec sit amet sapien laoreet mauris sodales scelerisque. Aliquam
varius diam sit amet mollis iaculis. Quisque vel neque auctor, feugiat
velit eleifend, ultrices nunc. Vivamus condimentum metus quis nunc
tincidunt lobortis. Fusce a dolor sed tellus condimentum vulputate.
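The one-liner above relies on the -00 switch, which puts Perl into paragraph mode by setting $/ to the empty string, so each read returns one blank-line-delimited paragraph. A rough script-level equivalent might look like the sketch below; the helper name grep_paragraphs and the sample text are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return the blank-line-delimited paragraphs of $text
# that match $re, roughly what `perl -ln00e '/sed/i and print'` selects.
sub grep_paragraphs {
    my ( $text, $re ) = @_;
    open my $fh, "<", \$text or die "Cannot read in-memory text: $!";
    local $/ = "";    # paragraph mode: one or more blank lines end a record
    my @found;
    while ( my $para = <$fh> ) {
        chomp $para;    # in paragraph mode chomp strips all trailing newlines
        push @found, $para if $para =~ $re;
    }
    close $fh;
    return @found;
}

# Made-up sample input with three paragraphs
my $text = "first paragraph\nmentions sed here\n\n\nsecond one\n\nSed again\n";
print "$_\n" for grep_paragraphs( $text, qr/sed/i );
```

Paragraph mode only understands blank lines as separators, which is the limitation the script in this thread tries to lift.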
|
Thanks for your comment. I know these options, but they don't solve the task of parsing log files. Most probably I wasn't specific enough, so some explanation is required.
A log file could be:
2017-09-04 22:02:14.123 INFO: Some log message having param1=value1
2017-09-04 22:02:14.349 DEBUG: Multiline log entry
Some extended logging:
debug {
param1 value1
param2 value2
}
2017-09-04 22:02:14.658 INFO: Another log message param2=value2
If we need all entries containing some specific string (let's say value1), it is difficult to parse the file with -00, because the entries are not separated by blank lines. That's why I reinvented the wheel. :)
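To make the difference from -00 concrete, here is a minimal sketch of the idea behind `paragrep -P -p ...`: a new entry starts at every line that looks like a timestamp, and continuation lines are appended to the current entry. The helper name split_entries is hypothetical, and the timestamp pattern is adapted to the dashes used in the sample log above; this is not the script's actual implementation.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: split @lines into entries, starting a new entry at
# every line that matches $begin_re.
sub split_entries {
    my ( $begin_re, @lines ) = @_;
    my @entries;
    for my $line (@lines) {
        # A header line opens a new entry; so does the very first line
        push @entries, "" if $line =~ $begin_re || !@entries;
        $entries[-1] .= $line;
    }
    return @entries;
}

my @log = (
    "2017-09-04 22:02:14.123 INFO: Some log message having param1=value1\n",
    "2017-09-04 22:02:14.349 DEBUG: Multiline log entry\n",
    "Some extended logging:\n",
    "debug {\n",
    "param1 value1\n",
    "param2 value2\n",
    "}\n",
    "2017-09-04 22:02:14.658 INFO: Another log message param2=value2\n",
);

# Assumed timestamp shape for this sample: date with dashes, then time
my $begin_re = qr/^\d+-\d+-\d+ \d+:\d+:\d+/;

# Print every whole entry that mentions value1
print $_ for grep { /value1/ } split_entries( $begin_re, @log );
```

With the sample log, the first entry and the whole multiline DEBUG entry are selected, while the last INFO entry is skipped.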
Re: Paragraph grep: request for testing, comments and feedbacks
by siberia-man (Friar) on Sep 28, 2019 at 02:53 UTC
After almost two years of moderate usage I ran into the lack of a useful feature of the standard grep: prepending line numbers and file names to the output. Tonight I decided to close this gap and implemented the missing functionality. While I was at it, I turned on autoflushing as well. Please meet the updated version and use or test it if you want :)
# print line numbers
paragrep -n PATTERN FILENAME
# print file names
paragrep -H PATTERN FILENAME
# suppress printing file names
paragrep -h PATTERN FILENAME...
Later, if none of us discovers any bugs, I will update the initial post with the latest version of the script.
The script lives on GitHub at https://github.com/ildar-shaimordanov/perl-utils/blob/master/perl/paragrep
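For readers curious how line numbers can be attached to paragraphs, one possible approach is to remember the value of `$.` when a paragraph starts. The sketch below only illustrates that idea: the helper name numbered_paragraphs and the "number:paragraph" output are assumptions, not the format the updated script actually prints.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: read blank-line-delimited paragraphs from $fh and
# return [ $first_line_number, $paragraph ] pairs.
sub numbered_paragraphs {
    my ($fh) = @_;
    my ( @result, $start, $para );
    while ( my $line = <$fh> ) {
        if ( $line =~ /^\s*$/ ) {
            # A blank line closes the current paragraph, if there is one
            push @result, [ $start, $para ] if defined $para;
            undef $para;
            next;
        }
        # $. holds the number of the line just read from $fh
        $start = $. unless defined $para;
        $para .= $line;
    }
    push @result, [ $start, $para ] if defined $para;
    return @result;
}

# Made-up sample input
my $text = "alpha\nbeta\n\ngamma\n";
open my $fh, "<", \$text or die "Cannot read in-memory text: $!";
printf "%d:%s", @$_ for numbered_paragraphs($fh);
close $fh;
```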
Re: Paragraph grep: request for testing, comments and feedbacks
by siberia-man (Friar) on Nov 27, 2017 at 00:09 UTC
As a follow-up to this thread, I am happy to say that I have improved and extended the script. The new options --file=FILE, --or and --and ship with the new version. According to the script's documentation, they work as follows:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line.
-A, --and, -O, --or
These options specify whether multiple search patterns specified by the -e options should be logically ANDed together or logically ORed together. If neither is specified, the patterns are ORed. These options can be used to simplify commands that search for matches to multiple patterns. More than one of them can be specified, but only the last one takes effect.
The following example shows how the combining option simplifies usage. The
resulting output will consist of the paragraphs matching both PATTERN1
and PATTERN2.
cat FILENAME | paragrep -e PATTERN1 -e PATTERN2 -A
cat FILENAME | paragrep -e PATTERN1 | paragrep -e PATTERN2
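The difference between the two modes can be sketched in a few lines of Perl: count how many patterns match the paragraph, then require all of them for "and" or at least one for "or". The helper name para_matches and its mode argument are hypothetical and are not taken from the script itself.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: does $para match @patterns combined with "and"
# (every pattern must match) or "or" (at least one must match)?
sub para_matches {
    my ( $para, $mode, @patterns ) = @_;
    # grep in scalar context counts the patterns that match
    my $hits = grep { $para =~ $_ } @patterns;
    return $mode eq "and" ? $hits == @patterns : $hits > 0;
}

# Made-up paragraph and patterns
my $para     = "param1=value1\nparam2=value2\n";
my @patterns = ( qr/value1/, qr/value3/ );

printf "and: %s\n", para_matches( $para, "and", @patterns ) ? "match" : "no match";
printf "or:  %s\n", para_matches( $para, "or",  @patterns ) ? "match" : "no match";
```

Here only one of the two patterns occurs in the paragraph, so the "and" check fails while the "or" check succeeds.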
Your meditations are welcome, Monks :)