
Hello Monks, I have come for your criticism, feedback, and suggestions for improvement. I have developed a simple script for grepping paragraphs: blocks of text lines delimited by a specific separator (blank lines, by default).

A common use case is parsing Java log entries that can span multiple lines:

paragrep -Pp '^\d+/\d+/\d+ \d+:\d+:\d+' PATTERN FILENAME

Another use case is filtering the sections of an ini file that match particular strings:

paragrep -Pp '^\[' PATTERN FILENAME
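
To illustrate what -P changes in the two invocations above, here is a minimal sketch of the splitting logic, reduced to a standalone function. The name split_paragraphs and the sample ini text are hypothetical and exist only for this example; the real script streams lines from files instead of splitting a string.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of paragrep: split text into paragraphs.
# With $begin true, a line matching $re starts a new paragraph (like -P);
# otherwise the matching line is a separator and is dropped.
sub split_paragraphs {
    my ( $text, $re, $begin ) = @_;
    my @paras;
    my $para = "";
    for my $line ( split /^/m, $text ) {
        if ( $line =~ $re ) {
            push @paras, $para if length $para;
            $para = "";
            next unless $begin;
        }
        $para .= $line;
    }
    push @paras, $para if length $para;
    return @paras;
}

# Sample data, made up for the example
my $ini = <<'INI';
[server]
host = example.org
port = 8080
[client]
timeout = 30
INI

# Roughly what "paragrep -Pp '^\[' timeout FILENAME" selects:
# the whole [client] section, header line included
my @sections = split_paragraphs( $ini, qr/^\[/, 1 );
print grep { /timeout/ } @sections;
```

Passing a false third argument and the default pattern qr/^\s*$/ gives the blank-line behavior instead, where the separator lines themselves are discarded.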

Next, I plan to improve pattern searching and add -a/--and and -o/--or options to control how multiple patterns are combined. With this message I ask you to test the script and point out possible performance and efficiency problems.
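
The planned -a/--and and -o/--or semantics could look roughly like the sketch below. This is only an assumption about how the options might behave: the function para_matches, its "and"/"or" mode argument, and the sample log entry are hypothetical, and none of this exists in the script yet.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether a paragraph satisfies the patterns.
# $mode is "and" (every pattern must match) or "or" (any pattern matches).
sub para_matches {
    my ( $para, $mode, @res ) = @_;
    # grep in scalar context counts the patterns that match
    my $hits = grep { $para =~ $_ } @res;
    return $mode eq "and" ? $hits == @res : $hits > 0;
}

# Compile each pattern once, outside the per-paragraph loop
my @res = map { qr/$_/m } ( 'ERROR', 'disk full' );

# Sample multi-line log entry, made up for the example
my $para = "2017/05/01 10:00:00 ERROR connection failed\n"
         . "    caused by: timeout\n";

# Only "ERROR" matches, so "and" fails while "or" succeeds
print "and: ", ( para_matches( $para, "and", @res ) ? "match" : "no match" ), "\n";
print "or:  ", ( para_matches( $para, "or",  @res ) ? "match" : "no match" ), "\n";
```

Keeping the patterns as separate compiled regexes, rather than joining them with "|", is what makes the "and" case possible. If performance matters, any and all from List::Util would stop at the first decisive pattern instead of testing all of them.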

The current code is hosted on GitHub -- https://github.com/ildar-shaimordanov/perl-utils
Here is the latest version of the script (at the time of posting):

#!/usr/bin/env perl

=head1 NAME

paragrep - grep-like filter for searching matches in paragraphs

=head1 SYNOPSIS

    paragrep --help
    paragrep OPTIONS

=head1 DESCRIPTION

paragrep assumes the input consists of paragraphs and prints the
paragraphs matching a pattern. A paragraph is a block of text
delimited by empty or blank lines.

=head1 OPTIONS

=head2 Generic Program Information

=over 4

=item B<-h>, B<--help>

Print this help message and exit.

=item B<--version>

Print the program version and exit.

=item B<--debug>

Print debug information to STDERR.

=back

=head2 Paragraph Matching Control

=over 4

=item B<-p> I<PATTERN>, B<--break-of-paragraph=>I<PATTERN>

Use I<PATTERN> as the pattern identifying the break between paragraphs. By
default, this value is C<^\s*$>. The break is treated as a separator and
excluded from the output.

=item B<-P>, B<--begin-of-paragraph>

If this option is specified, the meaning of the option B<-p> changes: its
pattern identifies the first line of a paragraph, and that line is kept as
part of the paragraph.

=back

=head2 Matching Control

=over 4

=item B<-e> I<PATTERN>, B<--regexp=>I<PATTERN>

Use I<PATTERN> as the pattern. This can be used to specify multiple search 
patterns, or to protect a pattern beginning with a hyphen (I<->). 

This option can be specified multiple times, or omitted for brevity, in
which case the first non-option argument is taken as the pattern.

=item B<-i>, B<--ignore-case>

Ignore case distinctions in both the I<PATTERN> and the input files. 

=item B<-v>, B<--invert-match>

Invert the sense of matching, to select non-matching paragraphs.

=item B<-w>, B<--word-regexp>

Select only those paragraphs containing matches that form whole words. The
test is that the matching substring must either be at the beginning of a
line of the paragraph, or be preceded by a non-word constituent character.
Similarly, it must either be at the end of a line of the paragraph or be
followed by a non-word constituent character. Word-constituent characters
are letters, digits, and the underscore.

=back

=head1 EXAMPLES

The following example demonstrates a customized paragraph definition for
parsing log files. Usually, applications producing log files write one log
entry per line. Sometimes applications (especially those written in Java)
produce multiline log entries. Each log entry begins with a timestamp in
the generalized form C<date time>, which can be matched by the pattern
C<\d+/\d+/\d+ \d+:\d+:\d+> regardless of which date format was
used to output dates:

    paragrep -Pp '^\d+/\d+/\d+ \d+:\d+:\d+' PATTERN FILENAME

=head1 SEE ALSO

grep(1)

perlre(1)

=head1 COPYRIGHT

Copyright 2017 Ildar Shaimordanov E<lt>F<ildar.shaimordanov@gmail.com>E<gt>

This program is free software; you can redistribute it and/or modify it 
under the same terms as Perl itself.

=cut

# =========================================================================

use strict;
use warnings;

no warnings "utf8";
use open qw( :std :utf8 );

use Pod::Usage;
use Getopt::Long qw( :config no_ignore_case bundling auto_version );

our $VERSION = "0.2";

my $debug = 0;
my $verbose = 0;

my $break_of_para = '^\\s*$';
my $begin_of_para = 0;

my $ignore_case = 0;
my $invert_match = 0;
my $word_regexp = 0;

my @patterns = ();
my $match_pattern;

my @globs = ();
my @files = ();

# =========================================================================

pod2usage unless GetOptions(
	"h|help" => sub {
		pod2usage({
			-verbose => 2, 
			-noperldoc => 0, 
		});
	}, 

	"debug" => \$debug, 

	"p|break-of-paragraph=s" => \$break_of_para, 
	"P|begin-of-paragraph" => \$begin_of_para, 

	"e|regexp=s" => \@patterns, 

	"i|ignore-case" => \$ignore_case, 
	"v|invert-match" => \$invert_match, 
	"w|word-regexp" => \$word_regexp, 

	"<>" => sub {
		push @globs, $_[0];
	}, 
);

# =========================================================================

sub validate_re {
	my ( $v, $k, $ignore_case, $word_regexp ) = ( shift, shift || "<anon>", shift, shift );
	$v = "\\b($v)\\b" if $word_regexp;
	my $re = eval { $ignore_case ? qr/$v/im : qr/$v/m };
	die "Bad regexp: $k = $v\n" if $@;
	$re;
}

# If no patterns, assume the first item of the list is the pattern
push @patterns, shift @globs if ! @patterns && @globs;

# Validate all the patterns before combining into the single one
pod2usage unless @patterns;
validate_re( $_, "pattern", $ignore_case ) for @patterns;

# Combine all patterns into the single pattern
$match_pattern = validate_re join("|", @patterns), "", $ignore_case, $word_regexp;

# Expand filename patterns
@files = map { glob } @globs;

# If the list of files is empty, assume reading from STDIN
push @files, "-" unless @files;

# Validate and setup the pattern identifying paragraphs
$break_of_para = validate_re $break_of_para, "break-of-paragraph";

# =========================================================================

warn <<DATA if $debug;
PARAGRAPH MATCHING CONTROL
    break-of-paragraph = $break_of_para
    begin-of-paragraph = $begin_of_para

MATCHING CONTROL
    match-pattern = $match_pattern
    invert-match  = $invert_match

FILES
    @files
DATA

# =========================================================================

my $para;

sub print_para {
	print $para if defined $para && ( $para =~ m/$match_pattern/ ^ $invert_match );
	$para = "";
}

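sub grep_file {
	my $file = shift;
	my $fh;

	if ( $file eq "-" ) {
		$fh = \*STDIN;
	} else {
		if ( -d $file ) {
			warn "Not a file: $file\n";
			return;
		}
		# Three-argument open: the file name is never parsed for a mode,
		# so names starting with ">" or ending with "|" are read as files
		open $fh, "<", $file or do {
			warn "Unable to read file: $file: $!\n";
			return;
		};
	}

	while ( <$fh> ) {
		if ( m/$break_of_para/ ) {
			# A break (or a new first line with -P) ends the current paragraph
			print_para;
			next unless $begin_of_para;
		}
		$para .= $_;
	}

	# Flush the last paragraph of the file
	print_para if $para;

	close $fh unless $file eq "-";
}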

# =========================================================================

grep_file $_ foreach ( @files );

# =========================================================================

# EOF

Thank you

In reply to Paragraph grep: request for testing, comments and feedbacks by siberia-man
