LinuxMatt has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have to process many kinds of text files in order to filter out contents. For each kind of file there is a list of susbstitution patterns to be applied on each line.

I would like to store all possible patterns in an array. Then I would call a subroutine on each line with the list of patterns to apply.

Example of pseudo-code:
@all_patterns = (s/#.*//, s/^\s+//, s/\s+$//, s/^Total//, s/^,// ); # and so on while(<FILE1>) { apply_patterns($_, (0,2,3)); # apply patterns 0, 2 and 3 } while(<FILE2>) { apply_patterns($_, (2,5,7)); # apply patterns 2, 5 and 7 } while(<FILE3>) { apply_patterns($_, (5,4,3)); # apply patterns 5, 4 and 3 }

How is it possible to write the 'apply_patterns' subroutine ?

Thanks for your help.

Replies are listed 'Best First'.
Re: Filtering files with lists of substitution patterns
by BrowserUk (Patriarch) on Aug 29, 2013 at 12:44 UTC

    Probably the simplest, if not the only, way would be to store subrefs in your array:

    my @all_patterns = ( sub{ $_[0] =~ s/#.*// }, sub{ $_[0] =~ s/^\s+// }, sub{ $_[0] =~ s/\s+$// }, sub{ $_[0] =~ s/^Total// }, sub{ $_[0] =~ s/^,// }, );

    Then your apply_patterns() becomes:

    sub apply_patterns { $_->( $_[0] ) for @all_patterns[ @_[ 1 .. $#_ ] ]; }

    You might be able to use the '_' prototype for the utility subs, but it gets messy.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Filtering files with lists of substitution patterns
by Athanasius (Archbishop) on Aug 29, 2013 at 13:25 UTC

    BrowserUk’s solution is cleaner; but, in the spirit of TMTOWTDI, here’s an approach using the qr// operator:

    #! perl use strict; use warnings; my @all_patterns = ( [ qr{#.*}, '' ], # 0. Comments [ qr{^\s+}, '' ], # 1. Initial whitespace [ qr{\s+$}, '' ], # 2. Final whitespace [ qr{^Total}, 'Sum' ], # 3. Initial "Total" --> "Sum" [ qr{^,}, '' ], # 4. Initial comma ); while (<DATA>) { my $string = apply_patterns($_, 0, 2, 3, 4); print "$string\n" if $string; } sub apply_patterns { my ($string, @indices) = @_; chomp $string; $string =~ s{$all_patterns[$_]->[0]} {$all_patterns[$_]->[1]}g for @indices; return $string; } __DATA__ First line # comment Total: Second line # Third line ,Fourth line

    Output:

    23:23 >perl 701_SoPW.pl First line Sum: Second line Fourth line 23:23 >

    Note: Depending on the nature of the filters, you may need to give careful attention to the order in which they are applied.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Filtering files with lists of substitution patterns
by kcott (Archbishop) on Aug 29, 2013 at 13:33 UTC

    G'day LinuxMatt,

    Here's the basis of how I might approach this:

    $ perl -Mwarnings -Mstrict -E ' my @test_data = ( "comment follows: #qwerty", " whitespace at start", "whitespace at end ", "Total: some total", ", starts with a comma", ); my @all_patterns = (q{#.*}, q{^\s+}, q{\s+$}, q{^Total}, q{^,}); my @files = qw{file1 file2 file3}; my %file_filters = ( file1 => [0, 2, 4], file2 => [2, 1, 3], file3 => [4, 3, 2] ); for my $file (@files) { say "File: $file Patterns: @all_patterns[@{$file_filters{$fil +e}}]"; my @this_files_data = @test_data; for my $line (@this_files_data) { say "Start: |$line|"; for (@{$file_filters{$file}}) { $line =~ s/$all_patterns[$_]//; } say "End: |$line|"; } } ' File: file1 Patterns: #.* \s+$ ^, Start: |comment follows: #qwerty| End: |comment follows:| Start: | whitespace at start| End: | whitespace at start| Start: |whitespace at end | End: |whitespace at end| Start: |Total: some total| End: |Total: some total| Start: |, starts with a comma| End: | starts with a comma| File: file2 Patterns: \s+$ ^\s+ ^Total Start: |comment follows: #qwerty| End: |comment follows: #qwerty| Start: | whitespace at start| End: |whitespace at start| Start: |whitespace at end | End: |whitespace at end| Start: |Total: some total| End: |: some total| Start: |, starts with a comma| End: |, starts with a comma| File: file3 Patterns: ^, ^Total \s+$ Start: |comment follows: #qwerty| End: |comment follows: #qwerty| Start: | whitespace at start| End: | whitespace at start| Start: |whitespace at end | End: |whitespace at end| Start: |Total: some total| End: |: some total| Start: |, starts with a comma| End: | starts with a comma|

    Be aware how the order of the patterns matters. At the start of the output, you'll see:

    File: file1 Patterns: #.* \s+$ ^, Start: |comment follows: #qwerty| End: |comment follows:|

    However, had those first two patterns been reversed, you'd see:

    File: file1 Patterns: \s+$ #.* ^, Start: |comment follows: #qwerty| End: |comment follows: |

    Note the extra space after "comment follows:".

    -- Ken

Re: Filtering files with lists of substitution patterns
by Eily (Monsignor) on Aug 29, 2013 at 13:27 UTC

    As long as the right part of the substitution if the same (nothing, in your case) you can easily use references to regexen with qr, and if you want a custom replacement, you would have to use the e switch for non constant strings.

    So in your case that would be :

    my @patterns = (qr/#.*/, qr{$^\s+}, qr{\s+$}, qr/^,/); sub apply_patterns { $_[0] =~ s/$_// for (@patterns[ @_[ 1 .. $#_ ] ]); }
    I usually don't like user subs to modify their parameters unless they are methods though. And calling the patterns by what they do rather than their number would make more sense. So you could actually just write:
    my %patterns = ( comment => qr/#.*/, leadingSpaces => qr{$^\s+}, trailingSpaces => qr{\s+$}, leadingComma => qr/^,/ ); while(my $line = <FILE1>) { $line =~ s/$_// for %patterns{qw/comment trailingSpaces/}; }

Re: Filtering files with lists of substitution patterns
by golux (Chaplain) on Aug 29, 2013 at 13:21 UTC
    Hi LinuxMatt,

    I'd suggest using qr to stringify each regex:

    my @all_patterns = ( qr/#.*/, qr/^\s+/, qr/\s+$/, qr/^Total/, qr/^,/, # ... and so on ... );

    Which avoids the overhead of calling an extra subroutine for each line of every file.

    Then the apply_patterns subroutine would something like this:

    sub apply_patterns { my ($string, @indexes) = @_; # Note $string is only a placeholder foreach my $idx (@indexes) { my $regex = $all_patterns[$idx]; s/$regex//; } }

    Update:   Modified apply_patterns to work with $_[0] directly.

    say  substr+lc crypt(qw $i3 SI$),4,5