Re: Remove all duplicates after regex capture

Here's a solution that doesn't depend on the order of the lines. It finds all the titles with a regex, counts them in a hash, and then selects the title that appears exactly once. It dies if no title appears exactly once and warns if more than one does. You haven't specified a few things about the title: lines, such as whether the titles themselves can contain #'s, and what kind of text might appear after the closing #.

use warnings;
use strict;

my %mycorpus = (
    a => "<blah blah blah 
blah
title:#this is text I want 1#
blah blah blah",

    b => "blah
title:#this is text I do not want#
title:#this is text I want 2#
blah
title:#this is text I do not want#
blah",

    c => "blah blah
title:#this is text I do not want#
title:#this is text I want 3#
title:#this is text I do not want#
title:#this is text I do not want#
blah",
);

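for my $filename (sort keys %mycorpus) {
    # Count how often each title occurs in this file's text.
    # /m lets ^ match at every line start; /g finds every title: line.
    my %titles;
    $titles{$1}++
        while $mycorpus{$filename} =~ m{ ^ title:\# (.*) \# }xmg;
    # Keep only the titles that were seen exactly once.
    my @once = grep { $titles{$_}==1 } sort keys %titles;
    die "No title found in $filename" unless @once;
    warn "More than one title found in $filename" if @once>1;
    my $title = $once[0];
    print "$title\n";
}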

__END__

this is text I want 1
this is text I want 2
this is text I want 3
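
On the two open questions (#'s inside a title, and text after the closing #), the choice of pattern matters. Below is a minimal sketch comparing three variants of the regex. The sample lines and the names @lines and %patterns are made up for illustration and are not taken from your data, so treat this as an experiment to adapt rather than a recommendation.

```perl
use warnings;
use strict;

# Hypothetical sample lines (assumptions, not from the original corpus).
my @lines = (
    "title:#plain title#",
    "title:#title with # inside#",
    "title:#title# trailing text",
    "title:#title# trailing # text",
);

# Three candidate patterns; under /x a literal # must be escaped.
my %patterns = (
    greedy    => qr{ ^ title:\# (.*)  \# }xm,        # up to the LAST # on the line
    nongreedy => qr{ ^ title:\# (.*?) \# }xm,        # up to the FIRST # after the opener
    anchored  => qr{ ^ title:\# (.*)  \# \s* $ }xm,  # closing # must end the line
);

for my $line (@lines) {
    for my $name (sort keys %patterns) {
        # List-context match returns the capture, or an empty list on failure.
        my ($title) = $line =~ $patterns{$name};
        printf "%-10s %-32s => %s\n",
            $name, $line, defined $title ? "'$title'" : 'no match';
    }
}
```

The greedy form (the one used in the code above) keeps #'s inside a title but also swallows trailing text that contains a #. The non-greedy form stops at the first #, so it truncates titles with embedded #'s. The anchored form rejects any line with text after the closing #. Which one is right depends on what your data actually looks like.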

Re^2: Remove all duplicates after regex capture by Maire (Scribe) on Aug 20, 2018 at 06:44 UTC
Brilliant, thank you very much! (Also, thanks for including examples of how to implement "die" and "warn". I rarely use these signals myself, but when I ran your code on my data, they let me identify a major error in the formatting of some of the data that would have caused a major headache later!)