in reply to Remove all duplicates after regex capture

Here's a solution that doesn't depend on the order of the lines. It finds all the titles with a regex, counts them in a hash, and then selects the one that appears exactly once, dying if no title appears exactly once and warning if more than one does. You haven't specified a few things about the title: line, such as whether the titles can contain #'s and what kind of text might appear after the closing #.

    use warnings;
    use strict;

    # Sample data; each title:#...# entry is on its own line,
    # as the ^ anchor under /m requires.
    my %mycorpus = (
        a => "<blah blah blah blah\n"
           . "title:#this is text I want 1#\n"
           . "blah blah blah",
        b => "blah\n"
           . "title:#this is text I do not want#\n"
           . "title:#this is text I want 2#\n"
           . "blah\n"
           . "title:#this is text I do not want#\n"
           . "blah",
        c => "blah blah\n"
           . "title:#this is text I do not want#\n"
           . "title:#this is text I want 3#\n"
           . "title:#this is text I do not want#\n"
           . "title:#this is text I do not want#\n"
           . "blah",
    );

    for my $filename (sort keys %mycorpus) {
        my %titles;
        # Count every title captured between "title:#" and the closing "#".
        $titles{$1}++
            while $mycorpus{$filename} =~ m{ ^ title:\# (.*) \# }xmg;
        # Keep only the titles that occur exactly once.
        my @once = grep { $titles{$_}==1 } sort keys %titles;
        die "No title found in $filename" unless @once;
        warn "More than one title found in $filename" if @once>1;
        my $title = $once[0];
        print "$title\n";
    }
    __END__
    this is text I want 1
    this is text I want 2
    this is text I want 3
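If it turns out that titles never contain a # and that nothing except trailing spaces or tabs follows the closing # on the line, the pattern could be tightened along these lines. This is only a sketch under those two assumptions, which the original question does not confirm; unique_titles and $sample are hypothetical names made up for illustration.

```perl
use warnings;
use strict;

# Hypothetical helper (not part of the solution above): return the titles
# that occur exactly once in $text.
# Assumptions: a title never contains '#', and only spaces or tabs may
# follow the closing '#' on the same line.
sub unique_titles {
    my ($text) = @_;
    my %count;
    # [^\#\n]* cannot run past a '#', and [ \t]* $ rejects trailing text.
    $count{$1}++ while $text =~ m{ ^ title:\# ([^\#\n]*) \# [ \t]* $ }xmg;
    return grep { $count{$_} == 1 } sort keys %count;
}

# Made-up sample: one line has trailing spaces, one has trailing text.
my $sample = join "\n",
    "blah",
    "title:#this is text I do not want#",
    "title:#this is text I want 2#  ",
    "title:#not a title line# trailing text",
    "title:#this is text I do not want#",
    "blah";

print "$_\n" for unique_titles($sample);
```

With this variant, a line that has other text after the closing # is skipped rather than captured greedily up to the last # on the line, so whether it is an improvement depends on what the real data looks like.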

Re^2: Remove all duplicates after regex capture
by Maire (Scribe) on Aug 20, 2018 at 06:44 UTC
    Brilliant, thank you very much! (Also, thanks for including examples of how to implement "die" and "warn": I rarely use these signals myself, but having just run your code on my data, it allowed me to identify a major error in the formatting of some of the data that would have caused a major headache later!).