
This is essentially the same approach haukex uses, but with some different features:

Titles are normalized before being added to the %titles duplicate-detection hash, which is keyed on the normalized form. This means that, in this implementation,
'This is a title'
and
'This IS A Title'
are the same for the purpose of rejecting duplicate titles. This may or may not be what Maire wants; it's just an example of what's possible. The extent of normalization is easily adjusted.
Titles are allowed to wrap from one line to the next. This applies to both unique and duplicate titles. Again, Maire may not want this, and it's easily turned off.
A title need not be the only thing on a line. This is another thing that's easy to change.
A '*' is used in place of the original '#' character to delimit titles. This is done only to demonstrate that any character, even a regex metacharacter, could be used as a delimiter. (It wouldn't be that hard to change the regexes to accommodate multi-character delimiter sequences.)
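
As a rough sketch of the multi-character delimiter idea mentioned above: the single-character negated class [^...] no longer works, so the body can instead be matched one character at a time with a negative lookahead for the closing sequence. The '<<' and '>>' delimiters and the extract_titles() helper are assumptions made up for this illustration; they are not part of the script below.

```perl
use warnings;
use strict;

# Hypothetical multi-character delimiters, chosen only for illustration.
my ($open, $close) = ('<<', '>>');

my $rx_intro = qr{ title: \Q$open\E  }xms;
my $rx_outro = qr{        \Q$close\E }xms;
# Body: any run of characters, none of which starts the closing sequence.
my $rx_body  = qr{ (?: (?! $rx_outro ) . )* }xms;

# Hypothetical helper: return every delimited title found in $text.
sub extract_titles {
    my ($text) = @_;

    my @titles;
    while ($text =~ m{ $rx_intro ($rx_body) $rx_outro }xmsg) {
        push @titles, $1;
    }
    return @titles;
}

# A lone '>' inside a title is fine; only the full '>>' ends it.
my $text = "blah title:<<a > b>> blah\ntitle:<<two\nlines>> end";
print "'$_'\n" for extract_titles($text);
```

Because of the /s modifier, '.' also matches newlines, so titles may still wrap across lines in this variant.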

Script remove_dup_lines_1.pl:

use warnings;
use strict;

use Data::Dump qw(dd);

my %mycorpus = (
    a => "<blah blah blah
blah
title:*this is text I want 1*
blah blah blah",

    b => "blah
title:*this is text I do not want*
title:*this is text I want one*
blah
title:*this is text
I do not want*
blah",

    c => "blah blah
title:*this is text I do not
want*
title:*this is also text I do not want*
title:*this is text I want A*
title:*This Is Text I Do Not Want*
title:*this is ALSO
text I DO NOT WANT*
extra stuff title:*this is text
I want over multiple lines B* more stuff
yada
title:*this \t \t is\ttext   I \t\t\t  do  not  want*
title:*this is text I want C*
blah",
);

my $open_delim  =
my $close_delim = do {
    my $delim = '*';  # single delimiter character
    die "bad delimiter '$delim'" unless length($delim) == 1;
    quotemeta $delim;  # can be any character
    };
my $rx_intro = qr{ title: $open_delim    }xms;
my $rx_outro = qr{        $close_delim   }xms;
my $rx_body  = qr{      [^$close_delim]* }xms;
# print "$rx_intro $rx_body $rx_outro \n";

for my $filename (sort keys %mycorpus) {

    my $content = $mycorpus{$filename};

    my %titles;
    my $order;

    while ($content =~ m{ $rx_intro ($rx_body) $rx_outro }xmsg) {

        my $title  = $1;
        my $normal = normalize($title);

        @{ $titles{$normal} }{ qw(title order count) } =
            ($title, ++$order, ++$titles{$normal}{count});

        }

  # dd \%titles;

    print "$filename: '$_->{title}' \n" for
        sort   { $a->{order} <=> $b->{order} }
        grep   $_->{count} == 1,
        values %titles
        ;

    }

exit;

# subroutines ######################################################

sub normalize {

    my ($string,
        ) = @_;

    $string =~ tr{ \t\n}{ }s;  # squeeze spaces/tabs/newlines to 1 space
    $string =  lc $string;

    return $string;

    }

Output:

c:\@Work\Perl\monks\Maire>perl remove_dup_lines_1.pl
a: 'this is text I want 1'
b: 'this is text I want one'
c: 'this is text I want A'
c: 'this is text
I want over multiple lines B'
c: 'this is text I want C'
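
To illustrate the remarks that the extent of normalization is easily adjusted and that wrapping is easily turned off, here is one possible sketch. The normalize_with() function, its option names (trim, strip_punct, case_sensitive) and the one_line_titles() helper are all hypothetical names invented for this example; only the tr/// squeeze and the lc call come from the script above.

```perl
use warnings;
use strict;

# Hypothetical variant of normalize(); the %opt switches are assumptions
# added for illustration and are not in the original script.
sub normalize_with {
    my ($string, %opt) = @_;

    # Optionally delete everything except ASCII letters, digits and whitespace.
    $string =~ tr{a-zA-Z0-9 \t\n}{}cd if $opt{strip_punct};

    $string =~ tr{ \t\n}{ }s;                             # squeeze whitespace runs to 1 space
    $string =~ s{ \A [ ] | [ ] \z }{}xmsg if $opt{trim};  # drop a leading/trailing space
    $string =  lc $string unless $opt{case_sensitive};    # fold case unless told not to

    return $string;
}

# Excluding newline from the body means a title can no longer wrap.
my $close_delim      = quotemeta '*';
my $rx_body_one_line = qr{ [^$close_delim\n]* }xms;

# Hypothetical helper: collect only titles that fit on a single line.
sub one_line_titles {
    my ($text) = @_;

    my @titles;
    push @titles, $1
        while $text =~ m{ title: $close_delim ($rx_body_one_line) $close_delim }xmsg;
    return @titles;
}

print normalize_with("This  IS\tA\nTitle"), "\n";
print normalize_with(' Hello,   World! ', trim => 1, strip_punct => 1), "\n";
print "'$_'\n" for one_line_titles("title:*wrapped\ntitle* title:*on one line*");
```

With the one-line body pattern, a wrapped title is simply not recognized as a title at all, which may or may not be the desired behavior.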

Give a man a fish: <%-{-{-{-<

In reply to Re: Remove all duplicates after regex capture by AnomalousMonk
in thread Remove all duplicates after regex capture by Maire
