This is essentially the same approach haukex uses, but with some different features:

Script remove_dup_lines_1.pl:
use warnings;
use strict;

use Data::Dump qw(dd);

my %mycorpus = (
    a => "<blah blah blah blah title:*this is text I want 1* blah blah blah",
    b => "blah title:*this is text I do not want* title:*this is text I want one* blah title:*this is text I do not want* blah",
    c => "blah blah title:*this is text I do not want* title:*this is also text I do not want* title:*this is text I want A* title:*This Is Text I Do Not Want* title:*this is ALSO text I DO NOT WANT* extra stuff title:*this is text I want over multiple lines B* more stuff yada title:*this \t \t is\ttext I \t\t\t do not want* title:*this is text I want C* blah",
    );

my $open_delim = my $close_delim = do {
    my $delim = '*';  # single delimiter character
    die "bad delimiter '$delim'" unless length($delim) == 1;
    quotemeta $delim;  # can be any character
    };

my $rx_intro = qr{ title: $open_delim }xms;
my $rx_outro = qr{ $close_delim }xms;
my $rx_body  = qr{ [^$close_delim]* }xms;
# print "$rx_intro $rx_body $rx_outro \n";

for my $filename (sort keys %mycorpus) {
    my $content = $mycorpus{$filename};

    my %titles;
    my $order;
    while ($content =~ m{ $rx_intro ($rx_body) $rx_outro }xmsg) {
        my $title = $1;
        my $normal = normalize($title);
        @{ $titles{$normal} }{ qw(title order count) } =
            ($title, ++$order, ++$titles{$normal}{count});
        }
    # dd \%titles;

    print "$filename: '$_->{title}' \n" for
        sort { $a->{order} <=> $b->{order} }
        grep $_->{count} == 1,
        values %titles
        ;
    }

exit;

# subroutines ######################################################

sub normalize {
    my ($string, ) = @_;

    $string =~ tr{ \t\n}{ }s;  # squeeze spaces/tabs/newlines to 1 space
    $string = lc $string;
    return $string;
    }
Output:
c:\@Work\Perl\monks\Maire>perl remove_dup_lines_1.pl
a: 'this is text I want 1'
b: 'this is text I want one'
c: 'this is text I want A'
c: 'this is text I want over multiple lines B'
c: 'this is text I want C'
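
For anyone who wants to try the counting step on its own, here is a minimal sketch of the same idea applied to a plain list of already-extracted titles. The sub name unique_titles and the sample strings are made up for illustration and are not part of the script above. It also differs slightly from the script: it keeps the first-seen spelling and position of each title rather than overwriting them on every match, and it uses //=, so it assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the script above): return the titles that occur
# exactly once after normalization, in order of first appearance.
sub unique_titles {
    my @titles = @_;

    my (%seen, $order);
    for my $title (@titles) {
        # Normalize: fold case, then squeeze runs of space/tab/newline to one space.
        (my $normal = lc $title) =~ tr{ \t\n}{ }s;

        # Create the record on first sight only; count every sighting.
        my $entry = $seen{$normal} //=
            { title => $title, order => ++$order, count => 0 };
        $entry->{count}++;
    }

    return map  { $_->{title} }
           sort { $a->{order} <=> $b->{order} }
           grep { $_->{count} == 1 }
           values %seen;
}

print "'$_'\n" for unique_titles('keep me', 'Drop  Me', "drop\tme", 'keep me too');
```

With the sample list, 'Drop  Me' and "drop\tme" normalize to the same key and are both dropped, so only 'keep me' and 'keep me too' are printed.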


Give a man a fish:  <%-{-{-{-<


In reply to Re: Remove all duplicates after regex capture by AnomalousMonk
in thread Remove all duplicates after regex capture by Maire
