in reply to Remove all duplicates after regex capture

Edit: note the flaw pointed out by Hautex. I did not understand the question correctly, and did not check uniqueness of the correct title.

How about this loop?

    foreach my $filename (sort keys %mycorpus) {
        my $titles  = '';
        my $counter = 0;
        while ($mycorpus{$filename} =~ /title:#(.*?)#\s*$/gm) {
            if ($counter++) {
                last if $counter++;   # skip the rest of the matches
                # can also be used to print warnings about multiple titles
                # and check $1 against $titles if they are the same, or not
            }
            else {
                $titles = $1;         # first match, we can store it,
                print "$titles \n";   # or print it out
            }
        }
    }

the output is

    this is text I want 1
    this is text I want 2
    this is text I want 3

You can also replace the while with an if, so that only the first matching title is captured.

    foreach my $filename (sort keys %mycorpus) {
        my $titles = '';
        if ($mycorpus{$filename} =~ /title:#(.*?)#\s*$/m) {
            $titles = $1;
            print "$titles \n";
        }
    }

The output is the same. I think you wanted the /m (multiline) modifier, so that $ can match at the end of each line inside your file dump string.
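
To illustrate what /m changes, here is a minimal sketch. The sample string and the helper name first_title are made up for this example and are not taken from the original question. Without /m, $ only matches at the very end of the string (or just before a final newline), so a title on an earlier line is not found.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first title found in $text,
# or nothing if there is no match.
# $use_m toggles the /m modifier so both behaviours can be compared.
sub first_title {
    my ($text, $use_m) = @_;
    if ($use_m) {
        return $1 if $text =~ /title:#(.*?)#\s*$/m;
    }
    else {
        return $1 if $text =~ /title:#(.*?)#\s*$/;
    }
    return;
}

# Made-up sample: the title is on the first line, more text follows.
my $dump = "title:#this is text I want 1#\nsome other line\n";

my $with    = first_title($dump, 1);
my $without = first_title($dump, 0);

print "with /m:    ", defined $with    ? $with    : '(no match)', "\n";
print "without /m: ", defined $without ? $without : '(no match)', "\n";
```

With this sample, the /m version prints the title and the version without /m reports no match.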

Edit: restructured the loop to leave room for more post-processing (the comments note what can be done there). I also removed the /g (global) modifier from the "if" example, as it is not needed there.

Re^2: Remove all duplicates after regex capture
by Maire (Scribe) on Aug 19, 2018 at 10:17 UTC
    This works perfectly! Thank you very much for your help and time. I feel a bit daft after seeing how (relatively) simple the solution actually was, but I've learned a lot from your code here. Thanks! EDIT: Ah, thanks for the reworking! It was probably my sleep-deprived, incoherent question that caused the confusion! Thanks again.

      Note that FreeBeerReekingMonk's solution only works because it relies on the blah on the line "title:#this is text I do not want# blah", and it only grabs the first title:. If I remove the blah or reorder the lines in the third example, it does not work.
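
      A small sketch of the problem, using made-up input modelled on the line quoted above (the two sample strings and the helper name grab_first_title are assumptions for illustration, not the original data):

```perl
use strict;
use warnings;

# Same pattern as in the post above, wrapped in a hypothetical helper.
sub grab_first_title {
    my ($text) = @_;
    return $1 if $text =~ /title:#(.*?)#\s*$/m;
    return;
}

# Made-up input: the unwanted title is followed by trailing text ("blah"),
# so "#\s*$" cannot match on that line and the regex moves on.
my $with_blah = "title:#this is text I do not want# blah\n"
              . "title:#this is text I want 1#\n";

# Same input with the trailing "blah" removed: now the unwanted title
# is the first one that satisfies the pattern.
my $without_blah = "title:#this is text I do not want#\n"
                 . "title:#this is text I want 1#\n";

print grab_first_title($with_blah),    "\n";   # this is text I want 1
print grab_first_title($without_blah), "\n";   # this is text I do not want
```

      In other words, the pattern picks the wanted title only as long as the unwanted one happens to have something after its closing #.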

        Yes, that is true. My implementation is faulty.