Re: Remove all duplicates after regex capture

Edit: note the flaw pointed out by haukex. I did not understand the question correctly, and did not check the uniqueness of the correct title.

How about this loop?

foreach my $filename (sort keys %mycorpus) {
  my $titles = '';
  my $counter = 0;
  while ($mycorpus{$filename} =~ /title:#(.*?)#\s*$/gm) {
    if ($counter++) {
      # second match: this is the place to warn about multiple titles,
      # or to check whether $1 is the same as $titles
      last; # skip the rest of the matches
    } else {
      $titles = $1;        # first match, we can store it,
      print "$titles \n";  # or print it out
    }
  }
}

The output is:

this is text I want 1 
this is text I want 2 
this is text I want 3

You can also replace the while with an if, and then it just matches the first title:#...# line.

foreach my $filename (sort keys %mycorpus) {
  my $titles = '';
  if ($mycorpus{$filename} =~ /title:#(.*?)#\s*$/m) {
    $titles = $1;
    print "$titles \n";
  }
}

The output is the same. I think what you wanted was the /m (multiline) regexp modifier, so that $ can match at the end of each line inside your file dump string instead of only at the end of the whole string.
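
To illustrate the /m point, here is a minimal sketch. The $dump string is made up for this example and is not the original data; it only assumes that the title sits on a line of its own in the middle of the string.

```perl
use strict;
use warnings;

# Hypothetical file dump: several lines held in one string.
my $dump = "id:#42#\ntitle:#first title#\nbody:#some text#\n";

# Without /m, $ matches only at the very end of the string
# (or just before a final newline), so the title line is not matched.
my ($without_m) = $dump =~ /title:#(.*?)#\s*$/;

# With /m, $ also matches just before every embedded newline.
my ($with_m) = $dump =~ /title:#(.*?)#\s*$/m;

print "without /m: ", (defined $without_m ? $without_m : "no match"), "\n";
print "with /m: $with_m\n";
```

In list context a failed match returns an empty list, so $without_m stays undefined here, while $with_m holds the captured title.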

Edit: restructured the loop to allow more post-processing (the comments say what can be done there). I also removed the /g (global) modifier in the "if" example, as it is not needed there.


Re^2: Remove all duplicates after regex capture by Maire (Scribe) on Aug 19, 2018 at 10:17 UTC
This works perfectly! Thank you very much for your help/time. I feel a bit daft after seeing how (relatively) simple the solution actually was, but I've learned a lot from your code here. Thanks! EDIT: Ah, thanks for the reworking! It was probably my sleep-deprived, incoherent question that caused the confusion! Thanks again.
Re^3: Remove all duplicates after regex capture by haukex (Archbishop) on Aug 19, 2018 at 10:49 UTC
Note that FreeBeerReekingMonk's solution only works because it relies on the `blah` on the line `"title:#this is text I do not want# blah"`, and it only grabs the first `title:`. If I remove the `blah` or reorder the lines in the third example, it does not work.
Re^4: Remove all duplicates after regex capture by FreeBeerReekingMonk (Deacon) on Aug 19, 2018 at 19:28 UTC
Yes, that is true. My implementation is faulty.
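
One possible way to stop depending on which match comes first is to remember the titles already seen in a hash. This is only a sketch: the %mycorpus contents below are invented for illustration, and it assumes the goal is to print each distinct title once per file. It does not filter out an unwanted title such as the "do not want" line; that would still need a rule of its own.

```perl
use strict;
use warnings;

# Hypothetical corpus; file names and contents are made up for this example.
my %mycorpus = (
    'a.txt' => "title:#this is text I want 1#\nbody\ntitle:#this is text I want 1#\n",
    'b.txt' => "title:#this is text I want 2#\ntitle:#this is text I want 2#\n",
);

my @unique_titles;
foreach my $filename (sort keys %mycorpus) {
    my %seen;    # titles already collected for this file
    while ($mycorpus{$filename} =~ /title:#(.*?)#\s*$/gm) {
        # keep a title only the first time it shows up in this file
        push @unique_titles, $1 unless $seen{$1}++;
    }
}

print "$_\n" for @unique_titles;
```

Declaring %seen inside the foreach resets it for every file, so the same title may still appear once for each file that contains it.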