Regexing a block of text in between patterns

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I'm just trying to put together a small prototype for some fun and learning on text mining/parsing, but I've come a cropper on regexing a larger block than I normally do (usually simple finding tasks or look ahead/look behind). I've got a series of texts where the only common point is Sidenote: <blah> at the beginning of each text, which I want to use as the splitting point so that I can turn the whole file into a series of XML objects (in due course) to use later.
When I print out I get

but I had thought that a pattern match as above would give me everything starting with [Sidenote: in one div then start a new div for the next [Sidenote: but clearly there is something that I have misunderstood in the tutorials.

use strict;
use warnings;

my $text = "C:\\letters.txt";

my @letters;
open(IN, $text) || die "Can't open $text";
@letters = <IN>;
close(IN);
chomp @letters;

foreach my $indiv_note (@letters) {
  my $letter_text = ($indiv_note =~ /(^\[Sidenote (?: (?!^\[Sidenote). )* )/);
  print "<div>$letter_text</div>\n";
}

The sort of data that I am using is:

[Sidenote: The same.]

                                                     _Sunday Evening._

       *       *       *       *       *

I have at this moment got Pickwick and his friends on the Rochester
coach, and they are going on swimmingly, in company with a very
different character from any I have yet described, who I flatter myself

[Sidenote: Miss Hogarth.]

                            FURNIVAL'S INN, _Wednesday Evening, 1835._

MY DEAREST KATE,

The House is up; but I am very sorry to say that I must stay at home. I
have had a visit from the publishers this morning, and the story cannot


Replies are listed 'Best First'.
Re: Regexing a block of text in between patterns by ikegami (Patriarch) on Mar 10, 2009 at 14:36 UTC
use strict;
use warnings;

my $text = "C:\\letters.txt";

open(my $fh_in, '<', $text)
    or die("Can't open $text: $!\n");

my $open = 0;    # true once the first <div> has been emitted
while (<$fh_in>) {
    # A line starting with "[Sidenote:" closes the previous block
    # (if any) and opens a new one.
    if (/^\[Sidenote:/) {
        print("</div>\n") if $open;
        print("<div>\n");
        $open = 1;
    }

    print;
}

print("</div>\n") if $open;
Re: Regexing a block of text in between patterns by mwah (Hermit) on Mar 10, 2009 at 14:39 UTC
You could split on 'Sidenote':

...
my $filename = 'C:\letters.txt';
open my $fh, '<', $filename or die "Can't read $filename, $!";
my $text = do { local $/; <$fh> };
close $fh;

my @sidenotes = split /(?=\[Sidenote:)/, $text;
my $div_output = join "\n", map "<div>$_</div>", @sidenotes;
print $div_output;
...

Regards
mwa
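
To see what the lookahead split returns, here is a minimal sketch on an in-memory string. The sample text and the helper name `split_keeping_marker` are made up for illustration and are not part of the post above. Because `(?=...)` matches a position rather than any characters, each block should keep its own `[Sidenote:` marker.

```perl
use strict;
use warnings;

# Hypothetical helper: split slurped text into blocks, each beginning
# at a "[Sidenote:" marker. The lookahead consumes nothing, so the
# marker stays at the front of its block.
sub split_keeping_marker {
    my ($text) = @_;
    return split /(?=\[Sidenote:)/, $text;
}

# Made-up sample standing in for the contents of letters.txt.
my $sample = "[Sidenote: The same.]\nfirst letter\n"
           . "[Sidenote: Miss Hogarth.]\nsecond letter\n";

my @blocks = split_keeping_marker($sample);
print "<div>$_</div>\n" for @blocks;
```

A zero-length match at the very start of the string does not produce an empty leading field, so a file that begins with a marker yields one block per marker.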
Re^2: Regexing a block of text in between patterns by CountZero (Bishop) on Mar 10, 2009 at 17:15 UTC
Or rather

my @sidenotes = split /\[Sidenote:.*?\]/, $text;

Otherwise you will leave some junk at the beginning of each block.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
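
A small sketch of what consuming the marker does, again on a made-up sample that is assumed to start with a marker. Since the marker now has positive width, the (empty) text before the first marker comes back as a leading empty field. Adding a capturing group is one possible way to drop the brackets but still keep the sidenote label; that variant is an illustration, not something proposed in the thread.

```perl
use strict;
use warnings;

# Made-up sample; assumed to begin with a marker.
my $sample = "[Sidenote: The same.]\nfirst letter\n"
           . "[Sidenote: Miss Hogarth.]\nsecond letter\n";

# Consuming the marker removes it from every block, but the text before
# the first marker (here: nothing) is returned as an empty first field.
my @bodies = split /\[Sidenote:.*?\]/, $sample;
shift @bodies if @bodies && $bodies[0] eq '';

# With a capturing group, split also returns what the group matched,
# so labels and bodies alternate in the result.
my @parts = split /\[Sidenote:\s*(.*?)\]/, $sample;
shift @parts if @parts && $parts[0] eq '';

my @pairs;
while (my ($label, $body) = splice @parts, 0, 2) {
    push @pairs, [ $label, $body ];
}

for my $pair (@pairs) {
    my ($label, $body) = @$pair;
    print "<div>\n$label\n$body</div>\n";
}
```

If the file has text before the first marker, the pairing above would be off by one, so that case would need handling separately.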
Re^3: Regexing a block of text in between patterns by mwah (Hermit) on Mar 10, 2009 at 18:03 UTC
Hello CountZero

"Otherwise you will leave some junk at the beginning of each block"

I purposely included the [Sidenote] text, because that is how I understood the OP's specification. Maybe I didn't read closely enough ...

Thanks & Regards
mwa
Re^4: Regexing a block of text in between patterns by CountZero (Bishop) on Mar 10, 2009 at 19:42 UTC
Re: Regexing a block of text in between patterns by codeacrobat (Chaplain) on Mar 10, 2009 at 16:22 UTC
How about a oneliner and no regexes :-)

perl -lne "print $first++? q{</div><div>}:q{<div>} and next if 0==index $_, q{[Sidenote}; print; print q{</div>} if eof" letters.txt

print+qq(\L@{[ref\&@]}@{['@'x7^'!#2/"!4']});
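
For readers who find the one-liner dense, here is a spelled-out variant of the same idea as a script. It is a sketch rather than an exact translation: the helper name `wrap_in_divs` is hypothetical, it works on a list of already chomped lines, and it only emits the final `</div>` if a block was actually opened.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each run of lines that follows a
# "[Sidenote" line in <div> ... </div>, using index() instead of a regex.
sub wrap_in_divs {
    my @lines = @_;
    my @out;
    my $open = 0;
    for my $line (@lines) {
        # index() returns 0 when the line starts with the marker.
        if (index($line, '[Sidenote') == 0) {
            push @out, '</div>' if $open;
            push @out, '<div>';
            $open = 1;
            next;    # as in the one-liner, the marker line itself is skipped
        }
        push @out, $line;
    }
    push @out, '</div>' if $open;
    return @out;
}

# Made-up sample lines.
my @result = wrap_in_divs(
    '[Sidenote: The same.]',     'first letter',
    '[Sidenote: Miss Hogarth.]', 'second letter',
);
print "$_\n" for @result;
```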
Re: Regexing a block of text in between patterns by kennethk (Abbot) on Mar 10, 2009 at 14:40 UTC
I'm having a little trouble understanding your goals here, so correct me if the following code does not address your issue. I am reading this as: you wish to locate all occurrences of [Sidenote: ] and extract the unique text, displaying it surrounded by <div> tags. A (re)read of perlretut might be useful for you, but the following code does what I describe above.

use strict;
use warnings;

my $text = "C:\\letters.txt";

my @letters;
open IN, '<', $text or die "Can't open $text";
@letters = <IN>;
close(IN);
chomp @letters;

foreach my $indiv_note (@letters) {
    # Capture the text between "[Sidenote: " and the closing "]".
    if ($indiv_note =~ /\[Sidenote\:\s(.*?)\]/) {
        print "<div>$1</div>\n";
    }
}

Note that the way I have written this, sidenotes cannot cross line boundaries, though allowing that would be fairly trivial. Also note that if you want to include [ and ] in your posts, you should use HTML entities, i.e. `&#91;` and `&#93;`.
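
One possible way to let a sidenote cross a line boundary is to match against the whole text at once instead of line by line. The sketch below assumes the file has already been slurped into a string; the helper name `sidenote_labels` and the sample are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every [Sidenote: ...] marker,
# even when the marker is wrapped across lines.
sub sidenote_labels {
    my ($text) = @_;
    # /s lets . match a newline; /g in list context returns every capture.
    my @labels = $text =~ /\[Sidenote:\s*(.*?)\]/gs;
    s/\s+/ /g for @labels;    # collapse a line break inside a wrapped label
    return @labels;
}

# Made-up sample in which the second marker is split over two lines.
my $sample = "[Sidenote: The same.]\ntext\n[Sidenote: Miss\nHogarth.]\nmore\n";
print "<div>$_</div>\n" for sidenote_labels($sample);
```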