Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I'm just trying to put together a small prototype for some fun and learning on text mining/parsing, but I've come a cropper on regexing a larger block than I normally do (usually simple finding tasks or lookahead/lookbehind). I've got a series of texts where the only common point is Sidenote: <blah> at the beginning of each text, which I want to use as the splitting point so that I can turn the whole file into a series of XML objects (in due course) to use later.
When I print out I get
but I had thought that a pattern match as above would give me everything starting with [Sidenote: in one div then start a new div for the next [Sidenote: but clearly there is something that I have misunderstood in the tutorials.
    use strict;
    use warnings;

    my $text = "C:\\letters.txt";
    my @letters;
    open(IN, $text) || die "Can't open $text";
    @letters = <IN>;
    close(IN);
    chomp @letters;

    foreach my $indiv_note (@letters) {
        my $letter_text = ($indiv_note =~ /(^\[Sidenote (?: (?!^\[Sidenote). + )* )/);
        print "<div>$letter_text</div>\n";
    }
The sort of data that I am using is:

    [Sidenote: The same.]
    _Sunday Evening._
    * * * * *
    I have at this moment got Pickwick and his friends on the Rochester coach, and they are going on swimmingly, in company with a very different character from any I have yet described, who I flatter myself
    [Sidenote: Miss Hogarth.]
    FURNIVAL'S INN, _Wednesday Evening, 1835._
    MY DEAREST KATE,
    The House is up; but I am very sorry to say that I must stay at home. I have had a visit from the publishers this morning, and the story cannot

Re: Regexing a block of text in between patterns
by ikegami (Patriarch) on Mar 10, 2009 at 14:36 UTC
    use strict;
    use warnings;

    my $text = "C:\\letters.txt";
    open(my $fh_in, '<', $text)
        or die("Can't open $text: $!\n");

    my $open = 0;    # true while a <div> is waiting to be closed
    while (<$fh_in>) {
        # The sample data writes the marker as "[Sidenote:" (with a colon).
        if (/^\[Sidenote:/) {
            print("</div>\n") if $open;
            print("<div>\n");
            $open = 1;
        }
        print;
    }
    print("</div>\n") if $open;
Re: Regexing a block of text in between patterns
by mwah (Hermit) on Mar 10, 2009 at 14:39 UTC

    You could split on 'Sidenote':

    ...
    my $filename = 'C:\letters.txt';
    open my $fh, '<', $filename or die "Can't read $filename, $!";
    my $text = do { local $/; <$fh> };
    close $fh;

    my @sidenotes = split /(?=\[Sidenote:)/, $text;
    my $div_output = join "\n", map "<div>$_</div>", @sidenotes;
    print $div_output;
    ...

    Regards

    mwa

      Or rather
      my @sidenotes = split /\[Sidenote:.*?\]/, $text;
      Otherwise you will leave some junk at the beginning of each block.

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        Hello CountZero

        "Otherwise you will leave some junk at the beginning of each block"

        I purposely included the [Sidenote] text after understanding the OP's specification this way.

        Maybe I didn't read closely enough ...

        Thanks & Regards

        mwa
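
The difference between the two split patterns discussed above can be seen side by side in a small sketch. The sample string and the two subroutine names are made up for illustration and are not from either post; the `grep` that drops empty chunks is also an addition, needed because a consuming delimiter at the very start of the text yields an empty leading field.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the contents of letters.txt.
my $text = "[Sidenote: The same.]\n_Sunday Evening._\n"
         . "[Sidenote: Miss Hogarth.]\nFURNIVAL'S INN\n";

# Zero-width lookahead: split just before each marker, so every
# chunk still begins with its own [Sidenote: ...] heading.
sub split_keeping_heading {
    my ($input) = @_;
    return split /(?=\[Sidenote:)/, $input;
}

# Consuming pattern: the marker itself is the delimiter and is discarded.
# A marker at the start of the text produces an empty first field,
# which the grep removes.
sub split_dropping_heading {
    my ($input) = @_;
    return grep { length } split /\[Sidenote:.*?\]/, $input;
}

print "<div>$_</div>\n" for split_keeping_heading($text);
print "<div>$_</div>\n" for split_dropping_heading($text);
```

Which variant fits depends on whether the sidenote text should survive into the output: the lookahead form keeps it as the first line of each block, while the consuming form leaves only the letter body.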

Re: Regexing a block of text in between patterns
by codeacrobat (Chaplain) on Mar 10, 2009 at 16:22 UTC
    How about a oneliner and no regexes :-)
    perl -lne "print $first++? q{</div><div>}:q{<div>} and next if 0==index $_, q{[Sidenote}; print; print q{</div>} if eof" letters.txt

    print+qq(\L@{[ref\&@]}@{['@'x7^'!#2/"!4']});
Re: Regexing a block of text in between patterns
by kennethk (Abbot) on Mar 10, 2009 at 14:40 UTC
    I'm having a little trouble understanding your goals here, so correct me if the following code does not address your issue. I am reading this as you wish to locate all occurrences of [Sidenote: *] and extract the unique text, displaying it surrounded by <div> tags. A (re)read of perlretut might be useful for you, but the following code does what I describe above.

    use strict;
    use warnings;

    my $text = "C:\\letters.txt";
    my @letters;
    open IN, '<', $text or die "Can't open $text";
    @letters = <IN>;
    close(IN);
    chomp @letters;

    foreach my $indiv_note (@letters) {
        if ($indiv_note =~ /\[Sidenote\:\s(.*?)\]/) {
            print "<div>$1</div>\n";
        }
    }

    Note that, the way I have written this, sidenotes cannot cross line boundaries, though supporting that would be fairly trivial. Also note that if you want to include [ and ] in your posts, you should use HTML entities, i.e. &#91; and &#93;
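
One possible way to lift the line-boundary restriction, offered only as a sketch: match against the whole text at once instead of line by line, with the `/s` modifier so that `.` also matches newlines. The subroutine name `extract_sidenotes` and the sample string are hypothetical; in practice the text could be slurped from the file with the `do { local $/; <$fh> }` idiom shown elsewhere in this thread.

```perl
use strict;
use warnings;

# Return the text inside every [Sidenote: ...] marker, even when a
# marker is wrapped across several lines.
sub extract_sidenotes {
    my ($text) = @_;
    my @notes;
    # /s lets . match newlines; /g walks through every marker in turn.
    while ($text =~ /\[Sidenote:\s+(.*?)\]/gs) {
        my $note = $1;
        $note =~ s/\s+/ /g;    # collapse any line break inside the note to one space
        push @notes, $note;
    }
    return @notes;
}

# Hypothetical sample; the first marker is deliberately split over two lines.
my $sample = "[Sidenote: The\nsame.]\n_Sunday Evening._\n"
           . "[Sidenote: Miss Hogarth.]\n";

print "<div>$_</div>\n" for extract_sidenotes($sample);
```

With the sample above this prints `<div>The same.</div>` followed by `<div>Miss Hogarth.</div>`. The non-greedy `.*?` stops at the first closing bracket, so a sidenote that itself contains `]` would be cut short.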