Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a file that looks like this:

__FRANK LICHTENBERG, Columbia University _[4]Pharmaceutical Knowledge-Capital Accumulation and Longevity_ BARBARA FRAUMENI and SUMIYE OKUBO, Bureau of Economic Analysis _[5]R&D in the National Income and Product Accounts: A First Look at its impact on GDP_

and I want to grab out the titles. Each title begins right after the [4] or [5] marker and ends with a "_" char. In this example, I want to grab out

Pharmaceutical Knowledge-Capital Accumulation and Longevity

and

R&D in the National Income and Product Accounts: A First Look at its impact on GDP

This only matches the 1-liner number 4:

    my (@tmp, @titles);
    while (<IN>) {
        if ( my @tmp = m/\_\[\d\](.+)_/g ) {
            push @titles, @tmp;
        }
    }

How can I write a reg-exp to match across newlines in some but not all instances? I want to get both titles into my @titles array.

thanks

Replies are listed 'Best First'.
Re: how to reg-exp match across multiple lines?
by japhy (Canon) on Jul 08, 2002 at 20:27 UTC
    The primary problem is that you're only reading one line from the file at a time. That's why it breaks for multi-line entries. You have a couple of solutions:
        # read the whole file at once
        # XXX: updated (thanks, jmcnamara and VSarkiss)
        { local $/; $file = <IN>; }
        @titles = $file =~ /_\[\d]([^_]*)/g;
    or you could employ some logic:
        # read more lines as needed
        while (<IN>) {
            if (/_\[\d]([^_]*)_/) {
                push @titles, $1;
            }
            elsif (/_\[\d]([^_]*)/) {
                push @titles, $1;
                while (<IN>) {
                    if (/([^_]*)/) { $titles[-1] .= $1 }
                    last if /_/;
                }
            }
        }
    I think the first approach is easier to understand, generally.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
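
For illustration, the slurp-and-match approach can be sketched as a self-contained script. The in-memory sample text, the place where the second title wraps, and the variable names are assumptions made for this example; they are not taken from the original file. The sketch also requires the closing "_" and collapses embedded line breaks, which is one possible choice rather than the only one.

```perl
use strict;
use warnings;

# Hypothetical sample: the second title is assumed to wrap onto a new line.
my $text = <<'END';
__FRANK LICHTENBERG, Columbia University
_[4]Pharmaceutical Knowledge-Capital Accumulation and Longevity_
BARBARA FRAUMENI and SUMIYE OKUBO, Bureau of Economic Analysis
_[5]R&D in the National Income and Product Accounts:
A First Look at its impact on GDP_
END

# Open the string as a filehandle so the example needs no file on disk.
open my $in, '<', \$text or die "Cannot open in-memory file: $!";

# Slurp: with $/ undefined, a single read returns the whole input.
my $file = do { local $/; <$in> };
close $in;

# [^_]* also matches newlines, so a title may span several lines.
my @titles = $file =~ /_\[\d+\]([^_]*)_/g;

# Collapse each embedded line break into a single space.
s/\s*\n\s*/ /g for @titles;

print "$_\n" for @titles;
```

Because the character class [^_] matches newlines too, no /s modifier is needed here; a pattern built on (.+?) would need /s for the dot to cross line boundaries.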

Re: how to reg-exp match across multiple lines?
by joealba (Hermit) on Jul 08, 2002 at 21:00 UTC
        my $file = join '', <IN>;
        my @titles = $file =~ /_\[\d+\](.+?)_/sg;
      Avoid join "", <FH>. It has to separate the file into records first and build a list from them, only to then glue that list back together into a string. The universally applicable approach is what japhy already posted:
          { local $/ = undef; $file = <FH> }
      This temporarily disables the input record separator, so that a single read will gobble the entire file. Another approach, which I like even better because it is less "noisy" but which is not universally applicable:
          sysread FH, $file, -s $filename;
      This only works when you know the filename and have the filehandle positioned at the start of the file, but it is the ideal approach when these prerequisites are fulfilled.

      Makeshifts last the longest.

        -s can take a filehandle as argument as well.
            open my $fh => "/etc/motd";
            print -s $fh, "\n";
            __END__
            13

            $ wc -l /etc/motd
            13

        Abigail
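
The two slurping idioms discussed in this subthread can be compared side by side in a rough sketch. The temporary file, its contents, and all variable names are hypothetical and exist only so the example runs on its own; the sysread variant takes the size from the filehandle, as suggested above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical input file, created only so the example can run.
my ($tmp, $filename) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "_[4]First title_\n_[5]Second\ntitle_\n";
close $tmp or die "Cannot close $filename: $!";

# Idiom 1: undefine the input record separator for a single read.
open my $fh, '<', $filename or die "Cannot open $filename: $!";
binmode $fh;
my $slurped = do { local $/; <$fh> };
close $fh;

# Idiom 2: one sysread, sized by -s applied to the filehandle itself.
open $fh, '<', $filename or die "Cannot open $filename: $!";
binmode $fh;
my $size = -s $fh;
my $got  = sysread $fh, my $raw, $size;
defined $got or die "sysread failed: $!";
close $fh;

print "both reads agree\n" if $slurped eq $raw;
```

sysread is allowed to return fewer bytes than requested, so careful code would check $got against $size; for a small regular file a single call is normally enough.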

Re: how to reg-exp match across multiple lines?
by TexasTess (Beadle) on Jul 08, 2002 at 23:18 UTC
    Is there NEVER a chance that a line can begin with a _ and not at some point have a closing _? I mean, do ONLY the titles you're interested in start with _, so that any time you have a leading _ you'll also have an ending _?

    TexasTess
    "Great Spirits Often Encounter Violent Opposition From Mediocre Minds" --Albert Einstein