cdherold has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

I'm trying to split up the following:

Company Says It Can Derive Stem Cells From the Placenta

By NICHOLAS WADE

A New Jersey company said that it had developed a method to extract a novel kind of stem cell from the placenta.

into distinct scalars for each section: title, author and abstract. I tried to capture each region into $1, $2 and $3 by matching from a word boundary (\b) up to three or more whitespace characters, using the following code:
if ($article =~ /\b(.*?)\s{3,}/gsm){ $title = $1; $author = $2; $abstract = $3}
I got the title out ($1) but got nothing for $2 and $3. I figure I'm doing something very simple wrong here, but don't know where to turn in my early days of learning Perl. Any help is warmly welcomed.

cdherold
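
A note on the symptom described above: the pattern contains only one pair of capturing parentheses, so a single successful match can only ever fill $1, and $2 and $3 have no groups to come from. The sketch below reproduces that, assuming (this is not stated in the question) that the article uses \r\n line endings, so each blank line is four whitespace characters and satisfies \s{3,}. The helper name first_match is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: DOS (\r\n) line endings, so each section break is "\r\n\r\n".
my $article = "Company Says It Can Derive Stem Cells From the Placenta\r\n\r\n"
            . "By NICHOLAS WADE\r\n\r\n"
            . "A New Jersey company said that it had developed a method to "
            . "extract a novel kind of stem cell from the placenta.\r\n";

# Hypothetical helper: apply the pattern from the question once and
# return whatever ended up in $1, $2 and $3.
sub first_match {
    my ($text) = @_;
    if ($text =~ /\b(.*?)\s{3,}/gsm) {
        return ($1, $2, $3);    # one group only, so $2 and $3 stay undef
    }
    return;
}

my ($title, $author, $abstract) = first_match($article);
for my $pair (['$1', $title], ['$2', $author], ['$3', $abstract]) {
    my ($name, $value) = @$pair;
    print "$name = ", (defined $value ? "[$value]" : 'undef'), "\n";
}
```

To fill three variables from one match, the pattern needs three capture groups, or the match has to be run in list context with /g so that each repetition contributes a capture.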

Re: Assigning text sections to scalars
by hdp (Beadle) on Apr 25, 2001 at 11:11 UTC
    The easiest way to do this is with split:

    ($title, $author, $abstract) = split /\n\n/, $article;

    If you insist on using a regex, try:

    ($title, $author, $abstract) = $article =~ /([^\n]+)(?:\n\n)?/g;

    When I tried your regex, I encountered two cases: 1) with no whitespace after the abstract, in which case I ended up with nothing at all, and 2) with \n\n after the abstract, which left me with the entire block of text in $title. I can't really explain how you could extract only the title but not the others.

    By the way, the reason I say to use split in this case is that you don't really care what's in the strings you're extracting -- you only care about what's between them (namely the \n\n), which is often a strong indication that split is the correct tool to use.

    Note that the regex requires a lot more punctuation and is in general harder to comprehend at a glance than the split, but the functionality is essentially the same.

    hdp.
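
The two approaches in the reply above can be put side by side in a small sketch. It assumes the sections are separated by exactly one blank line with Unix newlines and that there is no trailing newline; the array names @by_split and @by_regex are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input, assuming "\n\n" between sections and no trailing newline.
my $article = join "\n\n",
    'Company Says It Can Derive Stem Cells From the Placenta',
    'By NICHOLAS WADE',
    'A New Jersey company said that it had developed a method to '
        . 'extract a novel kind of stem cell from the placenta.';

# split: describe the separator and take whatever lies between.
my @by_split = split /\n\n/, $article;

# Regex in list context with /g: describe the fields and skip the separator.
my @by_regex = $article =~ /([^\n]+)(?:\n\n)?/g;

print "split: [$_]\n" for @by_split;
print "regex: [$_]\n" for @by_regex;
```

The two only agree while no section contains a newline of its own: the regex would cut a field at a single \n, whereas split /\n\n/ would not.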

      I agree with you, I would also use split. It's cleaner.

      To make it even safer (minor points):

      ($title, $author, $abstract, undef) = split /\n\s*\n+/, $article;
      The \s* construction catches invisible spaces and tabs, whereas \n+ catches faulty triple (or more) newlines.

      The undef makes sure that additional text doesn't screw up the abstract.

      Jeroen
      "We are not alone"(FZ)

        The undef is not necessary, as per split's documentation:

        When assigning to a list, if LIMIT is omitted, Perl supplies a LIMIT one larger than the number of variables in the list, to avoid unnecessary work.

        Essentially, this means that all the extra text you're worried about gets assigned to that nonexistent fourth variable in the list on the left hand side.

        Good thinking, but Perl beat you to it.

        hdp.
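
The implicit LIMIT described above can be checked with a short sketch. The sample text and the variable names are made up; the fourth section stands in for the "additional text", and the separator is the more tolerant pattern suggested earlier in this subthread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up input: four sections, with stray whitespace and extra newlines
# in the separators.
my $article = "Title\n\nBy AUTHOR\n \t\n\nAbstract text.\n\n\nTrailing extra text.";

# Three variables and no LIMIT: Perl splits as if LIMIT were 4, so the
# remainder lands in a fourth field that is simply discarded.
my ($title, $author, $abstract) = split /\n\s*\n+/, $article;

# With an explicit LIMIT of 3, the remainder stays glued to the third field.
my ($title3, $author3, $abstract3) = split /\n\s*\n+/, $article, 3;

print "implicit LIMIT: [$abstract]\n";
print "LIMIT of 3:     [$abstract3]\n";
```

Only the explicit LIMIT of 3 lets the extra text leak into the abstract, which is why the trailing undef is not needed in the plain three-variable assignment.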

Re: Assigning text sections to scalars
by Rhandom (Curate) on Apr 25, 2001 at 10:13 UTC
    Assuming your sections are separated by double newlines (with optional DOS encoding):
    #!/usr/bin/perl
    my $txt = "Company Says It Can Derive Stem Cells From the Placenta\n\n"
            . "By NICHOLAS WADE\n\n"
            . "A New Jersey company said that it had developed a method to extract a "
            . "novel kind of stem cell from the placenta.\n";
    if ( $txt =~ /^(.*?)\n\r?\n(.*?)\n\r?\n(.+)$/s ) {
        print "[$1] [$2] [$3]\n";
    }


    my @a=qw(random brilliant braindead); print $a[rand(@a)];
Re: Assigning text sections to scalars
by Anonymous Monk on Apr 25, 2001 at 13:40 UTC
    Make sure your variables $author and $abstract are declared globally; otherwise they will be set inside the if {} block but not outside it. You can easily check whether this is the problem by putting print "$author $abstract\n" inside the if statement. If that prints values, that's the problem.
      You must be thinking of a different language.

      Under Perl's scoping rules, undeclared variables are package globals by default, so a value assigned inside the if block is still visible outside it. For full details see Dominus' excellent article Coping with Scoping.
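
A minimal sketch of the scoping point, run deliberately without use strict (under strict 'vars' the undeclared variable would be a compile-time error). The sample text is made up; $author is left undeclared, while $abstract is declared with my inside the block for contrast.

```perl
#!/usr/bin/perl
# Deliberately no "use strict", so undeclared variables are allowed.
no strict;

my $article = "Title\n\nBy AUTHOR\n\nAbstract.";

if ($article =~ /^(.*?)\n\n(.*?)\n\n(.+)$/s) {
    $author = $2;         # undeclared: package global, i.e. $main::author
    my $abstract = $3;    # lexical: exists only inside this block
}

# Out here $author is still set; $abstract now names the (unset) package
# variable $main::abstract, because the lexical has gone out of scope.
print "author outside the block:   $author\n";
print "abstract outside the block: ",
      (defined $abstract ? $abstract : 'undef'), "\n";
```

So a variable that vanishes after the block points to a my declaration inside it, not to a missing global declaration.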