stan2004 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all! I'm a newbie with Perl regular expressions and I'm confused: can anyone help me with some good advice? I have a text file. How can I split all of its text into pieces? For example:

    =======================
    this a text file
    -----------------------
    This is a test 1. This is a 2 test. This is a last test.
    =======================

I want to get lexems like these:

    This is a test 1. --->>> one lexem.
    This is a 2 test. --->>> second lexem.
    This is a last test. ---->>> third lexem.

and so on.... How can I do this through regular expressions?

P.S. Sorry for my bad English; it is not my native language.

20040712 Edit by ysth: change pre tags to code tags

Replies are listed 'Best First'.
Re: Regular expressions....
by dragonchild (Archbishop) on Jul 12, 2004 at 17:39 UTC
    This smells like homework.

    Here are a few hints:

    • You'll want your regular expression to span multiple lines. This means you'll want the '.' special character to match against newlines. (This requires an option to be added to the regex.)
    • Figure out what the "regular" parts of your "expression" are. For example, how would you tell a person who doesn't understand the example given how to solve the problem? That's how you have to tell the computer.
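A minimal sketch of the first hint, assuming the option meant is the `/s` modifier (which lets `.` match a newline). The sample string is made up for illustration and is not the poster's file:

```perl
use strict;
use warnings;

# Made-up sample: the first sentence is wrapped across two lines.
my $text = "This is a\ntest 1. This is a 2 test.";

# Without /s, '.' cannot cross the newline, so the first match
# that succeeds starts on the second line.
my ($without_s) = $text =~ /(.+?\.)/;

# With /s, '.' also matches "\n", so the match spans both lines.
my ($with_s) = $text =~ /(.+?\.)/s;

print "without /s: [$without_s]\n";
print "with /s:    [$with_s]\n";
```

Without `/s` the capture is only `test 1.`; with `/s` it is the whole wrapped sentence, newline included.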

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: Regular expressions....
by Roy Johnson (Monsignor) on Jul 12, 2004 at 20:31 UTC
    You don't need regular expressions (except maybe to throw away the header lines).
    $/ = '.';    # Records are terminated by dots
    my $lnum = 0;
    while (<>) {
        tr/\n//d;    # Get rid of any embedded newlines
        ++$lnum;
        print "$_ --->>> record $lnum\n";
    }

    We're not really tightening our belts, it just feels that way because we're getting fatter.
      tr/\n//d; # Get rid of any embedded newlines
      You should really replace "\n" with a plain space character, so you don't get words like "isa" when you want "is a".
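A sketch of that suggestion, turning each newline into a space instead of deleting it. The `records` helper, the in-memory filehandle, and the trimming step are assumptions added here so the example runs on its own; they are not part of the code above:

```perl
use strict;
use warnings;

# Hypothetical helper: return the dot-terminated records found in $text.
sub records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = '.';                  # records are terminated by dots
    my @records;
    while (my $rec = <$fh>) {
        $rec =~ tr/\n/ /;            # newline becomes a space, so "is\na" stays "is a"
        $rec =~ s/^\s+|\s+$//g;      # trim the whitespace left around each record
        push @records, $rec if length $rec;
    }
    return @records;
}

my $lnum = 0;
for my $rec (records("This is\na test 1. This is a 2 test.\nThis is a last test.\n")) {
    ++$lnum;
    print "$rec --->>> record $lnum\n";
}
```

The header lines of the real file would still need to be skipped separately.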
Re: Regular expressions....
by graff (Chancellor) on Jul 13, 2004 at 02:30 UTC
    You're talking about sentence boundary detection. In some writing forms (e.g. Chinese), the end-of-sentence marker is unambiguous, and you can just read whole files or whole paragraphs into one scalar variable and use "split" with the distinctive end-of-sentence character (or use that character as the input record separator $/).
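A minimal sketch of the unambiguous case, assuming the text uses the ideographic full stop (U+3002) as its only end-of-sentence marker. The sample text is made up:

```perl
use strict;
use warnings;
use utf8;                                # this source contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

# Made-up sample: two sentences, each ended by the ideographic full stop.
my $text = "今天天气很好。我们去公园。";

# Split right after each full stop; the lookbehind keeps the marker
# attached to the sentence it ends.
my @sentences = split /(?<=。)/, $text;

print scalar(@sentences), " sentences\n";
print "$_\n" for @sentences;
```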

    But in others (e.g. English), the character used for the end-of-sentence marker is also used for lots of other things -- it is ambiguous, and it can be hard to tell, in any "algorithmic" way, whether a given period marks the end of a sentence or not. It's easy to spot all occurrences of the period character, but it takes a little more work to know which ones are sentence boundaries (and in some situations it takes a lot more work).
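To illustrate the ambiguity, here is a sketch on a made-up English sample. The "better" pattern is only a rough heuristic with a hypothetical two-entry abbreviation list, not a general solution to sentence boundary detection:

```perl
use strict;
use warnings;

# Made-up sample: two sentences, but four period characters.
my $text = "Dr. Smith paid 3.50 dollars. He left.";

# Naive approach: treat every period as the end of a sentence.
my @naive = $text =~ /([^.]+\.)\s*/g;

# Rough heuristic: split on whitespace that follows a period and precedes
# a capital letter, unless the period belongs to "Dr." or "Mr."
# (a hypothetical, far from complete, abbreviation list).
my @better = split /(?<=\.)(?<!\bDr\.)(?<!\bMr\.)\s+(?=[A-Z])/, $text;

print "naive:  [$_]\n" for @naive;
print "better: [$_]\n" for @better;
```

The naive pattern yields four pieces, cutting after `Dr.` and inside `3.50`, while the heuristic yields the two intended sentences. Real text will contain cases this heuristic gets wrong.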

Re: Regular expressions....
by ccn (Vicar) on Jul 12, 2004 at 17:59 UTC
    $data = "This is a test 1. This is a 2 test. This is a last test.";
    @lexemes = $data =~ /([^.]+\.)\s*/g;

    see perldoc perlre for more details

    Update: \s* added

    --
    any code is tested unless otherwise stated

      Add \s* after the \.
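A small sketch of what the trailing `\s*` changes, using the same sample string. The array names `@without` and `@with` are chosen here for illustration:

```perl
use strict;
use warnings;

my $data = "This is a test 1. This is a 2 test. This is a last test.";

# Without \s*, the space after each dot becomes the start of the next capture.
my @without = $data =~ /([^.]+\.)/g;

# With \s*, that whitespace is consumed outside the capture group.
my @with = $data =~ /([^.]+\.)\s*/g;

print "without: [$_]\n" for @without;
print "with:    [$_]\n" for @with;
```

Only the second and third captures differ: without `\s*` they begin with a space.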
Re: Regular expressions....
by theorbtwo (Prior) on Jul 12, 2004 at 18:36 UTC

    I don't think the word lexeme means what you think it does. (Also, it's "lexeme" and "lexemes" in English.)


    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).