Regular expressions....

stan2004 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regular expressions.... by dragonchild (Archbishop) on Jul 12, 2004 at 17:39 UTC
This smells like homework. Here are a few hints:
You'll want your regular expression to span multiple lines. This means you'll want the '.' special character to match newlines. (This requires an option to be added to the regex.)
Figure out what the "regular" parts of your "expression" are. For example, how would you explain the solution to a person who doesn't understand the example given? That's how you have to tell the computer.
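The option the first hint alludes to is presumably the /s modifier described in perlre, which lets '.' match a newline. A minimal sketch of the difference, using a made-up two-line string (the sample text and variable names are assumptions, not taken from the original question):

```perl
use strict;
use warnings;

# Hypothetical sample text: one sentence broken across two lines.
my $text = "This is a\ntest. Done.";

# Without /s, '.' stops at the newline, so the match cannot span lines.
my $without = $text =~ /This.*test/  ? 1 : 0;

# With /s, '.' also matches "\n", so the match spans both lines.
my $with    = $text =~ /This.*test/s ? 1 : 0;

print "without /s: $without\n";
print "with /s:    $with\n";
```

This only shows what the modifier does; working out the rest of the pattern is still left to the reader, as the hint intends.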
Re: Regular expressions.... by Roy Johnson (Monsignor) on Jul 12, 2004 at 20:31 UTC
You don't need regular expressions (except maybe to throw away the header lines).
`$/ = '.';              # Records are terminated by dots
my $lnum = 0;
while (<>) {
    tr/\n//d;          # Get rid of any embedded newlines
    ++$lnum;
    print "$_ --->>> record $lnum\n";
}`
Re^2: Regular expressions.... by graff (Chancellor) on Jul 13, 2004 at 02:04 UTC
`tr/\n//d; # Get rid of any embedded newlines`
You should really replace "\n" with a plain space character, so you don't get words like "isa" when you want "is a".
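A small sketch of the difference between deleting the newline and translating it to a space. The sample record and the variable names are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical record with a newline embedded between two words.
my $record = "This is\na test.";

# Deleting the newline glues the neighbouring words together.
(my $deleted = $record) =~ tr/\n//d;

# Translating the newline to a space keeps the words apart.
(my $replaced = $record) =~ tr/\n/ /;

print "$deleted\n";    # This isa test.
print "$replaced\n";   # This is a test.
```

If a line in the input ends with a space before its newline, this leaves two spaces in a row, which may or may not matter for the output you want.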
Re: Regular expressions.... by graff (Chancellor) on Jul 13, 2004 at 02:30 UTC
You're talking about sentence boundary detection. In some writing forms (e.g. Chinese), the end-of-sentence marker is unambiguous, and you can just read whole files or whole paragraphs into one scalar variable and use "split" with the distinctive end-of-sentence character (or use that character as the input record separator $/).
But in others (e.g. English), the character used for the end-of-sentence marker is also used for lots of other things -- it is ambiguous, and it can be hard to tell, in any "algorithmic" way, whether a given period marks the end of a sentence or not. It's easy to spot all occurrences of the period character, but it takes a little more work to know which ones are sentence boundaries (and in some situations it takes a lot more work).
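To make the ambiguity concrete, here is a rough sketch that splits on every period. The sample text is invented for illustration: it holds two sentences but six period characters, because of an abbreviation, a decimal number, and "p.m.".

```perl
use strict;
use warnings;

# Hypothetical sample: two sentences, six period characters.
my $text = "Dr. Smith paid 3.50 for it. He left at 5 p.m. sharp.";

# Naive approach: treat every period as a sentence boundary.
my @pieces = split /\./, $text;

printf "%d pieces\n", scalar @pieces;
print "[$_]\n" for @pieces;
```

The split yields six fragments rather than two sentences, which is the "easy to spot, harder to classify" problem described above. Telling the real boundaries apart usually takes extra rules, such as a list of known abbreviations, and no simple rule set covers every case.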
Re: Regular expressions.... by ccn (Vicar) on Jul 12, 2004 at 17:59 UTC
`$data = "This is a test 1. This is a 2 test. This is a last test.";
@lexemes = $data =~ /([^.]+\.)\s/g;`
See perldoc perlre for more details.
Update: `\s` added.
Re^2: Regular expressions.... by ysth (Canon) on Jul 12, 2004 at 19:49 UTC
Add `\s*` after the `\.`
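A sketch of why this matters, reading the suggestion as "use `\s*` where the updated code in the parent node has a mandatory `\s`" (that reading is an assumption). With a mandatory `\s`, a sentence that ends the string with no trailing whitespace is never captured:

```perl
use strict;
use warnings;

my $data = "This is a test 1. This is a 2 test. This is a last test.";

# Mandatory \s: the final sentence has no whitespace after its period,
# so it does not match.
my @strict = $data =~ /([^.]+\.)\s/g;

# Optional \s*: trailing whitespace is consumed when present,
# but is not required.
my @relaxed = $data =~ /([^.]+\.)\s*/g;

printf "with \\s:  %d sentences\n", scalar @strict;
printf "with \\s*: %d sentences\n", scalar @relaxed;
print "[$_]\n" for @relaxed;
```

Because the whitespace sits outside the capture group, the captured sentences carry no leading spaces. Newlines inside a sentence are still kept, since `[^.]` matches them too.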
Re: Regular expressions.... by theorbtwo (Prior) on Jul 12, 2004 at 18:36 UTC
I don't think the word lexeme means what you think it does. (Also, it's "lexeme" and "lexemes" in English.)