Re: Regex capture between word and punctuation
by swampyankee (Parson) on Mar 18, 2008 at 16:49 UTC
Why, O Why, does this look like a homework assignment? Never mind.
Hint 1: The regex you would be looking for is something like this: /\bit\b.*[,.?]$/i
Hint 2: use a regex to strip off everything before the first (last?) "it" and print the remainder of the line.
Note that hints are not complete solutions.
Also, your title is not sufficiently descriptive of the problem you are looking for assistance on.
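Hint 2 could be sketched as a substitution. This is only an illustration under assumptions: the sub names after_first_it and after_last_it are invented here, and whether the first or the last "it" is wanted is exactly the open question in the hint.

```perl
use strict;
use warnings;

# Hypothetical helpers (names invented for this sketch).

# Drop everything up to and including the FIRST standalone "it".
sub after_first_it {
    my ($line) = @_;
    # .*? is non-greedy, so it stops at the earliest \bit\b
    $line =~ s/^.*?\bit\b\s*//i;
    return $line;
}

# Drop everything up to and including the LAST standalone "it".
sub after_last_it {
    my ($line) = @_;
    # .* is greedy, so it gives back only as far as the final \bit\b
    $line =~ s/^.*\bit\b\s*//i;
    return $line;
}

my $line = "I saw it and liked it a lot.";
print after_first_it($line), "\n";
print after_last_it($line),  "\n";
```

The only difference between the two is the greediness of the leading .* part; a line with no standalone "it" comes back unchanged.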
emc
Information about American English usage here and here. Floating point issues? Please read this before posting.
| [reply] |
Thank you for the help. This isn't homework; I'm out of college, but this is from a book. I'm not looking for a complete solution, just some help with my regular expression (and apparently a loop I'm missing :D). Please don't think I want hand-holding, just a point in the right direction, so to speak, or someone to tell me what I did wrong.
I apologize for my crappy thread title; I'll be more descriptive in the future. Thanks for the fast responses!
| [reply] |
Re: Regex capture between word and punctuation
by toolic (Bishop) on Mar 18, 2008 at 17:07 UTC
negzero7,
Since you are new to the Monastery, please read How do I compose an effective node title?. Thus far, all of your nodes have had meaningless titles.
Now to your question...
I believe this code satisfies your conditions of:
- Line ending in period, comma, or question mark.
- Line containing case-insensitive word "it"
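#!/usr/bin/env perl
use warnings;
use strict;

# Take the file name from the command line.
my $file = shift @ARGV or die "Usage: $0 FILE\n";
open my $test_fh, '<', $file or die "Can not open file '$file': $!\n";
while (<$test_fh>) {
    # Standalone "it" or "It"; capture what follows, up to a final . , or ?
    if (/\b[Ii]t\b(.*)[.,?]$/) {
        print "$1\n";
    }
}
close $test_fh;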
Update: driver8 astutely points out that my solution incorrectly excludes the ending punctuation. To capture the punctuation, as the OP desires, change the regex to:
if (/\b[Ii]t\b(.*[.,?])$/) {
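A side note on the "case-insensitive" condition: [Ii]t accepts "it" and "It" but not "IT" or "iT". A minimal sketch, assuming the /i modifier is acceptable for the exercise; tail_after_it is a hypothetical helper name and not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text after the first standalone "it"
# (in any letter case), including the closing punctuation, or undef
# when the line does not qualify.
sub tail_after_it {
    my ($line) = @_;
    chomp $line;
    return $line =~ /\bit\b(.*[.,?])$/i ? $1 : undef;
}

for my $line ("Is IT ready?\n", "Submit it now.\n", "It sits here\n") {
    my $tail = tail_after_it($line);
    print defined $tail ? "match:$tail\n" : "no match\n";
}
```

The third sample has no closing punctuation, so it is rejected even though it contains "It".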
| [reply] [d/l] [select] |
Hey, nicely composed, toolic. I apologize for the lack of decent thread posts, but thanks to everyone for posting anyway.
I have a few questions about the code you posted:
What does '<' mean/do in the open line?
Any chance you can explain the pieces in your regular expression? I can follow most of it, but I think some elements are new to me.
| [reply] |
What does '<' mean/do in the open line?
Rather than just giving you the answer, perhaps it would be more useful to you if I showed you some ways of finding the answer. There is free documentation both online at open and at your *nix command prompt:
perldoc -f open
Read about the 3-argument form of open, specifically the MODE.
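For orientation while reading that documentation, here is a minimal sketch of the three-argument form. The helper name read_lines and the scratch file are assumptions made so the example runs on its own; they are not from this thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open a file read-only and return its lines.
sub read_lines {
    my ($path) = @_;
    # The second argument is the MODE:
    #   '<' read, '>' write (truncating), '>>' append.
    open my $fh, '<', $path or die "Cannot open '$path': $!\n";
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# Create a scratch file so the sketch is self-contained.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "first line\nsecond line\n";
close $tmp_fh;

print read_lines($tmp_name);
```

Keeping the mode in its own argument means a file name that happens to start with ">" or "|" cannot change what open does.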
Any chance you can explain the pieces in your regular expression?
Browse through perlretut. YAPE::Regex::Explain can explain it better than I can:
The regular expression:
(?-imsx:\b[Ii]t\b(.*)[.,?]$)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  [Ii]                     any character of: 'I', 'i'
----------------------------------------------------------------------
  t                        't'
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  [.,?]                    any character of: '.', ',', '?'
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Here is how I got that explanation:
#!/usr/bin/env perl
use warnings;
use strict;
use YAPE::Regex::Explain;
my $re = '\b[Ii]t\b(.*)[.,?]$';
my $parser = YAPE::Regex::Explain->new($re);
print $parser->explain;
| [reply] [d/l] [select] |
Re: Regex capture between word and punctuation
by halfcountplus (Hermit) on Mar 18, 2008 at 16:54 UTC
Here's what works (note that since you don't need help opening the file, I have included "test.txt" as DATA):
#!/usr/bin/perl -w
use strict;

while (<DATA>) {
    if ($_ =~ /[.,?]$/ && $_ =~ / it /) {
        my @line=split / it /,$_;
        print "$line[1]";
    }
}
__DATA__
Tim created the Module List in August 1994 and maintained it manually till
April 1996. By that time Andreas had implemented the Perl Authors Upload
Server (PAUSE) and it was happily feeding modules through to the CPAN archive sites.
Since PAUSE held a database of module information which could be maintained by module authors
it made sense for the module listing part of the Module List to be built
from that database. In April 1996 Andreas took over the automatic posting of
the Module List and I now maintain the other parts of the text. We plan to add
value to the automation over time.
Here's why:
line 1: note that my perl is in a different place than yours. "perl -w" turns on warnings, much like use warnings; (though -w applies globally rather than lexically).
line 5: uses && to get rid of your nested if.
also, the $ just before the closing / is the end-of-line anchor (it is not the $/ variable, and it is not the same thing as the /s modifier)
line 6: split the line on " it " (giving a two-element array here); n.b. the whitespace deals with "sites"
line 7: now just print the second element of the array (first=0, second=1)
There may be some weakness if you apply this to a different piece of text, but it works on that one (yes, I tested it!)
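One general caution when testing for trailing punctuation with alternation: in an ungrouped pattern the $ anchor binds only to the last alternative. The sketch below compares the two forms; the sub names are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Character class: true only when the line ENDS in ".", "," or "?".
sub ends_with_punct {
    my ($line) = @_;
    return $line =~ /[.,?]$/ ? 1 : 0;
}

# Ungrouped alternation: parsed as  \.  OR  \,  OR  \?$
# so a "." or "," anywhere in the line is enough.
sub loose_check {
    my ($line) = @_;
    return $line =~ /\.|\,|\?$/ ? 1 : 0;
}

my $line = "April 1996. By that time Andreas had implemented\n";
printf "loose: %d  anchored: %d\n", loose_check($line), ends_with_punct($line);
```

Grouping the alternatives, as in (\.|\,|\?)$, gives the same result as the character class.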
| [reply] [d/l] |
Re: Regex capture between word and punctuation
by glide (Pilgrim) on Mar 18, 2008 at 16:47 UTC
use warnings;
use strict;
while (<DATA>) {
    if (my ($part)= $_ =~ m/(it\s+.*(\.|\,|\?))/smx) {
        print "$part\n";
    }
}
__DATA__
Tim created the Module List in August 1994 and maintained it manually till
April 1996. By that time Andreas had implemented the Perl Authors Upload
Server (PAUSE) and it was happily feeding modules through to the CPAN archive sites.
Since PAUSE held a database of module information which could be maintained by module authors
it made sense for the module listing part of the Module List to be built
from that database. In April 1996 Andreas took over the automatic posting of
the Module List and I now maintain the other parts of the text. We plan to add
value to the automation over time.
ps: the subject of the post is important; choose a good one.
Re: Regex capture between word and punctuation
by ww (Archbishop) on Mar 18, 2008 at 19:11 UTC
Seconding the replies above, please note that swampyankee, halfcountplus and toolic have each addressed the problem of ensuring the regex does NOT match sit ..., fit ..., split ... or even fergeddaboddit ....
Conversely, =~ m#it\s# would match all of those, whereas each of the replies above avoids that potential problem by requiring a word boundary (#\bit\b#) or, more narrowly, a space (# it #) on both sides of the string "it".
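The difference between these patterns can be checked with a short sketch. The helper name matching_patterns and the sample sentences are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: list which of three patterns match the text.
sub matching_patterns {
    my ($text) = @_;
    my %pattern = (
        'it\s'   => qr/it\s/,     # "it" followed by whitespace, no boundary check
        '\bit\b' => qr/\bit\b/,   # "it" as a whole word
        ' it '   => qr/ it /,     # "it" with a literal space on each side
    );
    my @hits = grep { $text =~ $pattern{$_} } sort keys %pattern;
    return @hits;
}

for my $text ("please sit down.", "split it here.", "it works.") {
    printf "%-18s => %s\n", $text, join(', ', matching_patterns($text)) || 'none';
}
```

With these samples, "please sit down." is matched only by the pattern without a boundary check, while "it works." is missed by the literal-space form because nothing precedes "it" at the start of the line; \bit\b handles both cases.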
Re: Regex capture between word and punctuation
by dwm042 (Priest) on Mar 18, 2008 at 19:25 UTC
Mine is a slightly different method than the others, a kind of C-like solution. I'm personally fond of the use of 'split' above, myself.
#!/usr/bin/perl
use warnings;
use strict;
while(<DATA>) {
    chomp;
    next unless ( m/(\.|\,|\?)$/ );
    my $len = length($_);
    my $pos = index(lc($_), " it ");
    if ( $pos > -1 ) {
        my $delta = $len - $pos - 4;
        my $string = substr($_,$pos + 4,$delta);
        print "Substring = $string\n";
    }
}
__DATA__
Tim created the Module List in August 1994 and maintained it manually till April 1996.
By that time Andreas had implemented the Perl Authors Upload Server (PAUSE) and it was happily feeding modules through to the CPAN archive sites.
Since PAUSE held a database of module information which could be maintained by module authors it made sense for the module listing part of the Module List to be built from that database.
And I may have reformatted the text, but my output looks like:
C:\Code>perl substring.pl
Substring = manually till April 1996.
Substring = was happily feeding modules through to the CPAN archive sites.
Substring = made sense for the module listing part of the Module List to be built from that database.
Update: fixed the case sensitivity issue.
| [reply] [d/l] [select] |
Re: Regex capture between word and punctuation
by driver8 (Scribe) on Mar 19, 2008 at 00:16 UTC
I suspect you will soon want to know:
How do I write a program that prints only words
(not the entire line) in the file part3.txt
(attached) on a word boundary that have the
letter "p" or "P" in them?
the part3.txt is
" CPAN stands for comprehensive Perl Archive Network.
^ and $ are used as anchors in a regular expression.
/pattern/ is a pattern match operator.
Perl is very easy to learn.
Enter 'H' or 'h' for help."
In that case, please see Pattern matching, and don't bother posting a new question.
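For reference, one way such a word filter might look. This is a sketch under assumptions: a "word" is taken to be a run of \w characters, and words_with_p is a hypothetical name rather than anything from the book or the linked node.

```perl
use strict;
use warnings;

# Hypothetical helper: return the words in a line that contain "p" or "P".
sub words_with_p {
    my ($line) = @_;
    # \b\w*p\w*\b : a run of word characters with at least one p in it;
    # /i ignores case, /g collects every such word on the line.
    my @words = $line =~ /\b(\w*p\w*)\b/gi;
    return @words;
}

print join(' ', words_with_p("Perl is very easy to learn.")), "\n";
print join(' ', words_with_p("/pattern/ is a pattern match operator.")), "\n";
```

Because the match runs in list context with /g, no split is needed; every captured word is returned directly.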
Out of curiosity, what book are you using? It seems to be a poor one (based on the number of times these questions have been asked).
-driver8
Update: It appears you didn't get my message in time.
| [reply] [d/l] |
Nope, sorry, I didn't check this until just now. I appreciate the links to other examples, but I don't understand what I'm looking for exactly. I think my pattern matching for the other topic is fine, but my split function needs something.
I would prefer it if someone could spell it out nicely for me. If I am in the wrong place for this kind of stuff, please point me to where I should be asking these types of questions.
Please note I have been trying to figure this out myself. If I can't figure it out after a while of trying, I figure it's time to ask.
| [reply] |