As you all know, I'm writing a regex book. In an effort to get all the chapters up to a similar page count, I'm looking for example programs to use throughout the chapter to show how a concept plays in the actual code.
In particular, I need code that shows examples of:
- ^ and $ under the influence of /m
- global matching -- /g, /gc, \G, tricks with pos()
- substitution
The order of topics in the book is in that order, so if you have an idea for some s///g code, then it'd be found in the substitution chapter. I'm not asking for you to send me large portions of code or regexes, but rather to give me ideas. What do you use regexes for? What would you like to learn to do with regexes?
Contributions to the book will be met with the proper acknowledgements (just as blakem and Erudil and merlyn -- they're in my book).
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
Re (tilly) 1: Call for code samples!
by tilly (Archbishop) on Nov 15, 2001 at 18:15 UTC
|
A trick for the global matching that I have had occasion
to use in the past.
I was trying to parse a fixed delimited report. But the
locations of the columns was not fixed. However there
was a header, something like this:
Here Be Many Columns
-------- ---------- ---------
To locate the positioning of the columns I found the
line with the underlines and then:
my @space_loc;
while ($line =~ / /g) {
push @space_loc, pos($line);
}
You can also put together a small "parse engine" as a
demo of \G matching. Handling the CSV spec might be a
good example. | [reply] [d/l] [select] |
Re: Call for code samples!
by stefan k (Curate) on Nov 15, 2001 at 14:56 UTC
|
| [reply] [d/l] [select] |
Re: Call for code samples!
by jepri (Parson) on Nov 15, 2001 at 19:36 UTC
|
This is probably the nastiest regex I ever had to use. It's goal was to highlight Extended Letter Sequences in text. The ELS had already been found, this routine went through and turned the letters of the hidden word into uppercase. This is just a 'capitalise every nth letter' challenge, and I cheated by stripping out the spaces first:
#Block is the chunk of text, word is the 'hidden' word and sep is the
+number of characters in the text between each word
my ($block, $sep, $word)=@_;
my @let = split //, $word;
my $rep;
my $i=1;
foreach my $l (@let) {$l = "(".$l.")(.{$sep})";$rep.=
+ '\u$'.$i++.'$'.$i++;};
my $regexp2 = join "", @let;
my $ev = '$block =~ '."s/$regexp2/$rep/i;";
# print "Now scanning block with $regexp2 and replacemen
+t target $rep\n";
return $block, $word if eval $ev;
The output looks like this:
---This is the original text----------------------------------------
m at times; this thread, like that of Ariadne, when once
unraveled will conduct one through a lab
---The ELS 'FNORD' is in capitals---------------------
mattimesthisthreadlikethatoFariadNewhenOnceunRaveleDwillconductonethro
+ughalab
BTW if there's an obviously better way to do this I'd be delighted to hear about it.
____________________
Jeremy
I didn't believe in evil until I dated it. | [reply] [d/l] [select] |
Re: Call for code samples!
by lestrrat (Deacon) on Nov 15, 2001 at 14:41 UTC
|
I only remember this vaguely, but a while back I had a co worker ask me to write a simple parser to parse a proxy-cache log entry, sort of similar to Apache logs... where entries could be:
- all alpha-numeric characters w/o any whitespace
- double-quoted alpha-numeric characters ( whitespaces allowed ) where a double-quote is escaped with another double-quote
He encountered problems because he was trying to separate out the fields using split. Obviously, since an entry can contain spaces if it's quoted, you couldn't really split()
So the approach I took was something like this:
# untested... I'm sure this doesn't really work...
while( $line =~ m{\G\s*(\[\w0-9]+|"(?:[\w0-9]|"")+")}g ) {
do_something_with_match($1);
}
something like that. come to think of it, it's probably not a good example... oh well. my $0.02 | [reply] [d/l] |
Re: Call for code samples!
by chromatic (Archbishop) on Nov 16, 2001 at 00:51 UTC
|
I don't see anything in the substitution chapter that directly uses one of my favorite techniques -- calling subs on the rhs of a substitution. It may be obvious, if you think about it, but it's worth discussion. Everything (and Slash) handle links in textfields with something like this:
$text =~ s/\[([^\]]+)\]/parseLink($1)/eg;
sub parseLink {
my $link = shift;
my ($type, $title);
($type, $link, $title) = $link =~ m!(.+)://(.+)|(.+)!;
$type ||= '';
$title ||= $link;
return makeLink($link, $title, $type);
}
Somewhat hastily reconstructed out of memory. It's a little more optimized in the code, I believe. It's also a very nice way to handle more complex transformations. | [reply] [d/l] |
Re: Call for code samples!
by George_Sherston (Vicar) on Nov 15, 2001 at 14:12 UTC
|
In my simple site search script I posted recently I have three regexes working on the same lvalue to cut off the top and tail from some html and remove all the tags:
$file =~ s/^.*$startstring/$startstring/s;
$file =~ s/$endstring.*$//s;
$file =~ s/<[^>]*>/ /g;
I felt at the time this was a bit ugly, and should certainly have been doable in one regex, but didn't work out how - that's something I'd like to be able to do with a regex, though I agree it's not all that sophisticated.
§ George Sherston | [reply] [d/l] |
Re: Call for code samples!
by brianarn (Chaplain) on Nov 15, 2001 at 20:17 UTC
|
I know this is pretty simplistic, but is a good start possibly.
One of the tools we use where I work is essentially a checklist of things that need to be done that day, at certain times. It can generate a CSV, but instead of using ',' as a delimiter, it uses ', ' with that extra whitespace - so after using Text::ParseWords to split up the CSV, I'd run this regex on each element to remove the leading whitespace
map s/^\s//, @fields;
That's not a direct sample of the code (I don't have it on this machine) but that's what it looked like as far as I remember.
~Brian | [reply] [d/l] |
Re: Call for code samples!
by japhy (Canon) on Nov 15, 2001 at 20:22 UTC
|
Thanks for the ideas, all. Sadly, most of those are already in the text (that's because they're common). Tilly and jepri have ideas that aren't in my book, though, so I'll see where I can work them in.
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??; | [reply] |
|
| [reply] |
Re: Call for code samples!
by jryan (Vicar) on Nov 15, 2001 at 20:49 UTC
|
Here is a snippet that I used in Quick Memos, that some of your Business-Oriented customers might find useful someday:
my($username1,$username2)=($username)x2;
$username1 =~ s/([a-z])([a-z]+)_([a-z])([a-z]+)\@.*/$1/g;
$username2 =~ s/([a-z])([a-z]+)_([a-z])([a-z]+)\@.*/$3/;
$username1 =~ tr/[a-z]/[A-Z]/;
$username2 =~ tr/[a-z]/[A-Z]/;
$username =~ s/([a-z])([a-z]+)_([a-z])([a-z]+)/$username1$2 $username2
+$4/;
This will break an apart an email addresss of the form: first_name@domain.whatever and change it into the form:
First Name.
Looking back, I could have also done it like this:
$username =~ s/([a-z])([a-z]+)_([a-z])([a-z]+)\@.*/uc($1).$2." ".uc($3
+).$4/ge;
Thats slightly more (complic|obfusc)ated, but it still gets the job done.
| [reply] [d/l] [select] |
(Ovid) Re: Call for code samples!
by Ovid (Cardinal) on Nov 16, 2001 at 02:36 UTC
|
I once had to work with mapping a lot of text data in a spreadsheet to keys in a database. The database tables were for products and the currencies they were sold in. Rather than doing a lot of very tedious (and error-prone) replacement of the text data with the database IDs, I chose to have Perl do it for me.
I exported the text from the spreadsheed into a tab-delimited file. I then wrote a Perl script that read the tables and created hashes for them with the key being the text name and the id being the value. Then, the script opened the text file and rewrote every line using a substitution. Here is the relevant snippet:
open FILE, '+< products.txt' or die $!;
my @lines =
map { s/^([^\t]+)\t(.*)/
exists $product{$1} and exists $currency{$2}
? "$product{$1}\t$currency{$2}"
: "$1\t$2" /e; $_ }
<FILE>;
seek FILE, 0,0;
print FILE @lines;
truncate FILE, tell(FILE);
close FILE;
The text file was rather large and in about five minutes, I accomplished many hours of manual labor. I don't claim that this is terribly maintainable, but this was used to populate about four different lookup tables and then discarded.
Cheers,
Ovid
Update: I forgot to mention one little problem: I had gotten the columns reversed. Rather than messing around with the import function in the database, I did the following:
perl -pi.bak -e "s/(\d+)\t(\d+)/$2\t$1/" products.txt
Join the Perlmonks Setiathome Group or just click on the the link and check out our stats. | [reply] [d/l] [select] |
Re: Call for code samples!
by runrig (Abbot) on Nov 15, 2001 at 23:14 UTC
|
My only offerings for /G and pos are this and this from the Pattern Finding thread (it does use pos() as an lvalue...).
And for /g, a thing to split up a block of text, preserving existing line breaks, but having a max line size: print $1,"\n"
while $text =~ /([^\n]{0,$max_line_size}})\n?/g;
Oh, and a '/g' example from a recent thread (of which you've probably already got something similar).
And this thing which reads from a socket until I have an end-token on a line: while ($sel->can_read) {
sysread($fh, $buffer, 2048, length($buffer));
next unless $buffer =~ /^(EOT|DONE)$/m;
....(process $buffer)
}
| [reply] [d/l] [select] |
Re: Call for code samples!
by petral (Curate) on Nov 16, 2001 at 07:51 UTC
|
To print the first and last line of each file:
$/ = undef;
while (<>) {
print "\n$ARGV\n", / (.+?) ^.*^ (.+?) \z/xms
}
To find the first and last match for a pattern:
(In this case, 'DT=20011105.112009'-style time-stamps in a log file)$dtrx = qr/[Dd][Tt]=(\d{8}\.\d{6})/;
$/ = undef;
while (<>) {
($aa, $z) = /$dtrx # get the first one
(?: # and maybe,
.* # after maybe many lines,
^.*? $dtrx # the first one on the last line
)? # that has one
/mxs;
$aa or next;
$z ||= ''; # or '' if no more
print "$ARGV\t $aa -- $z\n";
}
In this case there can be more than one such pattern on a line and it's the first one that is valid.
The point being that the /s makes a greedy, black-hearted .* for going straight to the end of the (presumedly long) file, ignoring linebreaks.   Then the /m (together with the ? to reign in dot-star) allows backtracking by line to find the first match on a line with matches.  
update:   These started out (obviously) as one liners, then got thrown into files.  The first time I actually looked at them was when I grabbed them to throw up here which, of course, lead to exploring MTOWTDI.  The first seems clearest this way. The second started out being far more complicated which is why it's in multi-line /x format.
  p
| [reply] [d/l] [select] |
Re: Call for code samples!
by argus (Acolyte) on Nov 15, 2001 at 23:15 UTC
|
Not sure if this is what you are looking for, but this is something I wrote to parse some log files while running load tests on DNS servers:
if(m/(([0-9]+)-((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec))-(
+[0-9]+))\s+((?:0?[0-9]|1[0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9].[0-9]{3})
+\s+(.*)\/(\d+.\d+.\d+.\d+)\/($lookingfor)\/(a).*/i)
{
## Notes:
# $1 = Whole Date (DD-MMM-YYYY)
# $2 = Day (DD)
# $3 = Month (MMM as text)
# $4 = Year (YYYY)
# $5 = Time (HH:MM:SS.sss)
# $6 = Queries info stuff (is not used)
# $7 = IP of requesting name server
# $8 = what we look for ($lookingfor variable)
# $9 = type of record (A/MX/CNAME)
}
not awfully complicated, though. | [reply] [d/l] |
Re: Call for code samples!
by hsmyers (Canon) on Nov 16, 2001 at 01:20 UTC
|
Nothing particularly spectacular here, just code that I use…
sub Text2HTML {
my $s = shift;
my $p;
my $m;
my $r;
my $mcd = new Text::DelimMatch '"';
if ($s) {
$s =~ s/&/&\;/g;
$s =~ s/\\\'/&\#39\;/g;
$s =~ s/</<\;/g;
$s =~ s/>/>\;/g;
$s =~ s/\.\.\./&hellip\;/g;
$s =~ s/--/&mdash\;/g;
while ($s =~ /\"/) {
($p,$m,$r) = $mcd->match($s);
if ($m) {
$m =~ /^\"(.*?)\"$/;
$s = $p.'“'.$1.'”'.$r;
}
}
}
return $s;
}
hsm
| [reply] [d/l] |
Re: Call for code samples!
by petral (Curate) on Nov 28, 2001 at 04:06 UTC
|
Another fun one:
  add up all the digits in a number:
$n = 6345789;
$n =~ s/\B/+/g;
$t = eval $n; # 42
$n = '634-5789';
$n =~ s/\B/+/g;
$t = eval $n; # 32
  p | [reply] [d/l] |
|
|