Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?

Call for code samples!

by japhy (Canon)
on Nov 15, 2001 at 12:04 UTC ( #125523=perlmeditation: print w/replies, xml ) Need Help??

As you all know, I'm writing a regex book. In an effort to get all the chapters up to a similar page count, I'm looking for example programs to use throughout the chapter to show how a concept plays in the actual code.

In particular, I need code that shows examples of:

  • ^ and $ under the influence of /m
  • global matching -- /g, /gc, \G, tricks with pos()
  • substitution
The order of topics in the book is in that order, so if you have an idea for some s///g code, then it'd be found in the substitution chapter. I'm not asking for you to send me large portions of code or regexes, but rather to give me ideas. What do you use regexes for? What would you like to learn to do with regexes?

Contributions to the book will be met with the proper acknowledgements (just as blakem and Erudil and merlyn -- they're in my book).

Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Replies are listed 'Best First'.
Re (tilly) 1: Call for code samples!
by tilly (Archbishop) on Nov 15, 2001 at 18:15 UTC
    A trick for the global matching that I have had occasion to use in the past.

    I was trying to parse a fixed delimited report. But the locations of the columns was not fixed. However there was a header, something like this:

    Here Be Many Columns -------- ---------- ---------
    To locate the positioning of the columns I found the line with the underlines and then:
    my @space_loc; while ($line =~ / /g) { push @space_loc, pos($line); }
    You can also put together a small "parse engine" as a demo of \G matching. Handling the CSV spec might be a good example.
Re: Call for code samples!
by stefan k (Curate) on Nov 15, 2001 at 14:56 UTC
    just going through some code of mine...
    • Say you got some variables stored in a hash: name=>value and whant to replace them in a text where they are used in a make-like form: $(VARIABLE)
      foreach (keys %global_variables) { $content =~ s/\$\($_\)/$global_variables{$_}/g; }
    • When you want to indent a part of text and calculated the indent already you might use:
      $content =~ s/\n/\n$indent /g;
    • Some really often used substitutions are the whitespace deletions: 1. All whitespaces: $line =~ s/\s+//g; 2. Only leading whitespaces: $line =~ s/^\s*//; 3. Only trailing whitespaces: $line =~ s/\s*$//;
    • Say you got a GUI (in my case it was Gtk) and want to display some messages. In the preview window though only the first line shall be displayed, marking that more is available:
      $text =~ s/\n.*/ [more\.\.\.]/s;
    That's it for the moment. The rest of my regexp-substitutions is mostly boring ;-)

    Update: Added the missing parens around VARIABLE

    Regards... Stefan
    you begin bashing the string with a +42 regexp of confusion

Re: Call for code samples!
by jepri (Parson) on Nov 15, 2001 at 19:36 UTC
    This is probably the nastiest regex I ever had to use. It's goal was to highlight Extended Letter Sequences in text. The ELS had already been found, this routine went through and turned the letters of the hidden word into uppercase. This is just a 'capitalise every nth letter' challenge, and I cheated by stripping out the spaces first:

    #Block is the chunk of text, word is the 'hidden' word and sep is the +number of characters in the text between each word my ($block, $sep, $word)=@_; my @let = split //, $word; my $rep; my $i=1; foreach my $l (@let) {$l = "(".$l.")(.{$sep})";$rep.= + '\u$'.$i++.'$'.$i++;}; my $regexp2 = join "", @let; my $ev = '$block =~ '."s/$regexp2/$rep/i;"; # print "Now scanning block with $regexp2 and replacemen +t target $rep\n"; return $block, $word if eval $ev;

    The output looks like this:

    ---This is the original text---------------------------------------- m at times; this thread, like that of Ariadne, when once unraveled will conduct one through a lab ---The ELS 'FNORD' is in capitals--------------------- mattimesthisthreadlikethatoFariadNewhenOnceunRaveleDwillconductonethro +ughalab

    BTW if there's an obviously better way to do this I'd be delighted to hear about it.

    I didn't believe in evil until I dated it.

Re: Call for code samples!
by lestrrat (Deacon) on Nov 15, 2001 at 14:41 UTC

    I only remember this vaguely, but a while back I had a co worker ask me to write a simple parser to parse a proxy-cache log entry, sort of similar to Apache logs... where entries could be:

    • all alpha-numeric characters w/o any whitespace
    • double-quoted alpha-numeric characters ( whitespaces allowed ) where a double-quote is escaped with another double-quote

    He encountered problems because he was trying to separate out the fields using split. Obviously, since an entry can contain spaces if it's quoted, you couldn't really split()

    So the approach I took was something like this:

    # untested... I'm sure this doesn't really work... while( $line =~ m{\G\s*(\[\w0-9]+|"(?:[\w0-9]|"")+")}g ) { do_something_with_match($1); }

    something like that. come to think of it, it's probably not a good example... oh well. my $0.02

Re: Call for code samples!
by chromatic (Archbishop) on Nov 16, 2001 at 00:51 UTC
    I don't see anything in the substitution chapter that directly uses one of my favorite techniques -- calling subs on the rhs of a substitution. It may be obvious, if you think about it, but it's worth discussion. Everything (and Slash) handle links in textfields with something like this:
    $text =~ s/\[([^\]]+)\]/parseLink($1)/eg; sub parseLink { my $link = shift; my ($type, $title); ($type, $link, $title) = $link =~ m!(.+)://(.+)|(.+)!; $type ||= ''; $title ||= $link; return makeLink($link, $title, $type); }
    Somewhat hastily reconstructed out of memory. It's a little more optimized in the code, I believe. It's also a very nice way to handle more complex transformations.
Re: Call for code samples!
by George_Sherston (Vicar) on Nov 15, 2001 at 14:12 UTC
    In my simple site search script I posted recently I have three regexes working on the same lvalue to cut off the top and tail from some html and remove all the tags:

    $file =~ s/^.*$startstring/$startstring/s; $file =~ s/$endstring.*$//s; $file =~ s/<[^>]*>/ /g;
    I felt at the time this was a bit ugly, and should certainly have been doable in one regex, but didn't work out how - that's something I'd like to be able to do with a regex, though I agree it's not all that sophisticated.

    George Sherston
Re: Call for code samples!
by brianarn (Chaplain) on Nov 15, 2001 at 20:17 UTC
    I know this is pretty simplistic, but is a good start possibly. One of the tools we use where I work is essentially a checklist of things that need to be done that day, at certain times. It can generate a CSV, but instead of using ',' as a delimiter, it uses ', ' with that extra whitespace - so after using Text::ParseWords to split up the CSV, I'd run this regex on each element to remove the leading whitespace
    map s/^\s//, @fields;
    That's not a direct sample of the code (I don't have it on this machine) but that's what it looked like as far as I remember.

Re: Call for code samples!
by japhy (Canon) on Nov 15, 2001 at 20:22 UTC
    Thanks for the ideas, all. Sadly, most of those are already in the text (that's because they're common). Tilly and jepri have ideas that aren't in my book, though, so I'll see where I can work them in.

    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      fwiw, japhy, i was reading a couple of chapters of your book this past weekend, and found them to be quite lucid. quite an accomplishment, 'cause if i'm not in the right mood regexs give me headaches (grin).

      seriously, it's nice work.

Re: Call for code samples!
by jryan (Vicar) on Nov 15, 2001 at 20:49 UTC
    Here is a snippet that I used in Quick Memos, that some of your Business-Oriented customers might find useful someday:
    my($username1,$username2)=($username)x2; $username1 =~ s/([a-z])([a-z]+)_([a-z])([a-z]+)\@.*/$1/g; $username2 =~ s/([a-z])([a-z]+)_([a-z])([a-z]+)\@.*/$3/; $username1 =~ tr/[a-z]/[A-Z]/; $username2 =~ tr/[a-z]/[A-Z]/; $username =~ s/([a-z])([a-z]+)_([a-z])([a-z]+)/$username1$2 $username2 +$4/;

    This will break an apart an email addresss of the form: first_name@domain.whatever and change it into the form: First Name.

    Looking back, I could have also done it like this:
    $username =~ s/([a-z])([a-z]+)_([a-z])([a-z]+)\@.*/uc($1).$2." ".uc($3 +).$4/ge;
    Thats slightly more (complic|obfusc)ated, but it still gets the job done.
(Ovid) Re: Call for code samples!
by Ovid (Cardinal) on Nov 16, 2001 at 02:36 UTC

    I once had to work with mapping a lot of text data in a spreadsheet to keys in a database. The database tables were for products and the currencies they were sold in. Rather than doing a lot of very tedious (and error-prone) replacement of the text data with the database IDs, I chose to have Perl do it for me.

    I exported the text from the spreadsheed into a tab-delimited file. I then wrote a Perl script that read the tables and created hashes for them with the key being the text name and the id being the value. Then, the script opened the text file and rewrote every line using a substitution. Here is the relevant snippet:

    open FILE, '+< products.txt' or die $!; my @lines = map { s/^([^\t]+)\t(.*)/ exists $product{$1} and exists $currency{$2} ? "$product{$1}\t$currency{$2}" : "$1\t$2" /e; $_ } <FILE>; seek FILE, 0,0; print FILE @lines; truncate FILE, tell(FILE); close FILE;

    The text file was rather large and in about five minutes, I accomplished many hours of manual labor. I don't claim that this is terribly maintainable, but this was used to populate about four different lookup tables and then discarded.


    Update: I forgot to mention one little problem: I had gotten the columns reversed. Rather than messing around with the import function in the database, I did the following:

    perl -pi.bak -e "s/(\d+)\t(\d+)/$2\t$1/" products.txt

    Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.

Re: Call for code samples!
by runrig (Abbot) on Nov 15, 2001 at 23:14 UTC
    My only offerings for /G and pos are this and this from the Pattern Finding thread (it does use pos() as an lvalue...).

    And for /g, a thing to split up a block of text, preserving existing line breaks, but having a max line size:

    print $1,"\n" while $text =~ /([^\n]{0,$max_line_size}})\n?/g;
    Oh, and a '/g' example from a recent thread (of which you've probably already got something similar).

    And this thing which reads from a socket until I have an end-token on a line:

    while ($sel->can_read) { sysread($fh, $buffer, 2048, length($buffer)); next unless $buffer =~ /^(EOT|DONE)$/m; ....(process $buffer) }
Re: Call for code samples!
by petral (Curate) on Nov 16, 2001 at 07:51 UTC
    To print the first and last line of each file:
    $/ = undef; while (<>) { print "\n$ARGV\n", / (.+?) ^.*^ (.+?) \z/xms }
    To find the first and last match for a pattern:
    (In this case, 'DT=20011105.112009'-style time-stamps in a log file)
    $dtrx = qr/[Dd][Tt]=(\d{8}\.\d{6})/; $/ = undef; while (<>) { ($aa, $z) = /$dtrx # get the first one (?: # and maybe, .* # after maybe many lines, ^.*? $dtrx # the first one on the last line )? # that has one /mxs; $aa or next; $z ||= ''; # or '' if no more print "$ARGV\t $aa -- $z\n"; }
    In this case there can be more than one such pattern on a line and it's the first one that is valid.

    The point being that the /s makes a greedy, black-hearted .* for going straight to the end of the (presumedly long) file, ignoring linebreaks.   Then the /m (together with the ? to reign in dot-star) allows backtracking by line to find the first match on a line with matches.  

    update:   These started out (obviously) as one liners, then got thrown into files.  The first time I actually looked at them was when I grabbed them to throw up here which, of course, lead to exploring MTOWTDI.  The first seems clearest this way. The second started out being far more complicated which is why it's in multi-line /x format.

Re: Call for code samples!
by argus (Acolyte) on Nov 15, 2001 at 23:15 UTC
    Not sure if this is what you are looking for, but this is something I wrote to parse some log files while running load tests on DNS servers:

    if(m/(([0-9]+)-((?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec))-( +[0-9]+))\s+((?:0?[0-9]|1[0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9].[0-9]{3}) +\s+(.*)\/(\d+.\d+.\d+.\d+)\/($lookingfor)\/(a).*/i) { ## Notes: # $1 = Whole Date (DD-MMM-YYYY) # $2 = Day (DD) # $3 = Month (MMM as text) # $4 = Year (YYYY) # $5 = Time (HH:MM:SS.sss) # $6 = Queries info stuff (is not used) # $7 = IP of requesting name server # $8 = what we look for ($lookingfor variable) # $9 = type of record (A/MX/CNAME) }
    not awfully complicated, though.
Re: Call for code samples!
by hsmyers (Canon) on Nov 16, 2001 at 01:20 UTC
    Nothing particularly spectacular here, just code that I use…
    sub Text2HTML { my $s = shift; my $p; my $m; my $r; my $mcd = new Text::DelimMatch '"'; if ($s) { $s =~ s/&/&amp\;/g; $s =~ s/\\\'/&\#39\;/g; $s =~ s/</&lt\;/g; $s =~ s/>/&gt\;/g; $s =~ s/\.\.\./&hellip\;/g; $s =~ s/--/&mdash\;/g; while ($s =~ /\"/) { ($p,$m,$r) = $mcd->match($s); if ($m) { $m =~ /^\"(.*?)\"$/; $s = $p.'&ldquo;'.$1.'&rdquo;'.$r; } } } return $s; }


Re: Call for code samples!
by petral (Curate) on Nov 28, 2001 at 04:06 UTC
    Another fun one:   add up all the digits in a number:
    $n = 6345789; $n =~ s/\B/+/g; $t = eval $n; # 42 $n = '634-5789'; $n =~ s/\B/+/g; $t = eval $n; # 32

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://125523]
Approved by root
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (6)
As of 2022-08-11 12:39 GMT
Find Nodes?
    Voting Booth?

    No recent polls found