RFC: Text Processing for Chemists Tutorial

This is a talk I'm giving later today to a group of computational chemistry students to introduce them to what you can do with text files on the command line. Originally it was going to be just Perl, but then I decided to start with the standard utilities, so the Perl isn't until the second half, but it's there in force ;).

The goal isn't to have them walk out of the 1-hour seminar knowing how to use all the tools, but hopefully knowing that the tools are out there, with several starting/jumping-off points for using them.

Any & all comments appreciated, especially with the goal of making this (either as-is or modified) into a piece suitable for the Tutorials section.

=pod

=head1 Introduction to Text Utilities

UGA Chemistry Summer Lecture Series, June 2, 2006

=head1 Intended Audience

Chemistry summer research students (upperclassmen & graduate) doing computational tasks.

=head1 Abstract

This guide is an introduction, demonstration, and reference point to expose Chemistry summer research students to the realm of text processing via the command-line interface and all of the power and efficiency it offers. Using standard text utilities will lead up to more advanced scripting with Perl.

=head1 Environment

Linux running the bash shell

=head1 Author & Presenter

David Westbrook, E<lt>dwestbrook@gmail.comE<gt>

=head1 Warning Label

This presentation is a crash course.  The primary goal is exposure to the freely available tools and methods so that they can be learned and utilized in the future as the need arises.

=head1 The Toolbox

First we will review all of the basic tools we have available by default for working with and manipulating text files.

=over 4

=item man

THIS IS ONE OF THE MOST IMPORTANT COMMANDS.  "man" stands for "Manual" and provides documentation for all of the commands in this document.

  man man
  man cd
  man ls

=back

=head2 Seeing files

Before we can work with a file, we have to know where it is.

=over 2

=item cd

This is used to Change Directories.

  cd /tmp
  cd ~
  cd -
  cd ..
  cd ../../foo/bar

=item pwd

This Prints the Working Directory -- i.e. tells you what directory you are in.

=item ls

LiSts files.  Shows the contents of the current directory. See the man page for the many options.

  ls
  ls /tmp
  ls -lart
  ls -lart ../foo

=item locate

Searches all the filenames on the system for a given search string.  The availability of this command can be system-dependent, and there are several caveats: 1) it works off a database that is generated nightly, so files created today won't be found; 2) it respects permissions, so files that you can't read won't be found; 3) it only works on the local filesystem, so files in mounted directories won't be found.

  locate host
  locate pass
  locate etc/pa

=item find

Recursively lists all of the files in the given (defaults to current) directory. Many, many options in the man page.

  find
  find /tmp
  find /tmp -type f
  find /tmp -maxdepth 1 -type f -mtime +6 -exec echo {} \;

=back


=head2 File information

We need to be able to obtain basic information about a file to know what we're working with.

=over 2

=item ls -l

List the details of a file. This includes the permissions, owner, group, size, and last modified date.

=item wc

WordCount.  Displays the number of lines, words, and bytes in a file.

=item file

Attempts to determine the file's contents -- e.g. HTML, text, binary, Excel, etc.

=item identify

Similar to L<file> but for graphics files. Will include size and color information.  This is provided by the L<http://www.imagemagick.org> toolset.

=back

=head2 File contents

Now we can begin to work with the file's actual contents. Note that "print" means "output to the screen" in this context.

=over 2

=item cat

Just prints out the contents of each file it's given. (Same as I<type> in DOS)

  cat file1
  cat file1 file2
  cat -n file1

=item less

Shows a file one screen at a time (known as a 'pager'). (There is also a command 'more', but it has fewer features than L<less>.)

=item head

Prints out the first N lines of a file.

  head file1
  head -2 file1
  head -3 file1 file2

=item tail

Prints out the last N lines of a file.

  tail file1
  tail -2 file1
  tail -3 file1 file2
  tail -n +5 file1

=item grep

Search files for a given string and print the matching lines.  See man page for many, many options.

  grep foo file1
  grep -i foo file1
  grep -l foo *
  grep -n foo file1
  grep -A3 foo file1

=item strings

Prints out the runs of printable characters found in a file. Especially useful on binary files for finding the pieces of text buried in their compiled contents.

  strings a.out
  strings foo.exe
  strings /bin/ls

=item sort

Orders (i.e. sorts!!) the lines of a file.  See man page for details.

  sort file1

=item uniq

Displays just the unique lines of a file.  The file must be sorted.

=item cut

Print just the specified columns of a file.  See man page for details.


  cut -f1,3 file1
  cut -f1,5,6 -d: file1

=item split

Split a file into chunks. See man page.

=item join

Combine two files based on a common column. See man page.

=back

=head2 File Management

These are listed for quick reference -- refer to the man pages for further details.

=over 2

=item cp

=item mv

=item rm

=item mkdir

=item rmdir

=back

=head2 Editors

=over 2

=item vi

vi (or vim) does have a little bit of a learning curve, but is well worth it -- it is very powerful and is available on pretty much every *nix machine (there is gvim for Windows, too).  It is best to find a reference (book or online tutorial) for the commands.  Some essentials:

  :q   quits
  :q!  quits w/o saving
  :w   save
  :w!  force save
  :wq  saves and quits
  i    enter editing (insert) mode
  ESC  return to command mode
  /foo search for foo
  :s/foo/bar  replace the first foo on the current line with bar

Others that you'll want to know (in no particular order):

  yy p dd dw w :$ :1 :55 :s/foo/bar/g :%s/foo/bar :%s/foo/bar/g :5,10s/foo/bar n N :n :N :wn :wN ctrl-g s x

=item view

Same as L<vi> but starts it in read-only mode.  It's a very good habit to use L<view> when you know you're only looking at a file so you don't accidentally change it.

=back

=head2 Miscellaneous

=over 2

=item clear

Clears the screen -- same as cls in DOS.

=item echo

Just displays its arguments to the screen (same as DOS).

  echo blah
  echo path=$PATH
  echo -n foo
  echo -e 'foo\tbar\nstuff'

=item touch

Updates the last modified timestamp on a file.  If the file doesn't exist, creates a 0-byte file.

=item seq

Prints out sequences of numbers.  See options. Also see L</Loops> for example usage.

  seq 1 10
  seq 1 2 10

=item cal

Prints out a nicely formatted calendar.

  cal
  cal 7 2006

=item look

Prints out words from a dictionary file that start with the given string.

  look foo
  look princ

=item date

Prints out the date.  See options in man page for various formats.

  date
  date +%Y-%m-%d

=item sleep

Pauses for N seconds.

  sleep 2

=item alias

Define your own commands.

  alias cls=clear

=item wget

Gets files from the web (or ftp). Extremely useful and powerful -- can mirror entire sites. See man page for lots of options.

  wget http://foo.example.com/blah.tar.gz
  wget ftp://foo.example.com/blah.tar.gz

=item curl

Another tool to get remote files (in case wget isn't available).

  curl -O http://foo.example.com/blah.tar.gz

=item lynx

A text-based web browser! Useful for simple pages, testing connections, sucking down source code, converting html to text, or downloading files from HTTP or FTP sites.

=back

=head1 Combining Tools

=head2 Pipes

'|' is the "pipe" character.  It is used to take the output from the left-hand side (LHS) and give/shove ("pipe") it as input to the right-hand side (RHS).  Here are several example tasks that consecutively use two or more of the tools we have discussed.

=head3 Find a word that starts with "c" and has a "mel" in it.

  look c | grep mel

=head3 See if the word FOO is in the first 3 lines of a file.

  head -3 file1 | grep FOO

=head3 Take the lines that have FOO, look at just the first column, and show the unique values

  grep FOO file1 | cut -f1 | sort -u

=head3 Determine the location of a file with FOO in its name.

  find | grep FOO
  locate FOO

=head2 Redirection

The output of a command can be saved to another file.

=head3 Output

  grep FOO file1 > foo_lines

=head3 Append Output

  grep FOO file1 >> foo_lines
  grep FOO file2 >> foo_lines

=head3 Input

  grep FOO file1
  cat file1 | grep FOO
  grep FOO < file1

  a.out < input.dat

=head3 Backticks

  echo `date`
  ls -lart `find | grep FOO`
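
Backticks run the enclosed command and paste its output into the surrounding command line. Bash also accepts the $( ) spelling, which does the same thing and is easier to nest. A minimal sketch, with variable names chosen only for illustration:

```shell
# Capture a command's output in a variable, using both spellings.
today=`date +%Y`
also_today=$(date +%Y)
echo "backticks: $today   dollar-parens: $also_today"

# $( ) nests without extra escaping: the name of the current directory.
here=$(basename "$(pwd)")
echo "current directory name: $here"
```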

=head1 Bash

A commonly used shell (although there are many) is bash.  Besides just running regular commands, it also supports setting/retrieving of variables and loops and conditionals.  The man page is extensive.

=head2 Variables

  foo=Bar
  echo "my foo var = '$foo'"

We won't discuss it here, but bash supports variable mangling (parameter expansion), e.g.

  foo=blah.stuff.bar
  echo $foo
  echo ${foo%%.*}
  echo ${foo##*.}
  echo ${foo#*.}
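
For reference, here is a sketch of what those expansions produce, plus the one remaining combination (% with a single percent sign). The variable names on the left are made up for this example.

```shell
foo=blah.stuff.bar
first=${foo%%.*}   # drop the longest suffix matching ".*"   -> blah
last=${foo##*.}    # drop the longest prefix matching "*."   -> bar
rest=${foo#*.}     # drop the shortest prefix matching "*."  -> stuff.bar
stem=${foo%.*}     # drop the shortest suffix matching ".*"  -> blah.stuff
echo "$first $last $rest $stem"
```

A handy way to remember it: # works from the front, % from the back, and doubling the character makes the match greedy.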

=head2 Loops

  for s in foo bar stff ; do echo s=$s ; done

  for s in foo bar stff
  do
        echo s=$s
  done

  for n in `seq 1 5` ; do touch /tmp/f$n.txt ; done
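
The Bash section mentions conditionals without showing one. Here is a minimal sketch of an if test inside a loop, assuming a few scratch files in a temporary directory (the directory and file names are invented for this example):

```shell
dir=$(mktemp -d)
touch "$dir/f1.txt" "$dir/f3.txt"

found=""
for n in 1 2 3 ; do
  if [ -e "$dir/f$n.txt" ] ; then
    found="$found $n"            # remember which numbers have a file
  else
    echo "f$n.txt is missing"
  fi
done
echo "present:$found"

rm -r "$dir"
```

See "man test" for the other checks that can go inside the square brackets.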

=head1 Text Processing

There are three powerful interpreters that can be used to filter text. The man pages for each contain a wealth of information.

=head2 sed

Useful & efficient for substitutions.

  sed s/1/AAAA/g /etc/hosts

=head2 awk

Useful for working with columns.

  awk '{print $2,$1}' /etc/hosts
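
awk can also do arithmetic on a column. A small sketch, assuming whitespace-separated input with a label in column 1 and a number in column 2 (the data is made up):

```shell
# Sum column 2 and print the total and the average.
# NR is awk's built-in count of lines read so far.
stats=$(printf '%s\n' 'a 2' 'b 4' 'c 6' |
  awk '{ sum += $2 } END { print sum, sum / NR }')
echo "$stats"
```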

=head2 perl

Useful for everything :)  We'll come back to it in a moment, but here are examples that serve as replacements for many of the above commands.

  # cat
  perl -pe '' $f

  # sed s///
  perl -pe 's/1/AAAA/g' $f

  # cut/awk
  perl -lane 'print $F[1], " ", $F[0]' $f

  # grep
  perl -ne 'print if /foo/' $f

  # head
  perl -ne 'print if $. <= 10' $f

=head1 Regular Expressions

What is a regular expression (regex)?  It is just a pattern of something you want to match in a string.  And that pattern can be anything, simple or very complex.

What uses them? grep/egrep, sed, vi, and perl (and other languages). Note that there are several different "flavors" of regex depending on what's using it, but they are all more-or-less the same.  We will focus on perl regex.

  man perlretut
  man perlre

Regular expressions can be scary at first so we will try to look at them from a general overview:

=head2 Matching

  /a/

The I</>'s simply mark our pattern (note that perl can use anything for the delimiters with the I<m//> operator, e.g. I<m#a#>, I<m!a!>) and the I<a> is what we're matching, which is just the lower-case letter 'a'.

  /a*b/

This is 0 or more 'a' followed by a 'b'.

  /a+b/

This is one or more 'a' followed by a 'b'.

  /a\+b/

This is literally "a+b" -- the backslash is used to escape otherwise special characters.

  /Number: \d+ Some word: \w+/

This is a string that includes a number and a word, e.g. "Blah Number: 1234 Some word: foo1bar Blah"
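
The simpler patterns above can be tried from the shell with grep -E, whose extended regex syntax agrees with Perl's for these cases (the Perl shorthands \d and \w are left out here). The sample strings are invented:

```shell
sample=$(printf '%s\n' 'b' 'aab' 'a+b' 'xyz')

star=$(echo "$sample" | grep -E 'a*b')       # zero or more a, then b
plus=$(echo "$sample" | grep -E 'a+b')       # one or more a, then b
literal=$(echo "$sample" | grep -E 'a\+b')   # a literal "a+b"

echo "a*b matches:";  echo "$star"
echo "a+b matches:";  echo "$plus"
echo "a\\+b matches:"; echo "$literal"
```

Note that a*b also matches the line "a+b", because the b on its own already satisfies "zero or more a followed by b".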

=head2 Substitution

Expressions can be replaced with new values using the I<s///> substitution operator:

  s/a/b/

Replaces an 'a' with 'b'

  s/a/b/g

Replaces all 'a' with 'b'

  s/a/b/ig

Replaces all 'a' or 'A' with 'b'

  s/n=(\d+)/N($1)/

Changes "n=1234" to "N(1234)".  When there are parentheses in the pattern, they are used for grouping and for capturing -- the first set of parens becomes $1, the second $2, and so on.
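
The same capture idea is available in sed, sketched here with sed -E. In sed the captured groups are written \1, \2 in the replacement, and [0-9] stands in for Perl's \d. The input strings are invented:

```shell
# One group: wrap the captured digits in N( ).
out=$(echo 'n=1234' | sed -E 's/n=([0-9]+)/N(\1)/')
echo "$out"

# Two groups: the captured values can be reused in any order.
swapped=$(echo 'x=10 y=20' | sed -E 's/x=([0-9]+) y=([0-9]+)/x=\2 y=\1/')
echo "$swapped"
```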

=head2 More Regex

This has barely scratched the surface, but we will see example usage of more regex components below.

=head1 Perl

The first place to start with command-line perl is the perlrun manpage, and looking at & copying/using one-liner examples.

  perl -e 'print "hello world\n"'

Using I<-p> to loop through a file and print each line:

  f=/tmp/datafile.txt
  perl -pe '' $f
  perl -pe 's/a/BBBBB/' $f
  perl -pe 's/a/BBBBB/g' $f

Using I<-n> to loop through a file and look at each line:

  perl -ne '' $f
  perl -ne 'print' $f
  perl -ne 'print $_' $f
  perl -ne 'print if /a/' $f
  perl -ne 'print "$.)" . $_' $f
  perl -ne 'print "$.)" . $_ if $. % 2 == 0' $f

Some things seen so far:

=over 2

=item $_

This is one of many special variables (see man perlvar) that perl has. It is perhaps the most special because it is the "default" -- whenever you don't supply a command with something it assumes you want to use $_.

=item if(){}

Basic IF clause in perl -- similar to other languages.  I<if( ... ){ ... }elsif( ... ){ ... }else{ ... }>

=item ... if ... ;

Perl lets you short-hand simple if statements by reversing the order, which is also nice because it's fewer lines (and no curlies) and can be more natural to read.  Perl also provides I<unless> which is simply a shortcut for I<if(!( ... ))>.

  print "ok" if $ok;
  print "bad" if ! $ok;
  print "bad" unless $ok;
  while( ... ){
    next unless ... ;
    last if ... ;
  }

=item $.

This is another special variable (see man perlvar) that is the current line number when reading in a file.

=back

So now we can take a closer look at this:

  perl -ne 'print if /a/' $f

And write it more explicitly in several ways to demonstrate the syntax:

  perl -ne 'print $_ if /a/' $f
  perl -ne 'if( /a/ ){ print $_ }' $f
  perl -ne 'print $_ if $_ =~ /a/' $f
  perl -ne 'print $_ unless $_ !~ /a/' $f

Here is a good time to note that the unofficial Perl motto is B<TMTOWTDI> (There's More Than One Way To Do It).

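Another powerful command-line option is I<-a> to auto-split each line into the I<@F> array, much like cut & awk do.  Combined with I<-F> it splits on a separator of your choosing instead of whitespace.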


=head1 Examples

=head2 A geometry file needs to become many files

  split --lines=30 geoms.xyz /tmp/g___
  for f in /tmp/g___* ; do
    d=`head -1 $f | sed 's/^\**//'`
    mkdir -p blah/$d
    tail -n +2 $f > blah/$d/geom
  done

=head2 Rename a bunch of .tpl files, dropping the extension

  for n in `seq 1 3` ; do touch f$n.tpl ; done

  for f in *.tpl ; do mv $f ${f%.tpl} ; done

  ls *.tpl | perl -ne 'chomp;$f0=$_;s/\.tpl$//;print "mv $f0 $_\n"'
  ls *.tpl | perl -pe 's/^(.+)(\..*)/mv $1$2 $1/'

=head2 Get the first & fourth numbers from certain lines of a file

In this file each data entry takes 13 lines, and the second line of an entry starts with BOMD followed by numbers.  The task is to pick the first (-264.05765232) and the fourth (0.00000000000) number from each BOMD line and write them to a new file.

  grep '^ BOMD' deMon.mol | awk '{print $3, $6}'  > deMon.mol.filtered
  grep '^ BOMD' deMon.mol | perl -alne 'print "$F[2] $F[5]"' > deMon.mol.filtered
  perl -alne 'print "$F[2] $F[5]" if $F[0] eq "BOMD"' deMon.mol > deMon.mol.filtered

=head2 Get the number of days between two dates

  perl -MDate::Calc=Delta_Days -le 'print Delta_Days(2005,9,16, 2006,2,28)'

=head2 Display a web page's source

  perl -MLWP::Simple -e "print get(shift)" http://www.perlmonks.org
  wget -O - http://www.perlmonks.org
  lynx --source http://www.perlmonks.org

=head2 Get lines N -> M of a file

These examples show how to display lines 5-8, inclusive, from the /etc/passwd file:

   head -8 /etc/passwd | tail -4
   tail -n +5 /etc/passwd | head -4
   perl -ne 'print if 5<=$. && $.<=8' /etc/passwd
   # man perlvar   for explanation of $.
   cat -n /etc/passwd | perl -ne 'print if s/^\s*[5678]\s+//'

Now, to get the lines from /etc/passwd starting at a line with
"news" in it, and stopping at a line with "ftp" in it, these all work
(all the same except ordering, which determines whether or not the
start and/or end lines are included):

=over 1

=item [start, end]

    perl -ne '$ok||=/news/; print if $ok; $ok=0 if /ftp/' /etc/passwd

=item [start,end)

    perl -ne '$ok||=/news/; $ok=0 if /ftp/; print if $ok' /etc/passwd

=item (start,end]

    perl -ne 'print if $ok; $ok||=/news/; $ok=0 if /ftp/' /etc/passwd

=item (start,end)

    perl -ne '$ok=0 if /ftp/; print if $ok; $ok||=/news/' /etc/passwd

=back

The basic approach is to take advantage of -n (man perlrun) and flip a
flag on/off at the boundaries.  Note that the /news/ is a regex, and
can take complex patterns (man perlre).
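
For comparison, sed can print line ranges by number or by pattern. This is an alternative to the Perl flag trick rather than a replacement for it: a sed pattern range always includes both boundary lines, so only the [start,end] case comes for free. The sketch uses generated input instead of /etc/passwd:

```shell
# Lines 5-8 of a ten-line input (tr joins them onto one line for display).
by_number=$(seq 1 10 | sed -n '5,8p' | tr '\n' ' ')
echo "$by_number"

# From the first line matching "news" through the next line matching "ftp".
by_pattern=$(printf '%s\n' root news mail ftp nobody |
  sed -n '/news/,/ftp/p' | tr '\n' ' ')
echo "$by_pattern"
```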

=head2 Lazy math

  perl -le 'print( 3+5 )'   # the parens are optional for this simple case

There is also the 'bc' command.

=head2 Make & use a program to sum numbers

  alias add="perl -lne '\$x+=\$_; END{print \$x}'"
  cut -f1 file1 | add
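
Aliases are normally only expanded in interactive shells, so inside a script a shell function is the safer spelling. Below is a sketch of the same adder as a function. It uses awk instead of perl, which is my substitution and not what the talk uses, and the name add_col is made up to avoid clashing with the alias.

```shell
# Sum the first field of every input line.
# "x + 0" makes awk print 0 rather than an empty line on empty input.
add_col() {
  awk '{ x += $1 } END { print x + 0 }'
}

total=$(printf '%s\n' 1 2 3 | add_col)
echo "$total"
```

It is used the same way as the alias, e.g. "cut -f1 file1 | add_col".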

=head2 Perl One-liners

=over 2

=item Favourite One-liners?

L<http://perlmonks.org/?node_id=470397>

A web server!

  perl -MIO::All -e 'io(":8080")->fork->accept->(sub { $_[0] < io(-x $1 ? "./$1 |" : $1) if /^GET \/(.*) / })'

dos2unix

  perl -pi -e 's/\r//' filename

=item What one-liners do people actually use?

L<http://perlmonks.org/?node_id=515336>

=item One Liners

L<http://perlmonks.org/?node_id=421195>

=back

=head1 Reference Material

=over 2

=item man

Also note that the command I<apropos> searches man pages.

=item man perl

Which is basically a table of contents for the many perl man pages. Of particular interest are these manpages: perl perlrun perlsyn perlfunc perlre perlretut

=item perldoc

Displays documentation for everything perl.

  perldoc -f sleep
  perldoc perlfunc
  perldoc -q how
  perldoc File::Find

=item CPAN

L<http://search.cpan.org> is one of Perl's great strengths -- it is a huge repository of modules (libraries) to do pretty much anything and everything with perl.

=item Perl Monks

L<http://perlmonks.org> is a great Perl community site.  The knowledge base of the forums, tutorials, and FAQs is very extensive and the members are very open & willing to help with any level (complete beginner through guru) question.

This talk is posted at http://perlmonks.org/?node_id=553278

=item ME!

I love to help with this stuff -- it's my vocation & hobby.  I'm reachable at E<lt>dwestbrook@gmail.comE<gt> or as davidryan0 on AIM.

=back

=cut

Re: RFC: Text Processing for Chemists Tutorial by kvale (Monsignor) on Jun 02, 2006 at 15:31 UTC
This is a lot of information to absorb in just one hour. You are progressing from the simplest Unix commands to a web server. Chemistry students are bright, but I don't think anyone starting from complete ignorance of Unix, and perhaps programming, is going to be able to understand all of this in real time. I taught a Perl for Bioinformatics class a few years ago that also was an intro to the Linux CLI. We had 10 hour-long lectures along with 10 two-hour labs. At the end of the course, most students could write simple programs, but maybe 10% managed to clue into Perl's real power. Especially at the very beginning, programming is hard. So I'd recommend at the very least a handout of your talk (not in raw POD) so that they can play with your examples on their own. And for the first-time regexer, I'd recommend perlrequick rather than perlretut. It is written more simply and is less overwhelming. -Mark
Re^2: RFC: Text Processing for Chemists Tutorial by davidrw (Prior) on Jun 02, 2006 at 15:52 UTC
Yeah, it was definitely a lot... but the goal was exposure -- "there exists this set of commands to do your work for you" and not actually being able to talk the talk and go script stuff. Luckily the whole audience (~dozen) had at least basic linux/programming experience so it wasn't starting from "this is a prompt" or anything. They seemed to be following the two real-life chemistry data file examples at least. I used (my own hacked up version since it doesn't have much in terms of config options) Pod::Pdf to create a nice PDF version to print up for them. Cool -- I actually didn't know about perlrequick -- I knew I'd learn something from replies to this post! :)
Re: RFC: Text Processing for Chemists Tutorial by planetscape (Chancellor) on Jun 03, 2006 at 13:41 UTC
The online book, Data-Intensive Linguistics, by Chris Brew and Marc Moens, has a nice introduction to using UNIX tools for linguistic processing. You may wish to have a look at it and adapt some of what's there, or just link to it in a "Resources" or "Further Reading" section. As a member of pedagogues, I also want to thank you for thinking of contributing to our Tutorials section. Once you've hammered out a final version based on comments in this thread, you should, IMHO, feel free to post it as a Tutorial. (Maybe after running it through pod2html or something, though...) :-)

HTH,
planetscape

Update: This link, on the page noted above, may also be useful...