Lawliet has asked for the wisdom of the Perl Monks concerning the following question:

if ($table->cell($rownum, 0..2) =~ /\xa0/) { s/\xa0\d+/ /; } else { s/\xa0//; }

Will the above substitutions replace the desired pattern with the desired..er..pattern? What I mean is, do I need three separate if/else control structures to do the above successfully, or is doing the above the same as using $_ =~ s/.../../ ?

Is there a more elegant way of doing this?

Update: Heh, thanks GrandFather. Not really sure why I confused loops and control constructs.

Update2: I'll post more code to lessen the confusion. I forgot that most monks cannot read my mind.

#!/usr/bin/perl -w
use strict;
use HTML::TableExtract;
use DBI;

my $te = HTML::TableExtract->new();
$te->parse_file("arbitraryname.html");
my $table = $te->first_table_found;

my @totalrows;
push @totalrows, $_ foreach $table->rows();

my (@title, @teach, @aides);
foreach my $rownum (0..$#totalrows) {
    # A cell can be called by $table->cell(row,column)
    if ($table->cell($rownum, 0..2) =~ /\xa0/) {
        s/\xa0\d+/ /;
    } else {
        s/\xa0//;
    }
    push @title, $table->cell($rownum, 0) ? $table->cell($rownum, 0) : '';
    push @teach, $table->cell($rownum, 1);
    push @aides, $table->cell($rownum, 2) ? $table->cell($rownum, 2) : '';
}

foreach my $ele (0..$#title) {
    print "$title[$ele] - $teach[$ele] - $aides[$ele]\n"; # Testing output before I uncomment database section

=for comment
    # inserting into a database, etc

=cut

}

Update3: Ok, now for the whole story.

I am parsing through an html table converted from a pdf file that listed names and job titles. What makes this annoying is that each table cell contains more than one name, and not every row has a job title. Example:

<html><table align="center" border="0" cellpadding="2" cellspacing="0"><tbody>
<tr><th align="center" height="24" valign="middle" width="171">Boss </th><th colspan="2" align="left" height="24" valign="middle" width="421">Firstname Surname </th></tr>
<tr><td align="center" height="23" valign="middle" width="171">Secretary </td><td colspan="2" align="left" height="23" valign="middle" width="421">Name Surname, Mr Jones Smith </td></tr>
<tr><td align="center" height="23" valign="middle" width="171">Medical Doctor </td><td colspan="2" align="left" height="23" valign="middle" width="421">Bob&nbsp;Middlename Hope </td></tr>
<tr><td align="center" height="23" valign="middle" width="171">Position 1 </td><td align="center" height="23" valign="middle" width="202">Worker </td><td align="center" height="23" valign="middle" width="219">Secretary </td></tr>
<tr><td height="45" valign="top" width="171"></td><td align="left" height="45" valign="top" width="202">Asdf Ghjk </td><td align="left" height="45" valign="middle" width="219">Name Lastname, First Last </td></tr>
<tr><td height="68" valign="top" width="171"></td><td align="left" height="68" valign="top" width="202">Sally&nbsp;Mally </td><td align="left" height="68" valign="top" width="219">Joe Smoe, The Who, Will Timberland </td></tr>
<tr><td align="center" height="23" valign="middle" width="171">Position 2 </td><td align="left" height="23" valign="middle" width="202">Paula Simon </td><td align="left" height="23" valign="middle" width="219">Raymonde Maalouf </td></tr>
</html>

The file follows this format, with three columns: one for the title of the position, then a person's name(s), then that person's secretary(ies). I am trying to extract all three elements (only two for the first few rows) and insert them into a database as such:

my $dbh = DBI->connect("DBI:mysql:$dbname:$dburl", "$dbuser", "$dbpass")
    or die "Could not connect";
my $sth = $dbh->prepare("INSERT INTO $dbtable (position, name, email) VALUES (?, ?, ?)")
    or die "Could not prepare";
$sth->execute($position, $name, $email)
    or die "Could not execute";
$sth->finish();
$dbh->disconnect;
}

With the above example, please note that the position will either be Position 1, or Position 1 Secretary, depending on the column, and that the real position name is random. Also note that I can generate their email address easily, and that it is unrelated to the problem. I just wanted to include that in case anything came up.

Oh, and just remembered, the regex is there to strip the &nbsp;'s and replace them with a space if it is before the last name, or nothing if it is at the end of the name (I meant to use /...\w/, not /...\d/). I used \xa0 at the time of writing because that is what I thought I had to strip (today is not my day :\).

I'm so adjective, I verb nouns!

chomp; # nom nom nom
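
The rule described in the last update (a space where the non-breaking space sits before another word, nothing where it trails the name) could be sketched roughly as below. This is only an illustration: clean_nbsp is a hypothetical helper name, and it assumes the cell text already contains literal \xa0 characters rather than the &nbsp; entity.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: normalise non-breaking spaces.
sub clean_nbsp {
    my ($text) = @_;
    $text =~ s/\xa0(?=\w)/ /g;    # nbsp followed by a word character -> space
    $text =~ s/\xa0//g;           # any remaining nbsp -> removed
    return $text;
}

# Sample strings are made up for the demonstration.
my @samples = ("Bob\xa0Middlename Hope", "Sally\xa0Mally", "Trailing\xa0");
print clean_nbsp($_), "\n" for @samples;
```

Working on a copy inside the sub sidesteps the question of whether $_ or the cell itself is being edited.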

Replies are listed 'Best First'.
Re: Will a substitution in an if/else loop default to $_?
by kyle (Abbot) on Aug 21, 2008 at 03:33 UTC

    What happens when you try it?

    If you want to avoid an if/else for the two similar cases, you can do a substitution like this:

    s/\xa0(\d)*/defined $1 ? ' ' : ''/e;

    If you want to loop over several values, doing this to each of them, you can do it like this:

    for ( $x, $y, $z ) { s/\xa0(\d)*/defined $1 ? ' ' : ''/e; }

    That should work also for some method that returns a list, but it will die if that list contains a read-only value (such as a literal).

    I don't know if this answers your question, but I hope you find it useful anyway.
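
A small self-contained sketch of the loop kyle describes, using made-up variables in place of real cell values. It relies on foreach aliasing $_ to each element, which is why the variables themselves change and why a literal in the list is a problem.

```perl
use strict;
use warnings;

# Made-up values standing in for three cells.
my ($x, $y, $z) = ("Bob\xa0" . "42", "Sally\xa0", "plain");

# foreach aliases $_ to each variable in turn, so s/// edits them in place.
for ($x, $y, $z) {
    s/\xa0(\d)*/defined $1 ? ' ' : ''/e;
}
# $x is now "Bob ", $y is "Sally", $z is unchanged.

# A literal in the list is read-only; a successful substitution on it dies,
# so the attempt is wrapped in eval here.
my $ok = eval {
    for ("literal\xa0") { s/\xa0// }
    1;
};
print "died: $@" unless $ok;
```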

Re: Will a substitution in an if/else control structure default to $_?
by GrandFather (Saint) on Aug 21, 2008 at 04:50 UTC

    Yup, as I thought - the code is bogus. $table->cell (...) returns the contents of a single cell, not an array. But you already have all the table information in @totalrows. Consider:

    use strict;
    use warnings;
    use HTML::TableExtract;

    my $te = HTML::TableExtract->new ();

    $te->parse (<<HTML);
    <table>
    <tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
    <tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
    </table>
    HTML

    my $table = $te->first_table_found;
    my @totalrows = $table->rows ();

    print "$_->[0] - $_->[1] - $_->[2]\n" for @totalrows;

    Prints:

    Cell 1 - Cell 2 - Cell 3
    Cell 4 - Cell 5 - Cell 6

    I've no idea what you are trying to achieve with the match and substitutions so I've left that stuff right out, but you can loop over the rows using for my $row (@totalrows) {...} where $row is a reference to the array of cells for the current row.

    Maybe you could provide a similar sample with your problem data as the HTML data so we can see what you are trying to do?

    Update: Changing the @totalrows assignment to:

    my @totalrows = map {my $cells = $_; $_ ||= '***' for @{$cells}[0 .. 2]; $cells} $table->rows ();

    sets "missing" cells to a default value ('***'). Running the updated code against the sample HTML generates:

    Boss - Firstname Surname - ***
    Secretary - Name Surname, Mr Jones Smith - ***
    Medical Doctor - Bob Middlename Hope - ***
    Position 1 - Worker - Secretary
    *** - Asdf Ghjk - Name Lastname, First Last
    *** - Sally Mally - Joe Smoe, The Who, Will Timberland
    Position 2 - Paula Simon - Raymonde Maalouf

    Perl reduces RSI - it saves typing

      How should I go about formatting the output as:

      Boss - Firstname Surname
      Secretary - Name Surname, Mr Jones Smith
      Medical Doctor - Bob Middlename Hope
      Position 1 - Asdf Ghjk - Name Lastname, First Last
      Position 1 - Sally Mally - Joe Smoe, The Who, Will Timberland
      Position 2 - Paula Simon - Raymonde Maalouf

      To those who cannot pick up the subtlety, I need the position always displayed (so I can loop through each row and insert into the database). I was thinking of inserting each new position into an array, then if there is no current position, using $array[-1]. Would this work?

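
The carry-forward idea asked about above can be sketched as follows. This is only an illustration under assumptions: the rows are hard-coded stand-ins for what $table->rows() might return, an empty position is assumed to arrive as undef or an empty string, and a single scalar ($current) is used in place of the array and $array[-1]. It does not decide which rows are sub-headers (such as "Position 1 - Worker - Secretary"); that would need a separate rule.

```perl
use strict;
use warnings;

# Hypothetical rows standing in for $table->rows(); undef marks an empty cell.
my @rows = (
    [ 'Position 1', 'Worker',      'Secretary' ],
    [ undef,        'Asdf Ghjk',   'Name Lastname, First Last' ],
    [ undef,        'Sally Mally', 'Joe Smoe, The Who, Will Timberland' ],
    [ 'Position 2', 'Paula Simon', 'Raymonde Maalouf' ],
);

my $current = '';    # most recent non-empty position seen so far
my @records;
for my $row (@rows) {
    # Treat missing cells as empty strings.
    my ($position, $name, $aides) = map { defined $_ ? $_ : '' } @{$row}[0 .. 2];
    $current = $position if length $position;
    push @records, [ $current, $name, $aides ];
}

print join(' - ', @$_), "\n" for @records;
```

Each entry in @records then carries a position, so a later loop can pass the three values straight to an insert statement.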

Re: Will a substitution in an if/else loop default to $_?
by GrandFather (Saint) on Aug 21, 2008 at 03:28 UTC
    if ($table->cell($rownum, 0..2) =~ /\xa0/) {

    acts on the result returned from $table->cell($rownum, 0..2). The two subsequent substitutions (only one of which is performed for a particular pass through the code) will act on $_ - which has not been affected by the code shown. It may be that what you really want is something like:

    my $cellStr = $table->cell($rownum, 0..2);
    if ($cellStr =~ /\xa0/) {
        $cellStr =~ s/\xa0\d+/ /;
    } else {
        $cellStr =~ s/\xa0//;
    }

    or you could use the default variable:

    $_ = $table->cell($rownum, 0..2);
    if (/\xa0/) {
        s/\xa0\d+/ /;
    } else {
        s/\xa0//;
    }

    Oh, and if/else is not a loop! It is however a control structure.

    Update: having looked at your code just a little longer - What are you trying to do! Are you trying to edit the string returned by the method call, or is there a larger context in which your code actually makes sense and you do want to edit the contents of the default variable (which must be set as a side effect of the call - very nasty!)?


    Perl reduces RSI - it saves typing
      if (/\xa0/) { s/\xa0\d+/ /; } else { s/\xa0//; }

      If the match in the condition expression of the if-statement for \xa0 somewhere in the string (whether the regex is bound to the string in $_ or $cellStr) fails, how can any substitution involving that character be made in the false clause of the if-statement?

      If the match succeeds, isn't it also redundant since the \xa0 character also appears in the substitution regex in the true clause of the if-statement?

      Wouldn't $cellStr be an array if I am assigning three different columns: 0, 1, and 2?


        If it is an array (reference) then the match will stringify the reference and you will end up trying to match against something of the form 'ARRAY(0x1f7c3e4)'. Probably not what you want. Maybe you need to show us the bigger picture because right now what you seem to be doing is completely bogus.


        Perl reduces RSI - it saves typing
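
To make the stringification point concrete, here is a minimal sketch with a made-up array reference standing in for a row of cells. It only shows what a match against the reference itself sees; it says nothing about what HTML::TableExtract actually returns.

```perl
use strict;
use warnings;

my $cells = [ 'Boss', "Bob\xa0Hope", '' ];    # hypothetical row of cells

# The reference stringifies to something like ARRAY(0x...), so the
# pattern never sees the cell contents.
print "reference as string: $cells\n";
print "no nbsp found in the stringified reference\n" unless $cells =~ /\xa0/;

# Matching has to be done against the elements themselves.
my @with_nbsp = grep { /\xa0/ } @$cells;
print scalar(@with_nbsp), " cell(s) contain a non-breaking space\n";
```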
Re: Will a substitution in an if/else loop default to $_?
by dragonchild (Archbishop) on Aug 21, 2008 at 03:35 UTC
    A substitution will always default to whatever the current value of $_ is. I think your real question is "If I use something as the operand of a match operator, does that set $_ to it?" The answer to that is no.

    What it looks like you're trying to do is take an lvalue to the result of $table->cell($rownum, 0..2). This is, generally, not something that's recommended to be done (though it is doable, kinda). Much better would be to have the $table object be asked to normalize the value that would be returned by $table->cell($rownum, 0..2) within the object itself. That way, you're probably using some sort of hash value. Even then, you can't do the whole $_ lvalue thing. Honestly, I don't like code that overuses $_. Just be explicit about what you're doing.

    Now, instead of trying to reuse the operand, you could make the match and substitution a little smarter.

    my $strip = '\xa0';
    $strip .= '\d+' if $table->cell($rownum, 0..2) =~ $strip;
    $table->strip( $strip, $rownum, 0..2 );
    That requires coding up another method that would do something like
    # The ... indicates "Whatever" - it's not meant to be syntactically correct Perl.
    sub strip {
        my $self = shift;
        my ($pattern, ... ) = @_;
        $self->{cells}[...] =~ s/$pattern//;
    }

    See how that could work better?


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
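
The first point in the reply above (binding a value to a match with =~ does not set $_) can be checked with a few lines. The strings here are made up for the demonstration.

```perl
use strict;
use warnings;

my $cell = "Bob\xa0Middlename";    # hypothetical cell text
$_ = "untouched\xa0";              # whatever $_ happened to hold beforehand

if ($cell =~ /\xa0/) {
    s/\xa0//;    # no binding: this edits $_, not $cell
}

print "cell still contains a non-breaking space\n" if $cell =~ /\xa0/;
print "\$_ was edited instead: '$_'\n";

# Being explicit about the target avoids the surprise.
$cell =~ s/\xa0/ /;
print "after explicit binding: '$cell'\n";
```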
Re: Will a substitution in an if/else control structure default to $_?
by massa (Hermit) on Aug 21, 2008 at 09:44 UTC
    You do know that the second (else) substitution will never happen, don't you? The result of $table->cell($rownum, 0..2) does not have \xa0 in it (it didn't match the first match). So, you can simplify your code and write
    $table->cell($rownum, 0..2) =~ s/\xa0\d+/ /;
    (hoping your method ->cell() returns an lvalue, of course)
    Update: after looking at your code and TableExtract docs, I think what you want is something like:
    foreach my $rownum (0..$#totalrows) {
        s/\xa0\d+/ /, push @title, $_ for $table->cell($rownum, 0) || '';
        s/\xa0\d+/ /, push @teach, $_ for $table->cell($rownum, 1) || '';
        s/\xa0\d+/ /, push @aides, $_ for $table->cell($rownum, 2) || ''
    }
    []s, HTH, Massa (κς,πμ,πλ)