clegane has asked for the wisdom of the Perl Monks concerning the following question:

I have an HTML table I'm parsing. One of the columns in the table is populated with fields that look like this:

1. Network-Time, Protocol: TCP, Source Port: 0-65535, Destination Port: 13-13
2. Network-Time-1, Protocol: UDP, Source Port: 0-65535, Destination Port: 13-13
3. Network-Time-2, Protocol: TCP, Source Port: 0-65535, Destination Port: 37-37
4. Network-Time-3, Protocol: UDP, Source Port: 0-65535, Destination Port: 37-37
5. Network-Time-4, Protocol: UDP, Source Port: 0-65535, Destination Port: 123-123

I'm wanting to parse this into an array that contains elements that look like this:

@array=(TCP:13-13,UDP:13-13,TCP:37-37,UDP:37-37,UDP:123-123)

What would be an elegant way to do this? Here's what I've got so far, but it's not working.

@svcDesc = ($htmlStream->get_trimmed_text('/td') =~ /Protocol:\s((TCP|UDP)[^\n]+Destination\sPort:\s([0-9\-]+))/g);

Thank you!

Replies are listed 'Best First'.
Re: Regex Question
by talexb (Chancellor) on Sep 10, 2013 at 18:07 UTC

    When dealing with regular expressions, my approach is to do the simplest thing that works.

    In your case, this means looking for three capital letters, followed by some stuff, and then a two or three digit number, a dash, and the same two or three digit number.

    So the regexp I'd use would be

    /([A-Z]{3}).+(\d{2,3})-\2/
    This makes the assumption that the two numbers at the end are the same. If they might be different, you'd have to change the second capture to something like
    /([A-Z]{3}).+(\d{2,3}-\d{2,3})/
    When in doubt, try the simplest thing that could work. You're putting 'Protocol' and other stuff in there -- throw it away -- you don't need it. :)

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
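One caveat with the second pattern above is worth checking before relying on it. The greedy `.+` consumes as much as it can, so the range capture may lose the leading digit of a three-digit port. The sketch below is only an illustration under that reading. It uses a sample line from the question, and the `\b` variant is a suggested adjustment, not something from the reply itself.

```perl
use strict;
use warnings;

# Sample line taken from the question's data.
my $line = '5. Network-Time-4, Protocol: UDP, Source Port: 0-65535, Destination Port: 123-123';

# Greedy .+ backs off only as far as it must, so the range capture
# starts at the latest position that still satisfies \d{2,3}-\d{2,3}.
my ($p1, $r1) = $line =~ /([A-Z]{3}).+(\d{2,3}-\d{2,3})/;

# A word boundary forces the capture to start at the first digit of the range.
my ($p2, $r2) = $line =~ /([A-Z]{3}).+\b(\d{2,3}-\d{2,3})/;

print "without \\b: $p1:$r1\n";   # UDP:23-123
print "with \\b:    $p2:$r2\n";   # UDP:123-123
```

Either way, a greedy `.+` finds only the last range on a line, so this style suits input with one protocol/port pair per line.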

      Talexb, thanks for the reply. Please see my other reply to keszler. The protocol and destination ports are in there b/c that's how I can pick them out. The data can follow this structure:

      something, Protocol: UDP, something, Destination Port: x-y, something, something, Protocol: TCP, something, Destination Port: a-b, something

      So, reliably picking them out can be challenging without the field names. Also, this data I'm working with doesn't always follow this format, so I'm needing to ignore the data that doesn't.

      ... and/or do it in stages: write one regex to grab an entire line and then a second regex to break down what the first one grabbed.
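The staged approach might look something like the sketch below. It is a rough illustration, not anyone's posted solution: `extract_pairs` is a hypothetical helper name, stage one uses `split` rather than a regex to isolate each line, and the sample text is made up to follow the structure described in the thread.

```perl
use strict;
use warnings;

# Stage 1: break the cell into lines.
# Stage 2: pull every Protocol / Destination Port pair out of each line.
# extract_pairs() is a hypothetical helper, not part of any module.
sub extract_pairs {
    my ($cell) = @_;
    my @pairs;
    for my $line (split /\n/, $cell) {
        # Lines that do not follow the format simply produce no matches.
        while ($line =~ /Protocol:\s+(TCP|UDP)\b.*?Destination\s+Port:\s+(\d+-\d+)/g) {
            push @pairs, "$1:$2";
        }
    }
    return @pairs;
}

my $cell = join "\n",
    '1. Network-Time, Protocol: TCP, Source Port: 0-65535, Destination Port: 13-13',
    'a line that does not follow the format',
    'x, Protocol: UDP, y, Destination Port: 37-37, z, Protocol: TCP, w, Destination Port: 123-123';

print join(',', extract_pairs($cell)), "\n";   # TCP:13-13,UDP:37-37,TCP:123-123
```

Working one line at a time keeps `.*?` from pairing a protocol on one line with a port on another, and the `while ... /g` loop picks up any number of pairs within a line.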
Re: Regex Question
by kcott (Archbishop) on Sep 10, 2013 at 19:46 UTC

    G'day clegane,

    This appears to do what you want:

    $ perl -Mstrict -Mwarnings -le '
        my $cell = <<EOD;
    <td>
    1. Network-Time, Protocol: TCP, Source Port: 0-65535, Destination Port: 13-13
    2. Network-Time-1, Protocol: UDP, Source Port: 0-65535, Destination Port: 13-13
    3. Network-Time-2, Protocol: TCP, Source Port: 0-65535, Destination Port: 37-37
    4. Network-Time-3, Protocol: UDP, Source Port: 0-65535, Destination Port: 37-37
    5. Network-Time-4, Protocol: UDP, Source Port: 0-65535, Destination Port: 123-123
    </td>
    EOD
        my $re = qr{Protocol:\s+(\w+).*?Destination Port:\s+(\S+)}m;
        my @extract;
        push @extract, join(":", $1, $2) while $cell =~ /$re/g;
        print "@extract";
    '
    TCP:13-13 UDP:13-13 TCP:37-37 UDP:37-37 UDP:123-123

    Notes:

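    • The 'm' modifier treats the string as multiple lines: it changes where '^' and '$' match. This pattern has no anchors, so the modifier is harmless here but not strictly required.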
    • The '.*?' handles the greediness issue.
    • I don't know what "$htmlStream->get_trimmed_text('/td')" does. It may be removing leading and trailing whitespace from each cell; it may be removing the '<td>' and '</td>' tags; it may be doing both of those; it may be doing something else. Regardless, it might be unnecessary with the solution I've provided: I added the tags to the test data to show this can handle an entire, multiline "<td>...</td>" block.

    -- Ken
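To make the greediness note above concrete, here is a small comparison of `.*` and `.*?` on a single line that holds two protocol/port pairs. The sample line is invented to match the structure described elsewhere in the thread, and the port capture uses `[\d-]+` instead of `\S+` so that a trailing comma is not picked up. Treat it as a sketch, not as a drop-in replacement.

```perl
use strict;
use warnings;

my $line = 'Protocol: UDP, x, Destination Port: 13-13, y, '
         . 'Protocol: TCP, z, Destination Port: 37-37';

# Greedy: .* runs to the end of the line and backtracks to the LAST port.
my ($greedy_proto, $greedy_port) =
    $line =~ /Protocol:\s+(\w+).*Destination Port:\s+([\d-]+)/;

# Non-greedy: .*? stops at the FIRST port after the protocol.
my ($lazy_proto, $lazy_port) =
    $line =~ /Protocol:\s+(\w+).*?Destination Port:\s+([\d-]+)/;

print "greedy: $greedy_proto:$greedy_port\n";   # UDP:37-37
print "lazy:   $lazy_proto:$lazy_port\n";       # UDP:13-13
```

The greedy version pairs the first protocol with the last port, which is why the non-greedy quantifier matters once a line can hold more than one pair.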

Re: Regex Question
by keszler (Priest) on Sep 10, 2013 at 17:07 UTC
    push @svcDesc, "$1:$2" while ($htmlStream->get_trimmed_text('/td') =~ /Protocol:\s(TCP|UDP)[^\n]+Destination\sPort:\s([0-9\-]+)/g);
    Note that I removed the outer set of parens in the regex.

      This is close. I should have pointed out that each cell can contain multiple lines. The sample text I posted was one cell. I'm having trouble with greediness/unknown numbers of occurrences in a single line. One line might contain

      something, Protocol: UDP, something, Destination Port: x-y, something, something, Protocol: TCP, something, Destination Port: a-b, something

      I need to extract each protocol/port pair into an array, and there's an unknown number. Protocol always comes before port, but not always immediately.

      I've got the following going, but it's only getting the first match. If I remove the '?', it only grabs the last match. I need it to get them all. Adding a 'g' at the end doesn't seem to help.

      push @svcDesc, "$1:$2" while ($htmlStream->get_trimmed_text('/td') =~ /Protocol:\s(TCP|UDP).+?Destination\sPort:\s([0-9\-]+)/);
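A likely contributor to the "first match only" behaviour: the `while` condition is re-evaluated on every pass, so the method is called again each time and the regex never resumes on the same text. A minimal sketch of the usual workaround follows, assuming the cell text has already been fetched once into a plain variable (`$text` stands in for that value here, with made-up contents):

```perl
use strict;
use warnings;

# $text stands in for the cell contents, fetched once and stored.
my $text = 'a, Protocol: UDP, b, Destination Port: 13-13, c, '
         . 'Protocol: TCP, d, Destination Port: 37-37, e';

my @svcDesc;

# With /g in scalar context on the same variable, each pass of the loop
# resumes where the previous match ended, so every pair is collected.
push @svcDesc, "$1:$2"
    while $text =~ /Protocol:\s(TCP|UDP).+?Destination\sPort:\s([0-9\-]+)/g;

print "@svcDesc\n";   # UDP:13-13 TCP:37-37
```

Both pieces are needed: the `/g` modifier, and a variable that stays the same between iterations.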
Re: Regex Question
by clegane (Novice) on Sep 10, 2013 at 19:35 UTC

    Ok, I ended up figuring out the rest. It was a bunch of problems together. I had line breaks in the text, and get_trimmed_text was taking out tags and goofing up all sorts of stuff. Here's the solution I ended up with. Let me know if you have better ideas! Thanks for the help!!!

    $svcDesc = $htmlStream->get_phrase('/td');
    $svcDesc =~ s/\n/,/g;
    push @svcPorts, "$1:$2" while ($svcDesc =~ /Protocol:\s(TCP|UDP).+?Destination\sPort:\s([0-9\-]+)/g);
    s/:(\d+)\-\1/:$1/ for (@svcPorts);

      I would prefer to split into lines first and then extract the data:

      use strict;
      use warnings;

      my $cell = <<EOT;
      1. Network-Time, Protocol: TCP, Source Port: 0-65535, Destination Port: 13-13
      2. Network-Time-1, Protocol: UDP, Source Port: 0-65535, Destination Port: 13-13
      3. Network-Time-2, Protocol: TCP, Source Port: 0-65535, Destination Port: 37-37
      4. Network-Time-3, Protocol: UDP, Source Port: 0-65535, Destination Port: 37-37
      5. Network-Time-4, Protocol: UDP, Source Port: 0-65535, Destination Port: 123-123
      EOT

      my @svcDesc = map { s/.*(TCP|UDP).*?Destination Port: ([0-9]+).*/$1:$2/; $_ } split /\n/, $cell;
      print "@svcDesc\n";