Re: Regex Question
by talexb (Chancellor) on Sep 10, 2013 at 18:07 UTC
|
When dealing with regular expressions, my approach is to do the simplest thing that works.
In your case, this means looking for three capital letters, followed by some stuff, and then a two or three digit number, a dash, and the same two or three digit number.
So the regexp I'd use would be
/([A-Z]{3}).+(\d{2,3})-\2/
This makes the assumption that the two numbers at the end are the same. If they might be different, you'd have to change the second capture to something like
/([A-Z]{3}).+(\d{2,3}-\d{2,3})/
When in doubt, try the simplest thing that could work. You're putting 'Protocol' and other stuff in there -- throw it away -- you don't need it. :)
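As a quick check of the two patterns above, here is a minimal sketch run against sample lines. The sample strings are assumptions modelled on the data in this thread; the second one, with differing ports, is purely hypothetical.

```perl
use strict;
use warnings;

# Sample lines are assumptions for illustration, modelled on the thread's data.
my $same = '1. Network-Time, Protocol: TCP, Source Port: 0-65535, Destination Port: 13-13';
my $diff = '2. Hypothetical, Protocol: UDP, Source Port: 0-65535, Destination Port: 80-443';

# Pattern 1: three capitals, anything, then a 2-3 digit number repeated after a dash.
my @same_hit = $same =~ /([A-Z]{3}).+(\d{2,3})-\2/;

# Pattern 2: the two numbers are allowed to differ.
my @diff_hit = $diff =~ /([A-Z]{3}).+(\d{2,3}-\d{2,3})/;

print "@same_hit\n";   # TCP 13
print "@diff_hit\n";   # UDP 80-443
```

Because '.+' is greedy, both patterns pick up the last matching number pair on the line, which is why the source port range is skipped here.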
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
|
|
Talexb, thanks for the reply. Please see my other reply to keszler. The protocol and destination ports are in there b/c that's how I can pick them out. The data can follow this structure:
something, Protocol: UDP, something, Destination Port: x-y, something, something, Protocol: TCP, something, Destination Port: a-b, something
So, reliably picking them out can be challenging without the field names. Also, the data I'm working with doesn't always follow this format, so I need to ignore the data that doesn't.
|
|
... and/or do it in stages: write one regex to grab an entire line, and then a second regex to break down what the first one grabbed.
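A minimal sketch of the staged approach, assuming a line shaped like the one described earlier in this thread. The sample line and variable names are hypothetical, and stage 1 here grabs one Protocol-to-port chunk at a time rather than a whole line, since a line may hold several pairs.

```perl
use strict;
use warnings;

# Sample line is an assumption, following the structure described in this thread.
my $line = 'a, Protocol: UDP, b, Destination Port: 53-53, c, d, '
         . 'Protocol: TCP, e, Destination Port: 80-80, f';

my @pairs;
# Stage 1: grab each chunk running from "Protocol:" to the port range after it.
for my $chunk ($line =~ /(Protocol:.*?Destination Port:\s*\S+)/g) {
    # Stage 2: break the chunk down into protocol and port range.
    my ($proto) = $chunk =~ /Protocol:\s*(\w+)/;
    my ($ports) = $chunk =~ /Destination Port:\s*([\d-]+)/;
    push @pairs, "$proto:$ports" if defined $proto && defined $ports;
}

print "@pairs\n";   # UDP:53-53 TCP:80-80
```

Each stage stays simple enough to read at a glance, at the cost of scanning the text twice.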
Re: Regex Question
by kcott (Archbishop) on Sep 10, 2013 at 19:46 UTC
|
$ perl -Mstrict -Mwarnings -le '
my $cell = <<EOD;
<td>
1. Network-Time, Protocol: TCP, Source Port: 0-65535, Destination Port: 13-13
2. Network-Time-1, Protocol: UDP, Source Port: 0-65535, Destination Port: 13-13
3. Network-Time-2, Protocol: TCP, Source Port: 0-65535, Destination Port: 37-37
4. Network-Time-3, Protocol: UDP, Source Port: 0-65535, Destination Port: 37-37
5. Network-Time-4, Protocol: UDP, Source Port: 0-65535, Destination Port: 123-123
</td>
EOD
my $re = qr{Protocol:\s+(\w+).*?Destination Port:\s+(\S+)}m;
my @extract;
push @extract, join(":", $1, $2) while $cell =~ /$re/g;
print "@extract";
'
TCP:13-13 UDP:13-13 TCP:37-37 UDP:37-37 UDP:123-123
Notes:
- The 'm' modifier marks the data as multiline. Strictly, it only changes where '^' and '$' match, and this pattern uses neither; what keeps each match on its own line is that '.' does not match a newline unless the 's' modifier is given.
- The '.*?' handles the greediness issue.
- I don't know what "$htmlStream->get_trimmed_text('/td')" does. It may be removing leading and trailing whitespace from each cell; it may be removing the '<td>' and '</td>' tags; it may be doing both of those; it may be doing something else. Regardless, it might be unnecessary with the solution I've provided: I added the tags to the test data to show this can handle an entire, multiline "<td>...</td>" block.
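To illustrate the greediness note, here is a small sketch comparing '.*' with '.*?' on a single line that holds two protocol/port pairs. The sample line is an assumption, and the port capture uses '[\d-]+' instead of '\S+' so that the trailing comma in this sample is not captured.

```perl
use strict;
use warnings;

# Sample line is an assumption: two protocol/port pairs on one line.
my $line = 'a, Protocol: UDP, b, Destination Port: 53-53, c, '
         . 'Protocol: TCP, d, Destination Port: 80-80, e';

# Greedy '.*' runs to the LAST "Destination Port:" on the line,
# so only one (mismatched) pair comes back.
my @greedy = $line =~ /Protocol:\s+(\w+).*Destination Port:\s+([\d-]+)/g;

# Non-greedy '.*?' stops at the NEAREST "Destination Port:",
# so each pair is found in turn.
my @lazy = $line =~ /Protocol:\s+(\w+).*?Destination Port:\s+([\d-]+)/g;

print "greedy: @greedy\n";   # greedy: UDP 80-80
print "lazy:   @lazy\n";     # lazy:   UDP 53-53 TCP 80-80
```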
Re: Regex Question
by keszler (Priest) on Sep 10, 2013 at 17:07 UTC
|
push @svcDesc, "$1:$2" while ($htmlStream->get_trimmed_text('/td') =~ /Protocol:\s(TCP|UDP)[^\n]+Destination\sPort:\s([0-9\-]+)/g);
Note that I removed the outer set of parens in the regex.
|
|
This is close. I should have pointed out that each cell can contain multiple lines. The sample text I posted was one cell. I'm having trouble with greediness and with an unknown number of occurrences on a single line. One line might contain: something, Protocol: UDP, something, Destination Port: x-y, something, something, Protocol: TCP, something, Destination Port: a-b, something
I need to extract each protocol/port pair into an array, and there's an unknown number. Protocol always comes before port, but not always immediately.
Re: Regex Question
by clegane (Novice) on Sep 10, 2013 at 19:35 UTC
|
Ok, I ended up figuring out the rest. It was a bunch of problems together. I had line breaks in the text, and get_trimmed_text was taking out tags and goofing up all sorts of stuff. Here's the solution I ended up with. Let me know if you have better ideas! Thanks for the help!!!
$svcDesc = $htmlStream->get_phrase('/td');
$svcDesc =~ s/\n/,/g;
push @svcPorts, "$1:$2" while ($svcDesc =~ /Protocol:\s(TCP|UDP).+?Destination\sPort:\s([0-9\-]+)/g);
s/:(\d+)\-\1/:$1/ for (@svcPorts);
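One possible refinement to the final substitution, offered as a sketch rather than a required fix: without an end anchor, the backreference can match just the leading digits of a longer second number. The value 'TCP:80-8080' below is a hypothetical range chosen to show that edge case; it does not appear in the original data.

```perl
use strict;
use warnings;

# Sample values are assumptions; "TCP:80-8080" is a hypothetical edge case.
my @unanchored = ('TCP:13-13', 'UDP:123-123', 'TCP:80-8080');
my @anchored   = @unanchored;

# Without an end anchor, \1 may match just the start of the second number.
s/:(\d+)-\1/:$1/ for @unanchored;

# With '$', the second number has to equal the first in full.
s/:(\d+)-\1$/:$1/ for @anchored;

print "@unanchored\n";   # TCP:13 UDP:123 TCP:8080
print "@anchored\n";     # TCP:13 UDP:123 TCP:80-8080
```

If every range in the real data has identical endpoints, both forms behave the same; the anchor only matters once genuine ranges show up.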
|
|
use strict;
use warnings;
my $cell = <<EOT;
1. Network-Time, Protocol: TCP, Source Port: 0-65535, Destination Port: 13-13
2. Network-Time-1, Protocol: UDP, Source Port: 0-65535, Destination Port: 13-13
3. Network-Time-2, Protocol: TCP, Source Port: 0-65535, Destination Port: 37-37
4. Network-Time-3, Protocol: UDP, Source Port: 0-65535, Destination Port: 37-37
5. Network-Time-4, Protocol: UDP, Source Port: 0-65535, Destination Port: 123-123
EOT
my @svcDesc = map { s/.*(TCP|UDP).*?Destination Port: ([0-9]+).*/$1:$2/; $_ } split /\n/, $cell;
print "@svcDesc\n";