Re: Regex word options
by state-o-dis-array (Hermit) on Oct 31, 2012 at 22:36 UTC
There's some stuff there I haven't seen before, but perhaps this might at least get you moving forward:
$work =~ / (\b(?:Transcribed|Arranged),\b? For [^,^\(^#^-]+)/ )
Re: Regex word options
by ww (Archbishop) on Nov 01, 2012 at 01:03 UTC
Not exactly as you stated the problem, but an extensible approach:
C:\>perl -E "my @work=(\"I Arranged a meet.\", \"Transcribe this\");
for $work(@work) {
if ($work =~ /Transcribe||Arranged/) {
say $work;
}
}"
I Arranged a meet.
Transcribe this
C:\>
The \b does nothing useful for the problem as you state it; alternation is better done with an || ("or"). Solving the captures as you need them is left as an exercise.
|
|
The || does not work as "or" inside a regular expression (well, it compiles, but the empty alternative between the two bars matches anything). Add some more test strings. Use a single | inside a regex, or use two regexes connected by ||.
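A minimal sketch of the difference, reusing the test strings from this thread. The helper name matching is made up for illustration; everything else is core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: return the strings that match the compiled pattern.
sub matching {
    my ($re, @strings) = @_;
    return grep { $_ =~ $re } @strings;
}

my @work = ("I Arranged a meet.", "Transcribe this", "Shud Not MATCH");

# Two bars: the empty alternative between them matches any string.
my @double = matching(qr/Transcribe||Arranged/, @work);

# One bar: plain alternation between the two words.
my @single = matching(qr/Transcribe|Arranged/, @work);

print "double bar: ", scalar(@double), " of ", scalar(@work), " match\n";
print "single bar: ", scalar(@single), " of ", scalar(@work), " match\n";
```

With the double bar all three strings match, including "Shud Not MATCH"; with the single bar only the first two do.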
|
|
Alas, I erred.
The honorable choroba is correct. Blame simple carelessness, intermittent and inadequate internet access in my second consecutive month on the road, and my plain error for the wrongful use of ||. As written above, it should be a single vertical bar; as choroba notes, it could also be written as:
#!/usr/bin/perl
use 5.010;    # enables say
my @work = ("I Arranged a meet.", "Transcribe this", "Shud Not MATCH");
for my $work (@work) {
    # two separate regexes joined by Perl's logical-or operator
    if (($work =~ /Transcribe/) || ($work =~ /Arranged/)) {
        say $work;
    }
}
Try this
by space_monk (Chaplain) on Nov 01, 2012 at 10:40 UTC
Input (regex.txt):
This line Transcribed, For David Jones #
This line Arranged, For Mike Johnson (great)
Terrible weather today
Program:
#!/bin/perl
my $file='regex.txt';
my $work = do {
local $/;
open my $fh, "<", $file
or die "could not open $file: $!";
<$fh>;
};
while ($work =~ /(Transcribed|Arranged),\s+For\s+([\w\s]+)/g) {
print "Match: $1 Person: $2\n";
}
Output:
Match: Transcribed Person: David Jones
Match: Arranged Person: Mike Johnson
Comments:
The \b (word boundary) assertions have been removed as they don't really do anything here. Note that it looks for (Transcribed|Arranged), as suggested by others. The code treats word characters (\w) and whitespace as the name, stopping at the first character that is neither, but this can easily be changed.
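As one possible way to change it, here is a sketch that ends the name at the first comma, open parenthesis, pound sign, or hyphen instead. The helper name find_credits is hypothetical, and the choice of terminators is an assumption borrowed from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: return [keyword, name] pairs found in $text.
sub find_credits {
    my ($text) = @_;
    my @found;
    # The name runs up to the first comma, open parenthesis, pound sign, or hyphen.
    while ($text =~ /(Transcribed|Arranged),\s+For\s+([^,(#-]+)/g) {
        my ($keyword, $name) = ($1, $2);
        $name =~ s/\s+\z//;    # drop whitespace captured just before the terminator
        push @found, [$keyword, $name];
    }
    return @found;
}

my $text = "This line Transcribed, For David Jones #\n"
         . "This line Arranged, For Mike Johnson (great)\n"
         . "Terrible weather today\n";

print "Match: $_->[0] Person: $_->[1]\n" for find_credits($text);
```

Unlike [\w\s]+, the negated class also accepts names containing characters such as & or apostrophes.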
|
|
Thank you all for your efforts. I see I should have been clearer about what I want. Here goes.
The script I'm writing will process about 180,000 lines of a text file, each of which is the title of a work of
music. The data is a mess; there's no consistency at all. My job is to put it into a consistent
format.
To take out the instrument type, I used this regex:
if ( $work =~ / (For [^,^\(^#^-]+)/ )
...
...to grab anything between the word "For" and a comma, open parenthesis, pound sign, or hyphen. No problem. But
I saw that some works have the text "Transcribed, For" OR "Arranged, For". In those cases, I want to grab
the word "Transcribed" OR "Arranged" as well.
state-o-dis-array's solution would seem to work, but doesn't. It doesn't even catch a simple example
like the first below:
INPUT:
Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Péchés De Vieillesse, Book 1), Qr Iv/30
À La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)
All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices
DESIRED POST-REGEX OUTPUT:
Tirana Alla Spagnola (Rossinizzatta), (Péchés De Vieillesse, Book 1), Qr Iv/30
À La Chapelle Sixtine, S 360 (Lw G26)
All Through The Night, Traditional Welsh Song
(NOTE that I'm saving the instrument type in another variable, but that's not the problem.)
To answer abualiga's question, I need the '?' non-greedy quantifier because not every line will have
"Transcribed" or "Arranged" in it. I realize I could use a few different regexes connected with the OR operator '||', but I have so many different cases that it starts getting very long and tedious. It may come to that, though.
To answer ww and space_monk, I think I need to have the '\b' word boundaries in there so I can use the '?'.
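For what it's worth, a '?' placed after a parenthesized group makes the whole group optional, and that does not depend on '\b'. A minimal sketch, assuming the same terminator characters as the regex above; the helper name instrument_part is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (keyword, instrument). The keyword is undef
# when the title has neither "Transcribed" nor "Arranged" before ", For".
sub instrument_part {
    my ($work) = @_;
    # The ? after the group makes the keyword optional; no \b is involved.
    if ($work =~ /(Transcribed|Arranged)?, For ([^,(#-]+)/) {
        my ($keyword, $instrument) = ($1, $2);
        $instrument =~ s/\s+\z//;    # trim whitespace captured before the terminator
        return ($keyword, $instrument);
    }
    return;
}

for my $work (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
) {
    my ($keyword, $instrument) = instrument_part($work);
    printf "%-12s %s\n", $keyword // '(none)', $instrument;
}
```

Because the regex engine tries the leftmost starting position first, the keyword is captured whenever it sits directly in front of ", For".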
|
|
>perl -wMstrict -le
"my @input = (
'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
);
;;
my $not_source = qr{ [^-,^(\x23] }xms;
my $kruft = qr{ \s* ,? \s* }xms;
my $ar_tr = qr{ Arranged | Transcribed }xms;
my $at_for = qr{
$kruft $ar_tr? $kruft For $not_source+ $kruft
}xms;
my $rx_title = qr{ (?! $at_for) . }xms;
;;
for (@input) {
print qq{[[$_]]};
my ($title, $source) =
m{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms
;
print qq{:$title: :$source:};
}
"
[[Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30]]
:Tirana Alla Spagnola (Rossinizzatta): :(Peches De Vieillesse, Book 1), Qr Iv/30:
[[A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)]]
:A La Chapelle Sixtine: :S 360 (Lw G26):
[[All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices]]
:All Through The Night, Traditional Welsh Song: ::
Updates:
- I have since tried this code as a regular source file with accented characters, and it seems to work.
- Actually, m{ \A \s* (.*?) $at_for (.*?) \s* \z }xms works just as well (for the limited test set), and is probably a bit faster.
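For anyone who prefers a script file to a Windows one-liner, the simplified match from the last update can be sketched as follows. The building-block regexes are copied from the code above; the helper name split_work is hypothetical, and only two of the sample titles are exercised here.

```perl
use strict;
use warnings;

# Building blocks copied from the one-liner.
my $not_source = qr{ [^-,^(\x23] }xms;    # anything but - , ^ ( #
my $kruft      = qr{ \s* ,? \s* }xms;     # optional comma with surrounding whitespace
my $ar_tr      = qr{ Arranged | Transcribed }xms;
my $at_for     = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms;

# Hypothetical helper: returns (title, source), or an empty list on no match.
sub split_work {
    my ($work) = @_;
    return $work =~ m{ \A \s* (.*?) $at_for (.*?) \s* \z }xms;
}

for my $work (
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
) {
    my ($title, $source) = split_work($work);
    print ":$title: :$source:\n";
}
```

The lazy (.*?) stops at the first position where $at_for can match, which is what makes the negative lookahead in $rx_title unnecessary for these inputs.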
|
|
#!/bin/perl
my $file='regex.txt';
open my $fh, "<", $file
or die "could not open $file: $!";
while (<$fh>) {
chomp;
if ( /(.*?)\s?(Transcribed|Arranged)?,\s+For\s+([^(,]+),?(.*)/) {
print "$1$4\n";
print STDERR "# Instrument: $3 T/A:$2\n";
}
}
Using your input, ./program.pl 2>/dev/null output is:
Tirana Alla Spagnola (Rossinizzatta)(Péchés De Vieillesse, Book 1), Qr Iv/30
À La Chapelle Sixtine, S 360 (Lw G26)
All Through The Night, Traditional Welsh Song
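Note that the first output line has lost the ", " that the desired output keeps between "(Rossinizzatta)" and "(Péchés". One possible adjustment, sketched under the assumption that the two remaining pieces should always be joined by a comma and a space; the helper name strip_instrument is made up, and the regex is unchanged from the program above.

```perl
use strict;
use warnings;

# Hypothetical helper built around the same regex as the program above.
sub strip_instrument {
    my ($line) = @_;
    if ($line =~ /(.*?)\s?(Transcribed|Arranged)?,\s+For\s+([^(,]+),?(.*)/) {
        my ($before, $after) = ($1, $4);
        $before =~ s/[,\s]+\z//;    # drop a dangling comma left before the keyword
        $after  =~ s/\A\s+//;       # drop leading whitespace of the remainder
        return length($after) ? "$before, $after" : $before;
    }
    return $line;                   # no ", For ..." part: leave the title alone
}

print strip_instrument($_), "\n" for (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
);
```

The sample strings use unaccented letters only to keep the sketch independent of source-file encoding.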
Re: Regex word options
by abualiga (Scribe) on Nov 01, 2012 at 02:51 UTC
I agree with ww. The '||' operator may be better suited here than the word boundary '\b'. Also, what are you trying to do with the '?' non-greedy quantifier? Are you looking for lines starting with these words? Perhaps I'm missing something, but you would probably benefit more from providing some input data.
|
|
if ( $inputStr =~ /^(Arranged|Transcribed),/ ) {
### do your stuff
}
Re: Regex word options
by space_monk (Chaplain) on Nov 04, 2012 at 15:07 UTC
chomp;
if ( /(.*?)\s?(Transcribed|Arranged)?,\s+For\s+([^(,]+),?(.*)/) {
print "$1$4\n";
print STDERR "# Instrument: $3 T/A:$2\n";
}
This produces the output requested given the sample input you provided.