Re: tagging question
by LassiLantar (Monk) on Jul 23, 2004 at 23:20 UTC
$string = "il asdfasdfasdf";
$string =~ s/([\S]{2})//;
$tag = "<$1>";
$string = $tag . "$string ". $tag;
Would do what you're asking for. I'm sure a true perl master could condense that into 1-2 lines, but I'm just a little perl footsoldier right now...
Peace,
LassiLantar
True, true. Again, I am outclassed =)
Peace,
LassiLantar
I'm pretty sure you can make this code shorter if you really want to, but I am curious why you chose to substitute all lines. That, to me, looks like a lot of useless hassle ;)
Anyways, for the OP, my €0,02:
while (<DATA>) {
    # anchor at the start of the line and end the output with a newline
    print "<$1> $1$2 <$1>\n" if m|^(\w{2})(.*)|;
}
__DATA__
il yadayadayada
df yadayadayada
--
b10m
All code is usually tested, but rarely trusted.
I must not be a true perl master, because I hacked on your program and it got BIGGER!
#!/usr/bin/perl
# you have to use strict and warnings unless you
# have a really good reason not to.
use strict;
use warnings;
my $string = "il asdfasdfasdf";
my $tag = "";
# use matching here instead of substitution
# all of the string should appear in the output
# also, don't need square brackets in match
if ($string =~ m/(\S{2})/)
{
    $tag = "<$1>";
}
# you don't need to concatenate, just interpolate the lot
$string = "$tag $string $tag";
print "string = $string\n";
__END__
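To make the difference between the two versions concrete, here's a small side-by-side sketch. The sub names are made up for the comparison (they're not from any module); the regexes are the ones from the original and from my version above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# both sub names are hypothetical, invented for this comparison

# substitution: the two-character code is removed from the string,
# so it no longer appears between the tags
sub tag_by_substitution {
    my ($string) = @_;
    my $tag = "";
    $tag = "<$1>" if $string =~ s/(\S{2})//;
    return "$tag$string $tag";
}

# matching: the string is left untouched, so the code stays in the output
sub tag_by_match {
    my ($string) = @_;
    my $tag = "";
    $tag = "<$1>" if $string =~ m/(\S{2})/;
    return "$tag $string $tag";
}

print tag_by_substitution("il asdfasdfasdf"), "\n";
print tag_by_match("il asdfasdfasdf"), "\n";
```

The first line printed should be <il> asdfasdfasdf <il> and the second <il> il asdfasdfasdf <il>, which is why I'd go with the match here.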
I must not be a true perl master, because I hacked on your program and it got BIGGER!
Gwuahaha! I am superior! (read: I am too lazy to write in use strict/use warnings on PM). I agree with you, use strict and warnings are totally necessary. I'm so lazy I even sometimes try to circumvent use strict by redeclaring my variables in random places, but really they're improving the way I write code. (As is sparring with the monks).
Peace,
LassiLantar
Re: tagging question
by graff (Chancellor) on Jul 24, 2004 at 04:17 UTC
I'm working on tagging a large linguistic corpus
Been there, done that. (Still there, doing it, in fact...)
What I need to do is add a tag around each line (<il> or <df> in the above cases) where the contents of the tag match the two character string at the head of each line:
<il> il yadayada <il>
Might you happen to be somewhat new to the area of markup languages (i.e. XML) also? You may want to double-check what the goal is supposed to be. Many people doing linguistic-related research would prefer to use real XML in their corpus data, and what you proposed is not real XML, despite having something in common with it (using angle brackets).
There are two things you should consider (maybe ask others in your group/research community to get their suggestions):
- The tags you add should be paired like this:
<tag> text content ... </tag>
Note the slash character in the second tag that marks the end of the region -- that's required.
- If the initial "token" on each line is really a classifier (i.e. an annotation that someone has added to the corpus data, rather than being part of the original spoken or written corpus content), then the XML tags ought to replace the classifier, rather than simply being placed around it.
On the second point, I could see wanting to leave the 2-letter code in the line, just to make sure you put the tags in the right way, but there are better ways to validate your process.
If I'm guessing right about what you really should be doing, your regex should just put angle brackets around the initial 2-character token, then make a copy of it at the end of the line with a slash added as needed. Something like this:
s{^(\w{2})(.*)}{<$1>$2 </$1>};
(I chose to use curlies around the regex and replacement, just so I wouldn't have to use a backslash-escape for the slash in the closing tag.)
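For anyone who wants to watch that substitution run over a few lines, here's a minimal sketch. The sub name tag_line and the sample lines are just made up for illustration; the regex is the one shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# tag_line is a hypothetical helper name, not part of any module.
# It replaces the leading two-character code with an opening tag
# and appends the matching closing tag; lines that don't start
# with two word characters come back unchanged (minus the newline).
sub tag_line {
    my ($line) = @_;
    chomp $line;
    $line =~ s{^(\w{2})(.*)}{<$1>$2 </$1>};
    return $line;
}

# sample lines invented for this example
for my $line ("il yadayadayada\n", "df yadayadayada\n") {
    print tag_line($line), "\n";
}
```

That should print <il> yadayadayada </il> and <df> yadayadayada </df>, i.e. the classifier ends up only inside the tags.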
(P.S.: Welcome to the Monastery!)
Re: tagging question
by Ovid (Cardinal) on Jul 23, 2004 at 23:49 UTC
If you just want this on the command line to read from one file and write to STDOUT (great for seeing that it works):
perl -pe '$_ = "<$1> $1$2 <$1>\n" if /^(\w{2})(.*)/' data.txt
Re: tagging question
by beable (Friar) on Jul 23, 2004 at 23:14 UTC
#!/usr/bin/perl
use strict;
use warnings;
# read in the data line by line
while (my $line = <DATA>)
{
    # chomp off the newline
    chomp $line;
    # see if we have a match of two letters at the start
    # of the line
    if ($line =~ m|^(\w{2})|)
    {
        # if it matched, add tags
        my $tag = $1;
        print "<$tag> $line <$tag>\n";
    }
    else
    {
        # if it didn't match, just print the line
        print "$line\n";
    }
}
__DATA__
il yadayadayada
df yadayadayada
Re: tagging question
by NetWallah (Canon) on Jul 23, 2004 at 23:09 UTC
my $x='il yadayadayada';
$x=~s/^(\w{2})(.*)/<$1> $1$2 <$1>/;
print $x;
-- output --
<il> il yadayadayada <il>
Update: beable's (++) nit noted and picked.
Earth first! (We'll rob the other planets later)
Dude, the output is supposed to be:
<il> il yadayada <il>
Therefore, you should have written this:
$x=~s/^(\w{2})(.*)/<$1> $1$2 <$1>/;
</nitpick>
Re: tagging question
by murugu (Curate) on Jul 24, 2004 at 07:56 UTC
while (<DATA>) {
    s#^(\w{2}).*#<$1>$&</$1># && print;
}
__DATA__
lg alkjslkjs
sl slksjlkjslkjs
slkjslkjs