bagerson has asked for the wisdom of the Perl Monks concerning the following question:

Hello All,

I'm working on tagging a large linguistic corpus, but being a Perl/programming newbie, I'm having some problems. The corpus is divided by line, and at the start of each line there is a delimiter, for instance:

il yadayadayada

df yadayadayada

What I need to do is add a tag around each line (<il> or <df> in the above cases) where the contents of the tag match the two-character string at the head of each line:

<il> il yadayada <il>

Does anyone have a snippet that would give me a clue as to how I get just the first two-letter string out and into a tag?

Thanks in advance.

Replies are listed 'Best First'.
Re: tagging question
by LassiLantar (Monk) on Jul 23, 2004 at 23:20 UTC

    If I understand your question properly...

    $string = "il asdfasdfasdf";
    $string =~ s/([\S]{2})//;
    $tag = "<$1>";
    $string = $tag . "$string ". $tag

    Would do what you're asking for. I'm sure a true perl master could condense that into 1-2 lines, but I'm just a little perl footsoldier right now...

    Peace,
    LassiLantar

      I'm sure a true perl master could condense that into 1-2 lines.

      But a true Perl master would go for clarity over conciseness unless the circumstances dictate otherwise.

      Cheers,
      Ovid

      New address of my CGI Course.

        True, true. Again, I am outclassed =)

        Peace,
        LassiLantar

      I'm pretty sure you can make this code shorter if you really want to, but I am curious why you chose to substitute all lines. That, to me, looks like a lot of useless hassle ;)

      Anyways, for the OP, my €0,02:

      while(<DATA>) {
          print "<$1>$1$2<$1>" if $_ =~ m|(\w{2})(.*)|;
      }
      __DATA__
      il yadayadayada
      df yadayadayada
      --
      b10m

      All code is usually tested, but rarely trusted.
        <snappy comeback> Well, he said each line had a tag on the beginning of it, so I figured it would be extra to deal with checking whether it did. </snappy comeback>

        <real excuse> Didn't think of it =) </real excuse>

        Peace,
        LA

      I must not be a true perl master, because I hacked on your program and it got BIGGER!
      #!/usr/bin/perl
      # you have to use strict and warnings unless you
      # have a really good reason not to.
      use strict;
      use warnings;

      my $string = "il asdfasdfasdf";
      my $tag = "";

      # use matching here instead of substitution
      # all of the string should appear in the output
      # also, don't need square brackets in match
      if ($string =~ m/(\S{2})/) {
          $tag = "<$1>";
      }

      # you don't need to concatenate, just interpolate the lot
      $string = "$tag $string $tag";
      print "string = $string\n";
      __END__
        I must not be a true perl master, because I hacked on your program and it got BIGGER!

        Gwuahaha! I am superior! (read: I am too lazy to write in use strict/use warnings on PM). I agree with you, use strict and warnings are totally necessary. I'm so lazy I even sometimes try to circumvent use strict by redeclaring my variables in random places, but really they're improving the way I write code. (As is sparring with the monks).

        Peace,
        LassiLantar

Re: tagging question
by graff (Chancellor) on Jul 24, 2004 at 04:17 UTC
    I'm working on tagging a large linguistic corpus
    Been there, done that. (Still there, doing it, in fact...)
    What I need to do is add a tag around each line (<il> or <df> in the above cases) where the contents of the tag match the two character string at the head of each line:

    <il> il yadayada <il>

    Might you happen to be somewhat new to the area of markup languages (i.e. XML) also? You may want to double-check what the goal is supposed to be. Many people doing linguistic-related research would prefer to use real XML in their corpus data, and what you proposed is not real XML, despite having something in common with it (using angle brackets).

    There are two things you should consider (maybe ask others in your group/research community to get their suggestions):

    1. The tags you add should be paired like this:
      <tag> text content ... </tag>
      Note the slash character in the second tag that marks the end of the region -- that's required.

    2. If the initial "token" on each line is really a classifier (i.e. an annotation that someone has added to the corpus data, rather than being part of the original spoken or written corpus content), then the XML tags ought to replace the classifier, rather than simply being placed around it.

    On the second point, I could see wanting to leave the 2-letter code in the line, just to make sure you put the tags in the right way, but there are better ways to validate your process.

    If I'm guessing right about what you really should be doing, your regex should just put angle brackets around the initial 2-character token, then make a copy of it at the end of the line with a slash added as needed. Something like this:

    s{^(\w{2})(.*)}{<$1>$2 </$1>};
    (I chose to use curlies around the regex and replacement, just so I wouldn't have to use a backslash-escape for the slash in the closing tag.)

    (P.S.: Welcome to the Monastery!)
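
For anyone who wants to try the paired-tag idea on more than one line, here is a minimal sketch rather than a definitive implementation. The helper name tag_line is hypothetical, and the \b word boundary is an added assumption (not in the substitution above) so that a line starting with a longer word is left alone. Unlike the one-line substitution, this version also drops the spaces just inside the tags.

```perl
use strict;
use warnings;

# Hypothetical helper: replace a leading two-character code with paired tags.
sub tag_line {
    my ($line) = @_;
    chomp $line;
    # Assumption: the code is a standalone two-character word, hence the \b.
    if ($line =~ /^(\w{2})\b\s*(.*)$/) {
        return "<$1>$2</$1>";
    }
    # Lines without a leading two-character code are returned unchanged.
    return $line;
}

print tag_line($_), "\n" for ("il yadayadayada", "df yadayadayada");
```

Returning the line unchanged when nothing matches makes it easy to spot untagged lines afterwards, for instance by searching the output for lines that do not start with an angle bracket.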

Re: tagging question
by Ovid (Cardinal) on Jul 23, 2004 at 23:49 UTC

    If you just want this on the command line to read from one file and write to STDOUT (great for seeing that it works):

    perl -pe '/^(\w{2})(.*)/;$_ = "<$1>$1$2<$1>\n"' data.txt

    Cheers,
    Ovid

    New address of my CGI Course.
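
One thing to keep in mind with the one-liner above: the assignment to $_ runs whether or not the match succeeded, so blank or otherwise non-matching lines get rewritten too. A rough script-form sketch that only touches matching lines could look like this (wrap_line is a hypothetical name, and keeping the original non-XML <il> ... <il> output is assumed to be what is wanted here):

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a line in tags built from its first two characters.
sub wrap_line {
    my ($line) = @_;
    # s/// leaves the line alone when the pattern does not match;
    # (.*) stops before the newline, so the line ending is preserved.
    $line =~ s/^(\w{2})(.*)/<$1>$1$2<$1>/;
    return $line;
}

print wrap_line("il yadayadayada\n");
print wrap_line("\n");
```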

Re: tagging question
by beable (Friar) on Jul 23, 2004 at 23:14 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    # read in the data line by line
    while (my $line = <DATA>) {
        # chomp off the newline
        chomp $line;
        # see if we have a match of two letters at the start
        # of the line
        if ($line =~ m|^(\w{2})|) {
            # if it matched, add tags
            my $tag = $1;
            print "<$tag> $line <$tag>\n";
        }
        else {
            # if it didn't match, just print the line
            print "$line\n";
        }
    }
    __DATA__
    il yadayadayada
    df yadayadayada
Re: tagging question
by NetWallah (Canon) on Jul 23, 2004 at 23:09 UTC
    Here is a snippet:
    my $x='il yadayadayada';
    $x=~s/^(\w{2})(.*)/<$1> $1$2 <$1>/;
    print $x;
    -- output --
    <il> il yadayadayada <il>
    Update:beable's (++) nit noted and picked.

        Earth first! (We'll rob the other planets later)

      Dude, the output is supposed to be:
      <il> il yadayada <il>
      Therefore, you should have written this:
      $x=~s/^(\w{2})(.*)/<$1> $1$2 <$1>/;
      </nitpick>
Re: tagging question
by murugu (Curate) on Jul 24, 2004 at 07:56 UTC

    My code is,

    while (<DATA>){
        s#^(\w{2}).*#<$1>$&<\/$1># && print
    }
    __DATA__
    lg alkjslkjs
    sl slksjlkjslkjs slkjslkjs