(The numeric character entities below are actual Chinese characters in the file.)
<file ID="A01">
<p>
<s n="0001"> <w POS="a">大</w> <w POS="n">墙</w> <w POS="f">内外</w> <c POS="w">--</c> <w POS="ns">&#21271;京市</w> <w POS="n">监狱</w> <w POS="n">纪实</w> <c POS="w">(</c> <w POS="m">三</w> <c POS="w">)</c> </s>
</p>
<p>
<s n="0002"> <w POS="nr">田</w> <w POS="nr">珍颖</w> </s>
(The newline is counted as one position, which is why the eleventh item ends up at position 17.)
1	大	a
2	墙	n
3	内外	f
5	--	w
7	北京市	ns
10	监狱	n
12	纪实	n
13	(	w
14	三	m
15	)	w
17	田	nr
18	珍颖	nr
The input file is encoded in UTF-8, and length($string) returns three times the number of characters in the string it measures. perldoc states that it should return the number of characters in the string.

#!/usr/bin/perl -w
use encoding 'utf-8';

die "Usage : $0 source_file(XML)\n" unless (@ARGV>=1);
open (FILEIN, "$ARGV[0]") || die "Unable to open source file $ARGV[0] : $!\n";
open (FILEOUT, ">:utf8", "$ARGV[0]_tok_cat.lst") || die "Unable to open destination file $ARGV[0]_tok_cat.lst : $!";

$pos = 0;
while (<FILEIN>) {
    if (m/^\<s/) {
        # each new <s> line advances the position by one (the newline)
        $pos++;
        #@tmp = split;
        foreach $item (split) {
            if ($item =~ m/\>.+\</) {
                $cat = $item;
                # keep only the token text in $item and only the POS tag in $cat
                $item =~ s/POS=\"[a-z]+\">(.+)\<\/[cw]>/$1/;
                $cat =~ s/POS=\"([a-z]+)\">.+\<\/[cw]>/$1/;
                print FILEOUT "$pos\t$item\t$cat\n";
                # advance the position by the length of the token
                $pos+=length($item);
            }
        }
    }
}
close FILEIN;
close FILEOUT;
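A factor of three is what one would expect if length() is counting UTF-8 bytes rather than characters, since each of these Chinese characters takes three bytes in UTF-8. The following is a minimal sketch of that effect, not a confirmed diagnosis of the script above. It assumes the input handle delivers undecoded bytes, and it uses an in-memory filehandle and the variable names $bytes, $chars and $line purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# The UTF-8 bytes of a three-character string (U+5317 U+4EAC U+5E02),
# as they would arrive from a filehandle opened without an encoding layer.
my $bytes = encode('UTF-8', "\x{5317}\x{4EAC}\x{5E02}");

# On an undecoded byte string, length() counts bytes.
my $byte_len = length $bytes;

# After decoding, length() counts characters.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;

# The same decoding can be done by an I/O layer at open time.
# An in-memory handle stands in for the real input file here.
open my $in, '<:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory handle: $!";
my $line = <$in>;
close $in;
my $layer_len = length $line;

print "bytes: $byte_len, decoded: $char_len, via layer: $layer_len\n";
# prints "bytes: 9, decoded: 3, via layer: 3"
```

If this is indeed the cause, the corresponding change in the script would be to open the input with an explicit layer, for example open(FILEIN, '<:encoding(UTF-8)', $ARGV[0]), in the same way the output handle is already opened with ">:utf8". As far as I know, the encoding pragma applies to the source text and to STDIN and STDOUT, not to other filehandles.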
In reply to length() and Unicode by Lu.