Backtracking for substitutions

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Backtracking for substitutions by chromatic (Archbishop) on Oct 02, 2000 at 01:33 UTC
Here's what I did:

    #!/usr/bin/perl -w
    use strict;

    my @lines = (
        'Bob Smith bsmith 00001234567 01/01/1986 00:00:00',
        'Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:00 00001234563 01/01/1986 00:00:00',
        'Gilligan Q Smith gsmith 00001234569 01/01/1986 00:00:00');

    foreach my $line (@lines) {
        $line =~ s/(\s[a-z]+\s0{4})/,$1/g;
        print "New line:\n'$line'\n";
    }

I didn't use \w because that goes for alphanumerics. I matched the whole login name with the character class [a-z]+, with one whitespace character before and after it. I didn't need to match the rest of the string, because that particular combination only occurs once in your dataset.

If I were doing this on my own, I might look into split. I like split. Split on the token (you know it's an n-digit number):

    my ($user, $token, $login) = split(/\s(\d{11})\s/, $line, 2);
    (my $username = $user) =~ s/\s(\w+)$//;    # untested construct
    print "$username, claiming to be $user logged on with token $token, at $login.\n";

Then again, the regex seems pretty straightforward.
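For anyone who wants to try the `split` idea, here is a minimal sketch rather than chromatic's tested code. `parse_record` is a hypothetical helper name, and it assumes every token is exactly 11 digits and that the login is the last whitespace-separated word before the first token.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread.
# Assumes: tokens are exactly 11 digits; the login is the last word
# before the first token.
sub parse_record {
    my ($line) = @_;

    # A limit of 2 splits only at the first token. Because the token is
    # captured, split returns it between the two pieces.
    my ($user, $token, $rest) = split /\s(\d{11})\s/, $line, 2;

    # Separate the real name from the trailing login.
    my ($name, $login) = $user =~ /^(.*)\s(\S+)$/;

    return ($name, $login, $token, $rest);
}

print join(',', parse_record('Bob Smith bsmith 00001234567 01/01/1986 00:00:00')), "\n";
# prints: Bob Smith,bsmith,00001234567,01/01/1986 00:00:00
```

On a line with several tokens, the last field still holds the later tokens and timestamps, so it would need a second pass.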
Re: Backtracking for substitutions by japhy (Canon) on Oct 02, 2000 at 02:24 UTC
Your only problem was that `\w` matches a word character, not a word. Add a `+` after it, and you should be fine. `$_="goto+F.print+chop;\n=yhpaj";F1:eval`
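The difference is easy to see in isolation. This is only an illustration: the original poster's regex is not shown in the thread, so the sample login `bsmith` is borrowed from the data used in the other replies, and `capture` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture of $pattern in $text.
sub capture {
    my ($pattern, $text) = @_;
    my ($got) = $text =~ $pattern;
    return $got;
}

my $one  = capture(qr/^(\w)/,  'bsmith');   # exactly one word character
my $many = capture(qr/^(\w+)/, 'bsmith');   # one or more word characters

print "\\w captured '$one', \\w+ captured '$many'\n";
# prints: \w captured 'b', \w+ captured 'bsmith'
```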
Re: Backtracking for substitutions by extremely (Priest) on Oct 02, 2000 at 01:44 UTC
Are all the names guaranteed to start with capitals? Are all the logins lowercase only? If you can match on the case variation, this works:

    foreach (<>) {
        s/\s+(?=[a-z])/,/;
        s/\b([a-z]+)\s/$1,/;
        s/\b\s+\b(\d\d\/)/,$1/g;
        s/(:\d\d)\s\b/$1,/g;
        print;
    }

It ain't purty but it works.

-- $you = new YOU; honk() if $you->love(perl)
Re: Backtracking for substitutions by 2501 (Pilgrim) on Oct 02, 2000 at 08:27 UTC
    my @lines = (
        'Bob Smith bsmith 00001234567 01/01/1986 00:00:00',
        'Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:00 00001234563 01/01/1986 00:00:00',
        'Gilligan Q Smith gsmith 00001234569 01/01/1986 00:00:00');

    foreach $line (@lines){
        $data = $line;
        $line =~ s/^(\D+)(.)/$1/;
        $data =~ s/^(\D+)(.)/$2/;
        $data =~ s/(\d+)\s(\w+)\s(\w+)/$line,$1,$2,$3,\n/g
    }

This is still a bit rough around the edges, but I hope you will get the idea. I took the original question to mean that he wanted extra tokens identifiable on separate lines in a CSV format, so Mary would end up having two lines in the final output. What I tried to do was store the alpha text, identify the data, break the data down into 'sets', and finish it up by attaching the alpha text and delimiting it.

Now that I think about it, would a \w pick up slashes and colons? Maybe it needs an '|' in there also :P
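On the closing question: `\w` matches only letters, digits, and underscore, so it will not pick up `/` or `:`. Below is a rough sketch of the one-line-per-token idea with the date and time matched explicitly. It is not 2501's code: `explode_record` is a hypothetical helper, and it assumes that neither the name nor the login contains digits and that tokens are 11 digits long.

```perl
use strict;
use warnings;

# Hypothetical helper: return one CSV line per token, repeating the
# leading non-digit text (name plus login) on every line.
sub explode_record {
    my ($line) = @_;

    # Everything before the first digit, minus the separating space.
    my ($who) = $line =~ /^(\D+)\s/ or return;

    my @rows;
    # Walk through each token/date/time set in turn.
    while ($line =~ m{(\d{11})\s+(\d\d/\d\d/\d{4})\s+(\d\d:\d\d:\d\d)}g) {
        push @rows, join ',', $who, $1, $2, $3;
    }
    return @rows;
}

print "$_\n" for explode_record(
    'Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:00 00001234563 01/01/1986 00:00:00');
# prints:
# Mary Ann Doe mdoe,00001234568,01/01/1986,00:00:00
# Mary Ann Doe mdoe,00001234563,01/01/1986,00:00:00

# \w does not match slashes or colons.
print "slash: ", ('/' =~ /\w/ ? 'yes' : 'no'), ", colon: ", (':' =~ /\w/ ? 'yes' : 'no'), "\n";
# prints: slash: no, colon: no
```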
Re: Backtracking for substitutions by runrig (Abbot) on Oct 02, 2000 at 04:18 UTC
In two steps:

    s|^(.*?)\s(\w+)\s(0000\d+)\s(\d+/\d+/\d+\s\d+:\d+:\d+)|$1,$2,$3,$4|;
    s|\s(0000\d+)\s(\d+/\d+/\d+\s\d+:\d+:\d+)|,$1,$2|g;
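The same two steps can be spelled out with the `/x` modifier so the pieces are easier to read. This is a sketch of runrig's approach rather than a replacement for it. The `delimit` sub and the `$stamp` variable are names invented here, and the code keeps the thread's assumption that every token starts with `0000`.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two substitutions.
sub delimit {
    my ($line) = @_;

    # A date, one whitespace character, then a time.
    my $stamp = qr{\d+/\d+/\d+\s\d+:\d+:\d+};

    # Step 1: name, login, first token, first timestamp.
    $line =~ s{^(.*?) \s (\w+) \s (0000\d+) \s ($stamp)}{$1,$2,$3,$4}x;

    # Step 2: any further token/timestamp pairs on the same line.
    $line =~ s{\s (0000\d+) \s ($stamp)}{,$1,$2}gx;

    return $line;
}

print delimit('Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:00 00001234563 01/01/1986 00:00:00'), "\n";
# prints: Mary Ann Doe,mdoe,00001234568,01/01/1986 00:00:00,00001234563,01/01/1986 00:00:00
```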
A two-liner for Backtracking for substitutions by Anonymous Monk on Oct 02, 2000 at 23:02 UTC
My solution is really just two lines of substitution code. Here they are:

    $mydata =~ s/([\w\s]+)\s([\w\d]+)\s(0000.)/$1,$2,$3/g;
    $mydata =~ s/(:\d{2})\s(0000)/$1,$2/g;

And if you're not easily overwhelmed by lots of documentation, here's the whole program with setup code, comments, and output:

    #!/usr/bin/perl
    # NODE34853.pl
    # Assumptions: There are potentially any number of parts of a user's name.
    # For example, "Bill Clinton" might be a user's name, but
    # "William Jefferson Clinton the Liar" might also be his name.
    # The username is next, which is always one word long. Might also have numbers
    # in it, such as Bill69.
    # The data is currently in a single scalar (as though you read it from a flat file).
    # And, it's not clear what granularity you want the data to have. I'm assuming
    # that you want the user's name, his username, and the individual chunks of login
    # data. Do you also want to split up the login data? Your post didn't say.
    #
    # Knowing what you want to do with this data afterwards would also help. If you want to
    # load this into a SQL database, then you'd probably want to do this a bit differently.
    # But, if your goal is just to comma-delimit the file so you can load it into
    # a spreadsheet, then this oughta do the trick.
    #
    # This solution is really just a two line program with lots of comments and some
    # stuff to set up the environment and print the results.
    # I hope it helps.
    # --Mark
    #
    # This line just sets up the scalar variable you want to parse.
    # I'm assuming you have other methods of doing this (reading from
    # CSV, etc.)
    $mydata = <<ENDDATA;
    Bob Smith bsmith 00001234567 01/01/1986 00:00:00
    Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:01 00001234563 01/01/1986 00:00:02 00001234563 01/01/1986 00:00:03
    Gilligan Q Smith gsmith 00001234569 01/01/1986 00:00:01 00001234569 01/01/1986 00:00:02
    ENDDATA

    # The purpose of this regex is just to split out the user's NAME,
    # USERNAME, and associated DATA. We're leaving the guts of the DATA alone for now.
    $mydata =~ s/([\w\s]+)\s([\w\d]+)\s(0000.)/$1,$2,$3/g;

    #MyData temporarily looks like this:
    #Bob Smith,bsmith,00001234567 01/01/1986 00:00:00
    #Mary Ann Doe,mdoe,00001234568 01/01/1986 00:00:01 00001234563 01/01/1986 00:00:02 00001234563 01/01/1986 00:00:03
    #Gilligan Q Smith,gsmith,00001234569 01/01/1986 00:00:01 00001234569 01/01/1986 00:00:02

    # Now, let's split up the DATA parts by looking for the space between the :00 and 0000.
    # The 0000 is captured so that it can be put back after the comma.
    $mydata =~ s/(:\d{2})\s(0000)/$1,$2/g;

    print "All done. MyData now looks like this\n$mydata\n\n";
    #Bob Smith,bsmith,00001234567 01/01/1986 00:00:00
    #Mary Ann Doe,mdoe,00001234568 01/01/1986 00:00:01,00001234563 01/01/1986 00:00:02,00001234563 01/01/1986 00:00:03
    #Gilligan Q Smith,gsmith,00001234569 01/01/1986 00:00:01,00001234569 01/01/1986 00:00:02

I hope this helps. Let us know. --Mark
(Ovid - Common regex error) RE: A two-liner for Backtracking for substitutions by Ovid (Cardinal) on Oct 02, 2000 at 23:16 UTC
I didn't go over your code in detail, but I did notice a common regex error: `[\w\d]`

Many people (including yours truly at one time) mistakenly assume that `\w` does not match 0-9. Surprise! It does. This caused me a problem when I was trying to do the following:

    my $text = "product1234imageSmall.jpg";
    ($type, $id, $property) = ($1, $2, $3) if $text =~ /^(\w+)(\d+)(\w+)/;

It failed pretty quickly because `$type` was getting set to `product123` (it didn't pick up the "4" because `\d` had to match something). In this case, because you are including both `\w` and `\d` in a character class, the only issue is redundancy, and it doesn't affect the functioning of the regex. I just wanted to point this out because it's easy to miss that and get bitten in other situations.

Cheers, Ovid

Join the Perlmonks Setiathome Group or just go to the link and check out our stats.
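One way to see the greedy behaviour, and one possible fix, is sketched below. The fix shown (letters-only character classes) is an assumption about what was wanted, not something Ovid posted, and both sub names are hypothetical.

```perl
use strict;
use warnings;

# The pattern from the post: the first \w+ grabs digits too, then backs
# off just far enough to leave a single digit for \d+.
sub greedy_parse {
    my ($text) = @_;
    return $text =~ /^(\w+)(\d+)(\w+)/;
}

# Assumed fix: letters-only classes cannot swallow any digits.
sub letters_parse {
    my ($text) = @_;
    return $text =~ /^([a-zA-Z]+)(\d+)([a-zA-Z]+)/;
}

my $text = 'product1234imageSmall.jpg';
print join(' ', greedy_parse($text)),  "\n";   # prints: product123 4 imageSmall
print join(' ', letters_parse($text)), "\n";   # prints: product 1234 imageSmall
```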
RE: (Ovid - Common regex error) RE: A two-liner for Backtracking for substitutions by markwild (Sexton) on Oct 03, 2000 at 02:44 UTC
Thanks Ovid! I appreciate the friendly amendment. How does the rest of the code look? I'd also like to hear from the original anonymous poster. Did he get his problem solved? --Mark
(Ovid - Regex efficiency issues) RE(3): A two-liner for Backtracking for substitutions by Ovid (Cardinal) on Oct 03, 2000 at 03:10 UTC
RE: A two-liner for Backtracking for substitutions by markwild (Sexton) on Oct 02, 2000 at 23:05 UTC
Whoops. Forgot to log in before I sent that last post. Still hoping it helps. --Mark