Basic regex to parse source code

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks.

I need help getting a regex to work better. My original "junk source" consists of


window.Grid1 = new ComponentArt_Grid('Grid1');
Grid1.Data = [[33,'DOR00393','ActiveState ActivePerl 5.8.8 Build 819',0,'Software','No','Yes','No','Yes','Yes','No','No','No','No','No'],[493,'STP00A82','ActiveState Perl Dev DOR Modules 1.0',0,'Software','No','Yes','No','Yes','No','No','No','No','No','No'],[34,'DOR00394','ActiveState Perl Dev Kit 6.0 Pro Pack',0,'Software','No','No','No','Yes','Yes','No','No','No','No','No'],[764,'','AD Group Request:  Rev Reports - Modify Access',1,'General','No','Yes','Yes','Yes','Yes','Yes','No','No','No','No'],[81,'STP0028A','Adesso Cyber Pad Software Suite 3.13',0,'Software','No','Yes','No','Yes','Yes','No','No','No','No','No'],[371,'STP009FB','Adesso Cyberpad 3.14

The developer of the system we use didn't believe in using line breaks, which makes it that much more difficult to parse line by line. I have 400+ records that look like this and that I need to parse information out of. So far I have had to copy that junk code into Notepad and add newlines to separate the data. The end result looks like

window.Grid1 = new ComponentArt_Grid('Grid1');
Grid1.Data = [
[33,'DOR00393','ActiveState ActivePerl 5.8.8 Build 819',0,'Software','No','Yes','No','Yes','Yes','No','No','No','No','No'],
[493,'STP00A82','ActiveState Perl Dev DOR Modules 1.0',0,'Software','No','Yes','No','Yes','No','No','No','No','No','No'],
[34,'DOR00394','ActiveState Perl Dev Kit 6.0 Pro Pack',0,'Software','No','No','No','Yes','Yes','No','No','No','No','No'],
[764,'','AD Group Request:  Rev Reports - Modify Access',1,'General','No','Yes','Yes','Yes','Yes','Yes','No','No','No','No'],
[81,'STP0028A','Adesso Cyber Pad Software Suite 3.13',0,'Software','No','Yes','No','Yes','Yes','No','No','No','No','No'],
[371,'STP009FB','Adesso Cyberpad 3.14
.................

As you can see, pretty much every record ends in a comma or a ]. This extra editing before my parse is becoming tedious. Can someone help me get this to work using the same junk source from above?

The information I'm pulling for each record is the id number, which is the first number after the [, and the title of the program, which is in the 2nd set of ''. For example:

[33,'DOR00393','ActiveState ActivePerl 5.8.8 Build 819',0,'Software','No','Yes','No','Yes','Yes','No','No','No','No','No'],
# I would need 33 to be in $1 and ActivePerl 5.8.8 Build 819 to be in $2


Replies are listed 'Best First'.
Re: Basic regex to parse source code by toolic (Bishop) on Jul 13, 2009 at 16:29 UTC
"As you can see, pretty much every record ends in a comma or a ]. This extra editing before my parse is becoming tedious. Can someone help me get this to work using the same junk source from above?"

This is one way (amongst many). You could use the substitution operator to inject a newline character wherever you have '],'.

use strict;
use warnings;

while (<DATA>) {
    chomp;
    s/;/;\n/g;
    s/],/],\n/g;
    print;
}
__DATA__
window.Grid1 = new ComponentArt_Grid('Grid1');
Grid1.Data = [[33,'DOR00393','ActiveState ActivePerl 5.8.8 Build 819',0,'Software','No','Yes','No','Yes','Yes','No','No','No','No','No'],[493,'STP00A82','ActiveState Perl Dev DOR Modules 1.0',0,'Software','No','Yes','No','Yes','No','No','No','No','No','No'],[34,'DOR00394','ActiveState Perl Dev Kit 6.0 Pro Pack',0,'Software','No','No','No','Yes','Yes','No','No','No','No','No'],[764,'','AD Group Request:  Rev Reports - Modify Access',1,'General','No','Yes','Yes','Yes','Yes','Yes','No','No','No','No'],[81,'STP0028A','Adesso Cyber Pad Software Suite 3.13',0,'Software','No','Yes','No','Yes','Yes','No','No','No','No','No'],[371,'STP009FB','Adesso Cyberpad 3.14
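If the goal is to feed a line-oriented loop (one record per array element) rather than to print the reformatted text, the same '],' boundary can be used to build the array directly. The following is only a sketch: the sub name split_records and the abbreviated three-field sample records are assumptions made for illustration, not part of the original data or of the reply above.

```perl
use strict;
use warnings;

# Hypothetical helper: break the one-line dump into one string per record
# by inserting a newline after every "]," and splitting on it.
sub split_records {
    my ($text) = @_;            # works on a copy; the caller's string is untouched
    $text =~ s/\],/],\n/g;      # newline after each record terminator
    return split /\n/, $text;
}

# Abbreviated sample (assumption): real records carry many more fields.
my $raw = "Grid1.Data = [[33,'DOR00393','ActiveState ActivePerl 5.8.8 Build 819',0],"
        . "[493,'STP00A82','ActiveState Perl Dev DOR Modules 1.0',0],"
        . "[34,'DOR00394','ActiveState Perl Dev Kit 6.0 Pro Pack',0";

my @log = split_records($raw);
print scalar(@log), " records\n";
print "$_\n" for @log;
```

Each element of @log then holds at most one record, so a per-line regex only ever sees a single id and title.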
Re: Basic regex to parse source code by ELISHEVA (Prior) on Jul 13, 2009 at 16:21 UTC
What have you tried? We are happy to help you learn the Perl needed to solve this problem yourself, but you need to make the first step. Please post some code with your best guess at how to solve this problem.

Some documentation to help you:

perlretut explains how to use regular expressions.
perlopentut explains how to open a file, process it line by line, and then write out the result.

Best, beth
Re^2: Basic regex to parse source code by Anonymous Monk on Jul 13, 2009 at 16:25 UTC
This is what I have (roughly).

my @output;
foreach my $line (@log) {
    $line =~ m#\[(\d+)\,\'(.+)\]#i;
    print "we found $1 and $2\n\n";
    my $one = $1;
    my $two = $2;
    $two =~ s/'//g;
    print LOG "<a href='http://stpap11/EAD/administrator/EditPackage.aspx?PackageID=$one&SMSPackageID=$one&Title=&PackageType=0' target='new'>$two</a><br>";
}
Re^3: Basic regex to parse source code by CountZero (Bishop) on Jul 13, 2009 at 21:04 UTC
In your regex, you do not have to escape the comma, and your `(.+)` will slurp everything up to the last `]` in your line, which is definitely not what you want. You only need the characters up to the next `'`, so the (negated) character class `[^']+` will do what you want.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
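The difference between the greedy `(.+)` and a negated character class can be seen on a single record. This is a small sketch: the sub names greedy_capture and quoted_capture are made up for the demonstration, and the sample line is a shortened copy of the first record in the question. Note that the negated class here captures the first quoted field (the package code); reaching the title takes one more quoted field in the pattern.

```perl
use strict;
use warnings;

# Shortened copy of the first record from the question.
my $line = "[33,'DOR00393','ActiveState ActivePerl 5.8.8 Build 819',0,'Software','No'],";

# Greedy: (.+) runs to the end of the line, then backtracks to the LAST "]".
sub greedy_capture {
    my ($text) = @_;
    return $text =~ m#\[(\d+),'(.+)\]# ? $2 : undef;
}

# Negated class: [^']* cannot cross a single quote, so it stops at the next one.
sub quoted_capture {
    my ($text) = @_;
    return $text =~ m#\[(\d+),'([^']*)'# ? $2 : undef;
}

print "greedy:  ", greedy_capture($line), "\n";
print "negated: ", quoted_capture($line), "\n";
```

The greedy version returns nearly the whole record (DOR00393','ActiveState ActivePerl 5.8.8 Build 819',0,'Software','No'), while the negated class returns only DOR00393.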
Re: Basic regex to parse source code by CountZero (Bishop) on Jul 13, 2009 at 20:59 UTC
Without (p)re-formatting the code first:

use strict;

while (<DATA>) {
    # The second field can be empty (see record 764), so it is matched with [^']*.
    my @results = m/\[(\d+),'[^']*','([^']+)'/g;
    local $, = ',';
    print @results;
}
__DATA__
window.Grid1 = new ComponentArt_Grid('Grid1');
Grid1.Data = [[33,'DOR00393','ActiveState ActivePerl 5.8.8 Build 819',0,'Software','No','Yes','No','Yes','Yes','No','No','No','No','No'],[493,'STP00A82','ActiveState Perl Dev DOR Modules 1.0',0,'Software','No','Yes','No','Yes','No','No','No','No','No','No'],[34,'DOR00394','ActiveState Perl Dev Kit 6.0 Pro Pack',0,'Software','No','No','No','Yes','Yes','No','No','No','No','No'],[764,'','AD Group Request:  Rev Reports - Modify Access',1,'General','No','Yes','Yes','Yes','Yes','Yes','No','No','No','No'],[81,'STP0028A','Adesso Cyber Pad Software Suite 3.13',0,'Software','No','Yes','No','Yes','Yes','No','No','No','No','No'],[371,'STP009FB','Adesso Cyberpad 3.14
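A global match in list context returns one flat list (id, title, id, title, ...). If each id needs to stay paired with its title, for instance to build one link per record, a `while` loop over the same kind of pattern in scalar context may be easier to work with. The sketch below is an illustration only: the sub name extract_records and the shortened sample string are assumptions, and the pattern assumes that titles never contain a single quote.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [id, title] pairs found in $text.
sub extract_records {
    my ($text) = @_;
    my @records;
    # \[(\d+)     id: the digits right after an opening bracket
    # '[^']*'     second field (package code), which may be empty
    # '([^']*)'   third field: the title
    while ($text =~ /\[(\d+),'[^']*','([^']*)'/g) {
        push @records, [ $1, $2 ];
    }
    return @records;
}

# Shortened sample (assumption): two records cut down to five fields each.
my $sample = "Grid1.Data = [[33,'DOR00393','ActiveState ActivePerl 5.8.8 Build 819',0,'Software'],"
           . "[764,'','AD Group Request:  Rev Reports - Modify Access',1,'General'],";

for my $rec (extract_records($sample)) {
    my ($id, $title) = @$rec;
    print "$id => $title\n";
}
```

A record whose closing quote is missing, such as the truncated last entry in the sample data, simply does not match and is skipped.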
Re: Basic regex to parse source code by spazm (Monk) on Jul 13, 2009 at 17:36 UTC
Have you considered perltidy to reformat this code? (Perl::Tidy)