Unfortunately, the text can contain whitespace, several words, sometimes comma separated.
The second example would not work because I chose a misleading example. The labels are not necessarily named Label1, Label2, ... LabelN. They can be different words. A better example would be:
"some text .... Programming Languages: C++, Java Author: John Date Created: 2004-01-05 10:23 ....."
In this case, I would need to extract the string: "C++, Java".
Again, thanks for taking the time to look at this.
Marius
|
Re^3: RegEx question
by davido (Cardinal) on Oct 01, 2006 at 05:24 UTC
Well then, you've got a problem, but I'm going to propose a solution. First, I'll try to explain the problem.

If one label is "Programming Languages:", another is "Author:", and another is "Date Created:", you clearly cannot count on the labels not containing whitespace. And if your data fields are "C++, Java", "John", and "2004-01-05 10:23...", you clearly cannot count on your data fields not containing whitespace either. Your fields aren't of fixed width. And your delimiter (the colon) appears mid-record, so it's more of an anchor than a delimiter, which doesn't help tremendously. What that leaves you with is this: no good way of determining where a data field ends and where a new label starts.

Unless, of course, you're lucky enough to know all the possible labels. Then you could skim for known labels instead. That would be helpful. For example, if you know that the only labels in the text are "Programming Languages", "Author", and "Date Created", you could compose your regular expression like this:
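A minimal sketch of such an expression, assuming the three labels above really are the complete set. The names @labels, $label_re and %field are illustrative only, and the sample string is the one from the question:

```perl
use strict;
use warnings;

# Sample text from the question.
my $text = 'some text .... Programming Languages: C++, Java '
         . 'Author: John Date Created: 2004-01-05 10:23';

# Assumption: this is the complete list of labels that can occur.
my @labels = ('Programming Languages', 'Author', 'Date Created');

# Build one alternation from the labels. quotemeta escapes regex
# metacharacters; longer labels go first so they win over shorter ones.
my $label_re = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a } @labels;

my %field;
# Match a known label and its colon, then lazily capture the data up to
# (but not including) the next known label or the end of the string.
while ( $text =~ /($label_re):\s*(.*?)\s*(?=(?:$label_re):|\z)/gs ) {
    my ( $label, $data ) = ( $1, $2 );
    $field{$label} = $data;
    print "$label => $data\n";
}
```

Because the lookahead consumes nothing, the next iteration starts right at the following label, so labels can appear in any order and missing ones are simply never matched.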
This will capture the known label into $label on each iteration, and then the field following the label into $data. Each match stops as soon as the lookahead assertion finds the next known label, or the end of the string.
Dave
|
Re^3: RegEx question
by GrandFather (Saint) on Oct 01, 2006 at 05:29 UTC
Are you getting the original data one entry at a time, or are several entries munged together? The solution is simple if you get the data one entry per line and there is no more than one word for the second label ('Author' in your example). A slight modification of McDarren's sample is what you are after:
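A rough sketch of that idea, under the two assumptions just stated (one entry per line, and a single-word label directly after the wanted field). The @lines array is a hypothetical stand-in for the real input, and the pattern is a guess at the kind of modification meant:

```perl
use strict;
use warnings;

# Hypothetical input: one entry per line.
my @lines = (
    'some text .... Programming Languages: C++, Java '
        . 'Author: John Date Created: 2004-01-05 10:23',
);

my @languages;
for my $line (@lines) {
    # Capture lazily up to the next single-word label, i.e. the first
    # run of word characters that is immediately followed by a colon.
    if ( $line =~ /Programming Languages:\s*(.*?)\s*\w+:/ ) {
        push @languages, $1;
        print "$1\n";
    }
}
```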
Prints:
DWIM is Perl's answer to Gödel
|
Re^3: RegEx question
by McDarren (Abbot) on Oct 01, 2006 at 05:05 UTC
Could you post 3-4 full lines of the actual data that you are working with? (edit anything that may be sensitive, of course). Update: Just looked at this again. Will the word "Author" always follow the text that you need to extract? If yes, then you could probably just do:
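Something along these lines, as a sketch only. It assumes the field is always introduced by "Programming Languages:" and always followed by "Author:"; the sample string is the one from the question:

```perl
use strict;
use warnings;

my $text = 'some text .... Programming Languages: C++, Java '
         . 'Author: John Date Created: 2004-01-05 10:23';

# Capture everything between the known label and the word "Author:",
# trimming the whitespace on either side of the captured field.
my ($languages) = $text =~ /Programming Languages:\s*(.*?)\s*Author:/;

print defined $languages ? "$languages\n" : "no match\n";
```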
But again, it's difficult to say for certain without seeing a few more lines of data and having the requirements clarified a bit.
by mariuspopovici (Initiate) on Oct 01, 2006 at 14:01 UTC
I apologize for not explaining this better. Here's what the actual data looks like:
This is basically a SourceForge project summary page from which I strip all the HTML, ending up with this text. I want to be able to extract as many project attributes as possible, such as Programming Language, Translations, Developers, Activity, Topic and so on. Here's where the problem comes in: some projects have certain attributes and other projects don't. For example, not all projects have the "User Interface Attribute" or the "Donors" attribute. That means I can't depend on the order in which these attributes appear in the text. So, because the order is inconsistent from one project to another, I can't do something like:
and would have to rely on something else. Is there any way to extract the pattern that matches a label and have something like:
? This way I can make a list of all possible labels and look for each one individually.
by Hue-Bond (Priest) on Oct 01, 2006 at 14:12 UTC
"This is basically, a SourceForge project summary page where I strip all HTML and end up with this text."
It's better to use one of the many HTML modules available: HTML::Parse, HTML::Parser, HTML::TreeBuilder, HTML::SimpleParse, HTML::TagParser, HTML::TableExtract, YAPE::HTML... geez, I didn't know there were so many!
|
Re^3: RegEx question
by shmem (Chancellor) on Oct 01, 2006 at 08:42 UTC
It's annoying to find out that a given solution doesn't fit because the problem wasn't stated clearly in the first place. --shmem
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ /
/\_¯/(q /
---------------------------- \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}