Re: Find a specific word in a text file
by ysth (Canon) on Sep 09, 2004 at 06:13 UTC
|
If you are going to analyze character by character, you probably want the whole file in one string and to find just
the offset of your word. The first part is called slurping the file; one method is to set $/ to undef (see perlvar), then read the file with <> or readline; this will read the whole file. If you then match in scalar context using the g flag, the pos of the string will be set to the end of your match. You can then examine nearby characters with substr.
I'd provide an example, but it's not clear exactly what you
are going to do once you find a match.
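For what it's worth, the slurp-then-match idea described above could be sketched roughly as follows. The helper names slurp_file and end_of_word and the sample string are made up for illustration; only $/, readline via <>, pos and substr come from the description above.

```perl
use strict;
use warnings;

# Read a whole file into one string by locally undefining $/
# (the input record separator). slurp_file is a hypothetical helper name.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                       # undef: <$fh> now returns the entire file
    my $text = <$fh>;
    close $fh;
    return defined $text ? $text : '';
}

# Return the offset just past the first whole-word occurrence of $word,
# or undef if it does not occur. end_of_word is a hypothetical helper name.
sub end_of_word {
    my ($text, $word) = @_;
    if ($text =~ /\b\Q$word\E\b/g) {    # scalar context with /g sets pos()
        return pos($text);
    }
    return;
}

# Demonstration on an in-memory string; a real script would pass the
# result of slurp_file($path) instead.
my $sample = "find the hidden string";
my $end    = end_of_word($sample, 'hidden');
print "match ends at offset $end; next characters: '",
      substr($sample, $end, 7), "'\n";
```

Once the offset is known, substr can look at whatever follows (or precedes) the match without reading the file again.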
| [reply] |
|
|
Thanks a lot. I can now locate the word I was searching for and get its position index. Now I need to move forward past some blank spaces and 4 letters to grab five digits. How do I do that, and where can I find an example? Thanks.
| [reply] |
|
|
The easiest way is to adjust pos($yourstring) if needed and then do a scalar match with the //g or //gc flags. Untested:
# find word
if ($string =~ /\bword\b/g) {
    # skip forward 12 characters from the end of the word (just an example)
    pos($string) = pos($string) + 12;
    # skip forward over whitespace and grab 5 digits; \G anchors the match at pos
    if ($string =~ /\G\s*(\d{5})/gc) {
        print "here are your digits: $1\n";
        # now pos points to after the digits
    } else {
        # pos is unchanged (due to the /c flag); try something else
    }
}
| [reply] [d/l] |
|
|
Re: Find a specific word in a text file
by CountZero (Bishop) on Sep 09, 2004 at 06:22 UTC
|
I see you are new to our Monastery, so let me bid you welcome, and may your search for more Perl wisdom be fruitful.
It is the custom in our Monastery that you show the code you have written to solve your problem, or otherwise show us that you have at least tried to solve it. The Monastery is not a "Solve-my-homework-for-me" service!
That being said, did you try the m// operator? It tries to match a regular expression against a string and returns true if the match succeeds:
use strict;
my $text='Try to find the hidden string here!';
if ($text=~m/\bhidden\b/) {
print "We found the hidden string!\n";
}
else {
print "No match, sorry.\n";
}
Some explanation:
- =~ binds the match operator to the $text variable, so it knows where to look.
- m/ / is the match operator.
- \b matches on any boundary between a word and not a word, i.e. the beginning or end of a word, so you are certain not to look for part of words.
- hidden is of course the string to search for.
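To see what the \b anchors buy you, here is a small sketch. The helper name has_whole_word and the second test string are made up for illustration; the pattern is the same \b...\b idea as in the example above.

```perl
use strict;
use warnings;

# Returns 1 if $word occurs in $text as a whole word, 0 otherwise.
# has_whole_word is a hypothetical helper name.
sub has_whole_word {
    my ($text, $word) = @_;
    # \Q...\E quotes any regex metacharacters in $word
    return $text =~ /\b\Q$word\E\b/ ? 1 : 0;
}

# "hidden" stands alone here, so this prints 1
print has_whole_word('Try to find the hidden string here!', 'hidden'), "\n";

# "hidden" is only part of "unhidden" here, so this prints 0
print has_whole_word('The unhidden string', 'hidden'), "\n";
```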
Now for the other part of your question: What other word do you need to grab? Do you know beforehand what that word will be, or is it just any word before or after the first word you found? Why do you want to analyse it character by character? That seems a bit artificial.
Update: as japhy and ysth told me, the marker for the word boundary is \b, not \w. I have updated the example program above. What do we learn from such errors? Never try to do any serious work before your second cup of strong tea!
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] [d/l] [select] |
|
|
You mean \b, not \w.
_____________________________________________________
Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
| [reply] |
|
|
Thank you for your input countzero, I appreciate it.
| [reply] |
|
|
Thanks CZ, I found "Today's ID" via
while (<>) {
    if (/Today's ID/) {
        print $_;
    }
}
Now I need to move some blank spaces and 4 letters forward to grab five digits.
(Sorry, I lose the formatting of my message when I submit it.)
| [reply] |
Re: Find a specific word in a text file
by davido (Cardinal) on Sep 09, 2004 at 06:26 UTC
|
Try to be more specific and detailed about what you need. "move forward or backward to grab another word nearby analysing character by character" could mean a lot of things, and the implementation will change depending on what it is that actually means.
However, let's assume that you've got the entire webpage in $page:
if ( ( my $position = index $page, "word" ) >= 0 ) {
print "Found at $position.\n";
}
Once you've told us what you mean by that second part to the question we can help you to figure out how to "go forward or backward". ...it may be that the whole thing belongs in a regexp anyway.
| [reply] [d/l] [select] |
|
|
Thanks for the help Dave. I've tried it
use LWP::Simple;
my $page = get("http://www.google.de");
if ( ( my $position = index $page, "font-family" ) >= 0 ) {
print "Found at $position.\n";
}
but it did not produce any results, even though the source of the page contains:
</title><style><!--body,td,a,p,.h{font-family:arial,sans-serif;}
| [reply] [d/l] [select] |
|
|
my $page = get("http://www.whoever.de");
die "Couldn't get the page!\n" unless defined $page;
It could be that you're not even succeeding in fetching the page. Next, dump $page into a text file where you can examine it later to see if it really contains the text you're looking for (without any line-breaks, etc). If your HTML parsing needs get fairly elaborate, you might want to look at HTML::TokeParser anyway; use a powertool when a powertool is needed.
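Dumping the page for later inspection might look something like this. It is only a sketch: dump_page and the file name page_dump.html are made-up names, and the UTF-8 output layer is an assumption about what the page contains.

```perl
use strict;
use warnings;

# Write the fetched page to a file so it can be examined in an editor.
# dump_page is a hypothetical helper name.
sub dump_page {
    my ($page, $path) = @_;
    # Assumption: writing as UTF-8 is acceptable for the fetched content.
    open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
    print {$out} $page;
    close $out or die "Cannot close $path: $!";
    return $path;
}

# Typical use, after the defined() check shown above:
#   dump_page($page, 'page_dump.html');
```

Opening the dump in an editor shows whether the text you are searching for is really there, and whether it is broken up by line breaks or markup.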
| [reply] [d/l] |
|
|
|
|
OK Dave now it works: Found at 5477.
Now I need to move some blank spaces and 4 letters forward to grab five digits. Any ideas?
| [reply] |
|
|
davido has given you good advice. I would go further and say _don't_ parse HTML with a regex, even in apparently straightforward cases.
There is often (always?) shed loads of arbitrary white space which can easily defeat a regex. The HTML can be 'loose' or 'strict', and it can change every day!
Once you've used HTML::TokeParser (if I can, anybody can!) you'll be able to reuse the code in any future apps.
Have a look at this tutorial. There is an example here. Search and Supersearch will find many more.
It does seem like a lot of trouble if you are in a hurry! But I assure you the effort will pay dividends.
Best of luck, wfsp
| [reply] |
|
|
|
|
I wish you had just gone ahead and asked the whole question at once as we requested. Breaking a single question into pieces and only feeding us one piece at a time might seem to be a good approach to you, but trust me, we can take bigger bites. We don't want to write your script for you, but if we're going to answer a question, at least let us answer the complete question.
You really should be using HTML::TokeParser. Nevertheless, the following will use a fragile regexp to find a keyword, and grab the digits that immediately follow it (whitespace optional).
if( $page =~ m/keyword\s*(\d+)/ ) {
print "Found the keyword, and retrieved a value of $1\n";
}
Now if your HTML has multiple instances of this keyword, you'll have to ask us another question, or read perlretut and perlrequick.
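For the multiple-instances case, one possible approach (a sketch, not the only way) is to run the match in list context with the /g flag, which returns every captured group. The helper name all_values_after and the sample string are made up for illustration, and whitespace after the keyword is treated as optional here.

```perl
use strict;
use warnings;

# Collect the digits that follow every occurrence of $keyword.
# all_values_after is a hypothetical helper name.
sub all_values_after {
    my ($page, $keyword) = @_;
    # List context plus /g returns all captures, one per match
    my @values = $page =~ /\Q$keyword\E\s*(\d+)/g;
    return @values;
}

my @found = all_values_after('keyword 123 and later keyword456', 'keyword');
print "Found: @found\n";    # prints "Found: 123 456"
```

perlretut covers the difference between scalar and list context for //g in more detail.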
By the way, if you're screen-scraping Google you are violating their Terms of Service, and exposing yourself to civil liability. From the Google Terms of Service page:
No Automated Querying
You may not send automated queries of any sort to Google's system without express permission in advance from Google. Note that "sending automated queries" includes, among other things:
- using any software which sends queries to Google to determine how a website or webpage "ranks" on Google for various queries;
- "meta-searching" Google; and
- performing "offline" searches on Google.
Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted.
| [reply] [d/l] |
|
|
Unless I'm mistaken, it appears that what you're trying to do can be solved with a fairly simple regular expression. Unless you actually need the index, try:
if($page =~ m/\bword\b\s+.{4}(\d{5})/)
{
print "The number is $1";
}
else
{
print "No match";
}
From left-to-right, the expression states:
\bword\b - Find the text 'word'
\s+ - Followed by one or more whitespace characters
.{4} - Followed by any four characters
(\d{5}) - Followed by five digits
The parentheses around the '\d{5}' instruct perl to capture that part of the match, so the matched digits are stored in the variable $1.
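If the four characters are known to be letters, a slightly stricter variant is possible. This is only a sketch: digits_after is a made-up helper name, the sample line is an invented guess at what the input looks like, and the (?!\d) at the end is an extra assumption that rejects runs longer than five digits.

```perl
use strict;
use warnings;

# Return the five digits that follow $word, some whitespace and four
# letters, or undef if the text does not have that shape.
# digits_after is a hypothetical helper name.
sub digits_after {
    my ($text, $word) = @_;
    return $text =~ /\b\Q$word\E\b\s+[A-Za-z]{4}(\d{5})(?!\d)/ ? $1 : undef;
}

# Invented sample input; the real page may look different.
my $line   = "Today's ID   ABCD12345 end";
my $digits = digits_after($line, "Today's ID");
print defined $digits ? "The number is $digits\n" : "No match\n";
```

Using [A-Za-z]{4} instead of .{4} means stray punctuation or digits in those four positions cause a clean "No match" rather than a wrong number.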