Terms of Agreement:
By using this article, you agree to the following terms...
1) You may use this article in your own programs (and may compile it into a program and distribute it in compiled format for languages that allow it) freely and with no charge.
2) You MAY NOT redistribute this article (for example to a web site) without written permission from the original author. Failure to do so is a violation of copyright laws.
3) You may link to this article from another website, but ONLY if it is not wrapped in a frame.
4) You will abide by any additional copyright restrictions which the author may have placed in the article or article's description.
From Inside Visual Basic Magazine, March 2000
Reposted with Permission of ZD Net Journals
There's no arguing that the Internet lets us access
amazing volumes of information on virtually any subject. However, if you're like
us, you may have found it difficult to filter out unnecessary information from
this enormous repository. Gathering specific facts can be time-consuming, with
data usually scattered across many sites. Search engines like Yahoo!, HotBot,
and even Ask Jeeves have attempted to fill this void, but have been only
partially successful. A recent study found that search engines have indexed less
than 55 percent of the Web. The same study predicted that this percentage will
continue to shrink as the number of new pages on the Internet grows.
In the future, people will probably turn to personal,
automated search programs to find what they need. These Web-bots provide more
targeted and thorough searches. In this article, we'll look at the Web-bot shown
in Figure A, which lets you research any topic on the Internet. Then, we'll
cover a few of the basics you'll need to create a Web-bot fit to rival Jeeves
himself!
To boldly go where no Web-bot has gone before
We included both the Web-bot's project files and a compiled
EXE in this month's download. For now, launch the EXE. To begin, enter the
subject you want to research in the Subject text box. For our example, we
satisfied our Star Trek craving.
Next, indicate how thorough a search you want the bot to
conduct in the Search Depth text box. High numbers make for in-depth searches,
but take longer to complete. Lower numbers are less thorough but complete much
more quickly. If you have a slow Internet connection and only a few minutes to
run the Web-bot, consider entering a 2 or 3. If you have a fast Internet
connection or a lot of time (for example, you may be running the program
overnight), enter a higher number like 9 or 10. The Web-bot doesn't care how high you make
this number. As you can see in Figure A, we entered 3 for our search depth.
Full speed ahead, botty
Now, select the Show In Browser check box. This option lets
you monitor the bot's progress in the right browser window. The other browsing
check box, Stop Each Page, pauses the Web-bot after each page to allow you to
monitor the results. Chances are, if you want to run the bot unattended, you
won't want to use this option.
Finally, tell the Web-bot where to start. Search engines
can be good launching points, so if you want to start with one of these, choose
the corresponding option button. If you want to start at a custom URL, click the
Custom URL option button, and then enter the URL in the text box.
Now that we've set the Web-bot's options, we're ready to
launch it. To do so, click Start Search, and then click Yes when the program
asks if you're conducting a new search. That done, the Web-bot races ahead at
warp speed, looking for the information you requested. (OK, that's the last of
the Star Trek references, promise!)
At any time, if you wish to take a closer look at a URL,
just click the Pause button. Then, find a URL in the treeview and right-click on
it. Doing so transports the page into the browser on the right side. The program
also logs email addresses, as well as the URLs, in a local Access 97 database
for your later perusal. We called this database WebAgent.mdb.
The anatomy of a Web-bot
Now that we've looked at a working Web-bot, let's take a
look at some of the necessary features that you'll need when you create your
own. For space considerations, we won't get into the form's exact design.
However, Figure A should provide a blueprint for your own layout.
In addition to the controls visible at runtime, Figure B
shows the few controls not visible. As you can see, we've placed an ImageList
and Inet control on the form. Also, the larger box at the very bottom is an
RTFTextbox control. Finally, note that in the main body of the Web-bot, we used
a Treeview to list the Web sites and email addresses, and a Browser control to
display the pages. Now, let's take a look at the more complex features.
Figure B: We'll import HTML pages into the
RTFTextbox control, and then use its Find method to search the HTML for the
selected topic.
Navigating to a Web page
The program loads Internet Web pages with two controls. The Inet control
fetches a page's raw HTML through its OpenURL method, and the WebBrowser control
from the Microsoft Internet Controls library (shdocvw.oca) displays the page
through its Navigate method. To use them, simply drop both controls onto a form.
In our Web-bot, the function mNavigateToURL wraps both calls, and also provides
time-out error trapping and the code to transfer the raw HTML to the RTFTextbox
control for later use. Listing A shows the code for this procedure. Note that
vstrURL contains the URL that the Web-bot is currently analyzing.
Listing A: Navigating to a URL
Function mNavigateToURL(ByRef rIntInternetControl As Inet, _
        ByRef rbrwsBrowserControl As WebBrowser, _
        ByRef rrtfTextBox As RichTextBox, _
        ByRef vstrURL As String) As Boolean
    'set default
    mNavigateToURL = False
    On Error GoTo lblOpenError
    rIntInternetControl.URL = vstrURL
    rIntInternetControl.AccessType = icDirect
    frmWebBot.sbWebBot.Panels(1).Text = "Loading " _
        & vstrURL & "..."
    'OpenURL waits for the page and returns its raw HTML
    rrtfTextBox.Text = rIntInternetControl.OpenURL
    frmWebBot.sbWebBot.Panels(1).Text = ""
    On Error GoTo 0
    'show the page in the browser pane only if the user asked for it
    If (frmWebBot.chkShowInBrowser.Value = vbChecked) Then
        rbrwsBrowserControl.Navigate vstrURL
    End If
    mNavigateToURL = True
    Exit Function
lblOpenError:
    'any failure leaves the return value at its default of False
    Select Case (Err.Number)
        Case 35761
            'timeout
        Case Else
            'any other load error
    End Select
End Function
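If you'd like to experiment with the same idea outside Visual Basic, the fetch-and-report-success pattern of Listing A can be sketched in Python. This is only an illustration, not the Web-bot's code: the name fetch_page, the 10-second default time-out, and the UTF-8 fallback are our assumptions, and Python's standard urllib stands in for the Inet control.

```python
import urllib.request


def fetch_page(url, timeout=10.0):
    """Return (ok, html). ok is False on a time-out or any other load failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Assumption: fall back to UTF-8 when no charset is declared.
            charset = response.headers.get_content_charset() or "utf-8"
            html = raw.decode(charset, errors="replace")
    except (OSError, ValueError, LookupError):
        # URLError and socket time-outs are OSError subclasses, a URL with no
        # scheme raises ValueError, and an unknown charset raises LookupError.
        # Like Listing A, report the failure instead of raising.
        return False, ""
    return True, html
```

As in Listing A, the caller only has to test the flag, for example `ok, html = fetch_page(url)`, and can skip the page when ok is False.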
Displaying Web pages
Once the Inet control loads a page, the Web-bot needs to
display it in the right pane of the main control panel. The Microsoft Web
Browser control (part of the Microsoft Internet Controls library) makes it very
easy to do so. As you saw in Listing A, the following line causes the
browser to display the current page:
rbrwsBrowserControl.Navigate vstrURL
Analyzing a page
After loading and displaying a page, the Web-bot reads it.
Our particular Web-bot requires two different pieces of information:
- The email addresses located on the page.
- The links that exit the page, so the Web-bot can
continue its journey.
As you'll recall from mNavigateToURL, the
Web-bot stores the raw HTML for the page in a Rich Text Box control, rrtfTextBox.
The control's built-in Find method allows the Web-bot to perform
some rudimentary searching, but the procedure must also locate a specific
starting and ending delimiter in the HTML document and extract the text that
lies in between. We created the mExtractHTML function in Listing B to
accomplish this task. If it finds what it's looking for, it returns the HTML
contents. Otherwise, it returns the empty string, or the string "ERROR" if a
run-time error occurs along the way.
Listing B: The mExtractHTML function
Function mExtractHTML(ByVal vstrStartDelimiter As String, _
        ByVal vstrEndDelimiter As String, _
        ByRef rrtfHtml As RichTextBox, _
        ByRef rrlngPageIndex As Long) As String
    Dim lngStringStart As Long
    Dim lngStringEnd As Long
    'default: nothing found
    mExtractHTML = ""
    On Error GoTo lblError
    If (vstrStartDelimiter <> "") Then
        'normal
        rrlngPageIndex = rrtfHtml.Find(vstrStartDelimiter, _
            rrlngPageIndex + 1)
        'Find returns -1 when the text isn't there
        If (rrlngPageIndex = -1) Then Exit Function
        lngStringStart = rrlngPageIndex + _
            Len(vstrStartDelimiter)
    Else
        'start at current position
        lngStringStart = rrlngPageIndex
    End If
    'find ending delimiter
    rrlngPageIndex = rrtfHtml.Find(vstrEndDelimiter, _
        lngStringStart + 1)
    If (rrlngPageIndex = -1) Then Exit Function
    lngStringEnd = rrlngPageIndex - 1
    'extract text
    rrtfHtml.SelStart = lngStringStart
    rrtfHtml.SelLength = lngStringEnd - lngStringStart + 1
    mExtractHTML = rrtfHtml.SelText
    'set output value
    rrlngPageIndex = lngStringEnd + Len(vstrEndDelimiter)
    On Error GoTo 0
    Exit Function
lblError:
    mExtractHTML = "ERROR"
End Function
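The delimiter search is easier to see on a plain string. Here's a minimal Python sketch of the same technique, with Python's str.find standing in for the Rich Text Box control's Find method. The name extract_between is hypothetical. Unlike Listing B, it hands the new position back as a second return value and leaves the position unchanged when nothing is found.

```python
def extract_between(html, start_delim, end_delim, index=0):
    """Return (text, next_index) for the first delimited span at or after index.

    text is "" when either delimiter is missing; next_index is then the
    unchanged index.
    """
    if start_delim:
        found = html.find(start_delim, index)
        if found == -1:
            return "", index
        start = found + len(start_delim)
    else:
        # An empty start delimiter means "start at the current position".
        start = index
    end = html.find(end_delim, start)
    if end == -1:
        return "", index
    # Resume the next search just past the end delimiter.
    return html[start:end], end + len(end_delim)
```

Calling it in a loop, and feeding each returned position into the next call, walks through every match on the page:

```python
page = '<a href="a.html">A</a> <a href="b.html">B</a>'
text, pos = extract_between(page, 'href="', '"')
while text:
    print(text)
    text, pos = extract_between(page, 'href="', '"', pos)
```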
The functions mcolGetAllUrlsInPage and mcolExtractAllEmailAddressesOnPage
build on the previous function and return the links or email addresses
(respectively) back to the calling routine via a collection. These functions are
smart enough to remove links and email addresses that might appear valid to a
less sophisticated Web-bot, but really wouldn't be applicable. For example, most
email addresses to mailing lists are of the format subscribe@somedomain.com. The
routine weeds these out. Other examples of screened email addresses include
sales@somedomain.com and support@somedomain.com.
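To give a feel for this kind of screening, here's a rough Python sketch that collects links and email addresses and drops the mailbox names mentioned above. It isn't the code behind mcolGetAllUrlsInPage or mcolExtractAllEmailAddressesOnPage. The function names, the simple email pattern, and the choice to keep only http and https links are all our assumptions.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Mailbox names the article says the Web-bot screens out.
SCREENED_MAILBOXES = {"subscribe", "sales", "support"}
# Assumption: a deliberately simple pattern, not a full address grammar.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def get_all_urls(html, base_url):
    """Return absolute http(s) links in page order, without duplicates."""
    collector = _LinkCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        if absolute.startswith(("http://", "https://")) and absolute not in urls:
            urls.append(absolute)
    return urls


def extract_email_addresses(html):
    """Return lower-cased addresses in page order, minus screened mailboxes."""
    addresses = []
    for match in EMAIL_PATTERN.findall(html):
        address = match.lower()
        mailbox = address.split("@", 1)[0]
        if mailbox not in SCREENED_MAILBOXES and address not in addresses:
            addresses.append(address)
    return addresses
```

Both functions return plain lists, which play the role of the collections that the Visual Basic functions hand back to the calling routine.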
Avoiding infinite loops
Some pages either link back to themselves or link to other
pages that eventually loop back to the original page. If a Web-bot doesn't keep
an eye out for such pages, it can easily fall into an infinite loop. To avoid
this trap, our Web-bot does two things. First, it uses the function mSaveVisitedUrl
to store every URL in the Access database. As you can see if you view the code
in this month's download, this function uses standard ADO code for saving data
to a database.
Second, before going to any new URL, it determines whether it has
already visited the page. To do so, it calls mblnAlreadyVisiting,
shown in Listing C. If the database contains the URL, then the Web-bot skips the
page, thus short-circuiting the infinite loop.
Listing C: Code to detect duplicate URL
Function mblnAlreadyVisiting(ByVal vstrURL As String) As Boolean
    Dim objConnection As ADODB.Connection
    Dim objRecordset As ADODB.Recordset
    Dim strSQL As String
    'default: treat the URL as not yet visited
    mblnAlreadyVisiting = False
    'connect to database
    ConnectToDatabase objConnection
    'double any apostrophes so a URL can't break the SQL string
    strSQL = "SELECT * FROM WebBot_Visited_Url " _
        & "WHERE url='" & Replace(vstrURL, "'", "''") & "'"
    Set objRecordset = New ADODB.Recordset
    On Error GoTo lblOpenError
    objRecordset.Open strSQL, objConnection, _
        adOpenForwardOnly, adLockPessimistic
    On Error GoTo 0
    'any returned row means the URL is already in the log
    mblnAlreadyVisiting = Not objRecordset.EOF
    'close recordset
    objRecordset.Close
    Set objRecordset = Nothing
    DisconnectFromDatabase objConnection
    Exit Function
lblOpenError:
    'query failed: return the default of False
End Function
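The same visited-URL check can be sketched with Python's built-in sqlite3 module standing in for the Access database and ADO. The table name visited_url and the three function names are hypothetical, and two choices here are ours rather than the Web-bot's: the query uses a ? placeholder instead of building the SQL string by hand, and a primary key on the URL lets the database itself reject duplicates.

```python
import sqlite3


def open_url_log(path=":memory:"):
    """Open (or create) the visited-URL log. Table name is hypothetical."""
    connection = sqlite3.connect(path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS visited_url (url TEXT PRIMARY KEY)"
    )
    return connection


def already_visited(connection, url):
    # The ? placeholder passes the URL as data, so quotes in it are harmless.
    row = connection.execute(
        "SELECT 1 FROM visited_url WHERE url = ?", (url,)
    ).fetchone()
    return row is not None


def save_visited_url(connection, url):
    # INSERT OR IGNORE relies on the primary key to drop duplicates.
    connection.execute(
        "INSERT OR IGNORE INTO visited_url (url) VALUES (?)", (url,)
    )
    connection.commit()
```

Passing a file name instead of the default ":memory:" keeps the log on disk between runs.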
Resuming operation after stopping
Should anything unforeseen happen during a Web-bot search,
such as the operating system crashing or the computer getting switched off, the
search would normally have to be rerun from scratch. That's not a
happy prospect for someone who is a few hours, or days, into a search, so the
Web-bot code is built to handle this contingency.
To let you resume a search, the Web-bot uses
the same URL log that protects against infinite loops to keep track of the
URL it's currently visiting. If the application gets shut down prematurely, it
simply picks up where it left off.
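Here's one way the pieces could fit together, sketched in Python: a crawl loop that skips URLs it has already seen, stops following links past a given depth, and saves its progress after every page so a later run can resume. Treat it as a sketch under stated assumptions, not the Web-bot's algorithm. We assume search depth means the number of link hops from the starting page, we keep the state in a JSON file rather than the Access database, and the names crawl, get_links, state_path, and max_pages are hypothetical.

```python
import json
import os
from collections import deque


def crawl(start_url, max_depth, get_links, state_path, max_pages=None):
    """Breadth-first crawl that saves its progress after every page.

    get_links(url) returns the URLs a page links to. If state_path already
    exists, the crawl resumes from it instead of starting over.
    Returns the list of every URL visited so far.
    """
    if os.path.exists(state_path):
        with open(state_path) as handle:
            state = json.load(handle)
    else:
        state = {"visited": [], "queue": [[start_url, 0]]}
    visited = state["visited"]
    queue = deque(state["queue"])
    pages_this_run = 0
    while queue and (max_pages is None or pages_this_run < max_pages):
        url, depth = queue.popleft()
        if url in visited:
            continue  # already seen: this is what breaks link cycles
        visited.append(url)
        pages_this_run += 1
        if depth < max_depth:
            for link in get_links(url):
                if link not in visited:
                    queue.append([link, depth + 1])
        # Save after each page so an interruption loses little work.
        with open(state_path, "w") as handle:
            json.dump({"visited": visited, "queue": list(queue)}, handle)
    return visited
```

The max_pages argument exists only to simulate an interruption. Calling crawl again with the same state_path continues from the saved queue.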
Conclusion
Web-bots make the Web infinitely more useful because they
allow you to pull in more information than a mere search engine, and allow you
to gather the information into a useful format. The uses for a Web-bot are only
limited by your imagination, and with this article, you now have the tools to
build whatever you can dream.
Other User Comments
5/3/2000 10:33:00 PM: Nova
Doh! This program looks really cool + totally useful, but the 'ADO cant find
the specified provider' ! Help Please?!
5/3/2000 11:55:26 PM: Nova
Ahh, doesn't matter. Did a temp fix with a bit of error handling. By the way
Ian, this Web bot is really excellent.
5/4/2000 9:19:16 PM: Ian Ippolito
Thanks Nova! If you care, you can probably download the latest version of ADO
from Microsoft (search for MDAC on their site).
Ian
5/7/2000 12:27:25 AM: Messiacle
Help me!! when i try to open the webbot, it says this:component comctl32.ocx
or one of it dependencies not correctly registered: a file is missing or
invalid. also, when i try to open the project, it says it is corrupt. i have
vb4, 32 bit. thanks!
5/8/2000 6:03:15 PM: Ian Ippolito
Mesiacle,
The project was written in vb6 and requires either that or vb5. Sorry...I'll
update the compatibility of the submission.
Ian
5/29/2000 1:04:51 AM: masika
Hmm your program sounds grate! But i get the ADO thing too. I would wonder do
you need Excel or access installed? I updated my ADO stuff it still says that..
I have vb 6 and vb 5 neither make it work.. Please help.
5/29/2000 11:34:33 AM: Ian Ippolito
Masika,
The program doesn't require Access or Excel...so I'm not sure what the problem
is. I would suggest writing to Nova (39080517@pager.icq.com) who ran into the
same thing and fixed it...hopefully he can post his solution here.
Ian
5/30/2000 7:18:20 PM: Masika
There is one error the ADO error I updated my ADO recently and it still didn't
work.
6/9/2000 12:21:55 AM: etrask
I get an error that says my msinet.ocx is out of date or i dont have it
i know i have it and i updated it too but still no fix
i want to use this program! looks cool! could use to...um...never mind
6/9/2000 12:27:42 AM: Nova
The only thing I did was put in some On Error Resume Next whereever the
database code was bringup an error. Although using this method will mean you
will have to forfit the database functionality.
Note: This fix is only for run time errors.
6/9/2000 5:54:58 PM: Ian Ippolito
Etrask,
I'll email you the latest version.
Ian
6/11/2000 2:17:42 AM: etrask
Hey Ian,
I can run the program now but I got a lot of errors in the actual code which
is weird. Anyway, I'll do some error handling. A few suggestions: make it so
that when you minimize it, it goes into the systray. And, make the border type
a fixed single. Sorry I'm telling you how to make your program. anyway, really
nice peice of code!
6/22/2000 1:38:37 PM: Manu
Great code. Thanx Ian!!
6/28/2000 11:08:29 PM: Lyndon
I ran the code and it ran picture perfect right from the start. Everyone Is
always going to Config problems with the differnt Equipment. I recomend this as
an Excellent Piece of Code. Thank You
7/11/2000 8:02:37 AM: Vince Agwada
This is a great piece of code. Thanks!
7/12/2000 2:00:33 PM: Detonate
excellent job, Ian
4/10/2001 7:33:13 PM: etrask
I've had this problem for quite a while so if someone could help me I would
really appreciate it. I'm using VB 5.0 Pro edition. I can never get a label to
stay above a frame. The frame always goes over it. In this program there are
tons of labels that are above the frame yet I can't find out how. If you could
help me, thank you. (Please send the solution to my email, listed above.)
6/13/2001 4:42:03 PM: Brian PEal
Ian, I came across a problem in my project which uses the OpenURL command of
ITC. I noticed that your program does not account for this potential issue.
When you use the OpenURL(destination) and the destination redirects you to
another web site, how do you get the URL of the new destination?
Example:
6/14/2001 12:08:20 AM: Ian Ippolito
Brian,
You're right. When I was writing this code, I didn't know of any way to get
this information from the URL control...it doesn't seem to expose any
properties to allow you to detect this situation. Perhaps someone else on this
site knows of a solution?
Ian
6/16/2001 2:09:41 PM: FTWJFIA
I got this and it worked fine right off the start, but after searching through
about 30 sites it poped up with a critical error about database overflow or
something like that. is this the same error everyone else has been getting??
6/25/2001 5:49:47 PM: Janus
I would'nt make the code so it uses mblnAlreadyVisiting! An easy way to
accomplish this task is to make a primary key in the db based on the url! That
way no urls would be dubs, and the performance would be much better, cause no
need for making a select-statement, connection and do the logic!
Regardz
Janus
6/25/2001 6:10:12 PM: Ian Ippolito
Janus,
There is always more than 1 way to skin a cat. That would certainly work.
Ian
7/7/2001 2:43:14 AM: Dr. Frost
Is there a way, or should I say, How do you limit the number of hits (say,
300), and then rank them by relevance?
BTW, sweet code!
7/7/2001 1:10:20 PM: Ian Ippolito
Currently no...I'll leave relevance ranking to someone else here who wants to
take on this challenge?
Ian
8/30/2001 2:09:31 PM: karen
Is it possible to change the webagent user-interface? And how???