Create Your Own Personal Internet Web-bot

Submitted on: 4/21/2000 7:51:48 PM
By: Ian Ippolito (psc)  
Level: Intermediate
     Learn how to use the Microsoft Internet Control to create your own custom web-bots that can scour the web for whatever information you're looking for!

This article has accompanying files

 
 
Terms of Agreement:   
By using this article, you agree to the following terms...   
1) You may use this article in your own programs (and may compile it into a program and distribute it in compiled format for languages that allow it) freely and with no charge.   
2) You MAY NOT redistribute this article (for example to a web site) without written permission from the original author. Failure to do so is a violation of copyright laws.   
3) You may link to this article from another website, but ONLY if it is not wrapped in a frame. 
4) You will abide by any additional copyright restrictions which the author may have placed in the article or article's description.
From Inside Visual Basic Magazine,March 2000
Reposted with Permission of ZD Net Journals

There's no arguing that the Internet lets us access amazing volumes of information on virtually any subject. However, if you're like us, you may have found it difficult to filter out unnecessary information from this enormous repository. Gathering specific facts can be time consuming, with data usually scattered across many sites. Search engines like Yahoo!, HotBot, and even Ask Jeeves, have attempted to fill this void, but have been only partially successful. A recent study found that search engines have indexed less than 55 percent of the Web. The same study predicted that this percentage would in fact continue to shrink as the number of new pages on the Internet grows.

In the future, people will probably turn to personal, automated search programs to find what they need. These Web-bots provide more targeted and thorough searches. In this article, we'll look at the Web-bot shown in Figure A, which lets you research any topic on the Internet. Then, we'll cover a few of the basics you'll need to create a Web-bot fit to rival Jeeves himself!

To boldly go where no Web-bot has gone before

We included both the Web-bot's project files and a compiled EXE in this month's download. For now, launch the EXE. To begin, enter the subject you want to research in the Subject text box. For our example, we satisfied our Star Trek craving.

Next, indicate how thorough a search you want the bot to conduct in the Search Depth text box. Higher numbers make for more in-depth searches that take longer to complete. Lower numbers are less thorough but complete much more quickly. If you have a slow Internet connection and only a few minutes to run the Web-bot, consider entering 2 or 3. If you have a fast Internet connection or a lot of time (for example, you may be running the program overnight), enter a higher number like 9 or 10. The Web-bot places no upper limit on this number. As you can see in Figure A, we entered 3 for our search depth.

Full speed ahead, botty

Now, select the Show In Browser check box. This option lets you monitor the bot's progress in the right browser window. The other browsing check box, Stop Each Page, pauses the Web-bot after each page to allow you to monitor the results. Chances are, if you want to run the bot unattended, you won't want to use this option.

Finally, tell the Web-bot where to start. Search engines can be good launching points, so if you want to start with one of these, choose the corresponding option button. If you want to start at a custom URL, click the Custom URL option button, and then enter the URL in the text box.

Now that we've set the Web-bot's options, we're ready to launch it. To do so, click Start Search, and then click Yes when the program asks if you're conducting a new search. That done, the Web-bot races ahead at warp speed, looking for the information you requested. (OK, that's the last of the Star Trek references, promise!)

At any time, if you wish to take a closer look at a URL, just click the Pause button. Then, find a URL in the treeview and right-click on it. Doing so transports the page into the browser on the right side. The program also logs email addresses, as well as the URLs, in a local Access 97 database for your later perusal. We called this database WebAgent.mdb.

The anatomy of a Web-bot

Now that we've looked at a working Web-bot, let's take a look at some of the necessary features that you'll need when you create your own. For space considerations, we won't get into the form's exact design. However, Figure A should provide a blueprint for your own layout.

In addition to the controls visible at runtime, Figure B shows the few controls that aren't visible. As you can see, we've placed an ImageList and an Inet control on the form. Also, the larger box at the very bottom is a RichTextBox control. Finally, note that in the main body of the Web-bot, we used a TreeView to list the Web sites and email addresses, and a WebBrowser control to display the pages. Now, let's take a look at the more complex features.

Figure B: We'll import HTML pages into the RichTextBox control, and then use its Find method to search the HTML for the selected topic.

Navigating to a Web page

The program loads Web pages with the Microsoft Internet Transfer control (the Inet control, msinet.ocx). To use it, drop the control onto a form, set its URL property, and call the OpenURL method, which returns the page's raw HTML. In our Web-bot, the function mNavigateToURL accomplishes this task. It also provides time-out error trapping, transfers the raw HTML to the RichTextBox control for later use, and, when Show In Browser is selected, displays the page with the WebBrowser control's Navigate method. Listing A shows the code for this procedure. Note that vstrURL contains the URL that the Web-bot is currently analyzing.

Listing A: Navigating to a URL

Function mNavigateToURL(ByRef rIntInternetControl As Inet, _
        ByRef rbrwsBrowserControl As WebBrowser, _
        ByRef rrtfTextBox As RichTextBox, _
        ByVal vstrURL As String) As Boolean

    'set default
    mNavigateToURL = False

    On Error GoTo lblOpenError

    'download the raw HTML with the Inet control
    rIntInternetControl.URL = vstrURL
    rIntInternetControl.AccessType = icDirect
    frmWebBot.sbWebBot.Panels(1).Text = "Loading " & vstrURL & "..."
    rrtfTextBox.Text = rIntInternetControl.OpenURL
    frmWebBot.sbWebBot.Panels(1).Text = ""

    On Error GoTo 0

    'optionally show the page in the browser pane
    If (frmWebBot.chkShowInBrowser = vbChecked) Then
        rbrwsBrowserControl.Navigate vstrURL
    End If

    mNavigateToURL = True
    Exit Function

lblOpenError:
    'any failure leaves the return value False so the caller can skip the page
    frmWebBot.sbWebBot.Panels(1).Text = ""
    Select Case (Err.Number)
        Case 35761
            'timeout
        Case Else
    End Select
End Function

Displaying Web pages

Once the Inet control loads a page, the Web-bot needs to display it in the right pane of the main window. The Microsoft WebBrowser control (part of the Microsoft Internet Controls library, shdocvw.dll) makes it very easy to do so. The following code causes the browser to display the current page:
rbrwsBrowserControl.Navigate vstrURL

Analyzing a page

After loading and displaying a page, the Web-bot reads it. Our particular Web-bot requires two different pieces of information:

  • The email addresses located on the page.
  • The links that exit the page, so the Web-bot can continue its journey.

As you'll recall from mNavigateToURL, the Web-bot stores the raw HTML for the page in a RichTextBox control, rrtfTextBox. The control's built-in Find method allows the Web-bot to perform some rudimentary searching, but the procedure must also parse the HTML document from a specific starting and ending delimiter, and extract the text that lies in between. We created the mExtractHTML function in Listing B to accomplish this task. If it finds what it's looking for, it returns the HTML contents. Otherwise, it returns the empty string, or the string "ERROR" if a runtime error occurs along the way.

Listing B: The mExtractHTML function

Function mExtractHTML(ByVal vstrStartDelimiter As String, _
        ByVal vstrEndDelimiter As String, _
        ByRef rrtfHtml As RichTextBox, _
        ByRef rrlngPageIndex As Long) As String

    Dim lngStringStart As Long
    Dim lngStringEnd As Long

    'default: nothing found
    mExtractHTML = ""

    On Error GoTo lblError

    If (vstrStartDelimiter <> "") Then
        'normal: search for the starting delimiter
        rrlngPageIndex = rrtfHtml.Find(vstrStartDelimiter, _
            rrlngPageIndex + 1)
        'Find returns -1 when the text is not found
        If (rrlngPageIndex = -1) Then Exit Function
        lngStringStart = rrlngPageIndex + Len(vstrStartDelimiter)
    Else
        'start at current position
        lngStringStart = rrlngPageIndex
    End If

    'find ending delimiter
    rrlngPageIndex = rrtfHtml.Find(vstrEndDelimiter, lngStringStart + 1)
    If (rrlngPageIndex = -1) Then Exit Function
    lngStringEnd = rrlngPageIndex - 1

    'extract text
    rrtfHtml.SelStart = lngStringStart
    rrtfHtml.SelLength = lngStringEnd - lngStringStart + 1
    mExtractHTML = rrtfHtml.SelText

    'set output value
    rrlngPageIndex = lngStringEnd + Len(vstrEndDelimiter)

    On Error GoTo 0
    Exit Function

lblError:
    'unexpected runtime error
    mExtractHTML = "ERROR"
End Function
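
The same start/end delimiter technique can also be tried on a plain String, without a RichTextBox. The following is a minimal sketch under stated assumptions, not the code from the download: ExtractBetween and DemoExtract are hypothetical names, the search uses the 1-based InStr function instead of the control's Find method, and an empty string signals that a delimiter was not found.

```vb
Option Explicit

' Hypothetical helper (not part of the article's download).
' Returns the text between strStart and strEnd, searching strHtml
' from position rlngPos (1-based; values below 1 are treated as 1).
' On success rlngPos is advanced to just past the end delimiter.
' If either delimiter is missing, "" is returned and rlngPos is unchanged.
Function ExtractBetween(ByVal strHtml As String, _
        ByVal strStart As String, ByVal strEnd As String, _
        ByRef rlngPos As Long) As String

    Dim lngFrom As Long
    Dim lngStart As Long
    Dim lngEnd As Long

    ExtractBetween = ""
    lngFrom = rlngPos
    If lngFrom < 1 Then lngFrom = 1

    'find the start delimiter (case-insensitive)
    lngStart = InStr(lngFrom, strHtml, strStart, vbTextCompare)
    If lngStart = 0 Then Exit Function
    lngStart = lngStart + Len(strStart)

    'find the end delimiter that follows it
    lngEnd = InStr(lngStart, strHtml, strEnd, vbTextCompare)
    If lngEnd = 0 Then Exit Function

    ExtractBetween = Mid$(strHtml, lngStart, lngEnd - lngStart)
    rlngPos = lngEnd + Len(strEnd)
End Function

' Usage: print every href value in a small HTML fragment.
Sub DemoExtract()
    Dim strHtml As String
    Dim strLink As String
    Dim lngPos As Long

    strHtml = "<a href=""http://a.example/"">A</a> " & _
              "<a href=""http://b.example/"">B</a>"
    lngPos = 1
    Do
        strLink = ExtractBetween(strHtml, "href=""", """", lngPos)
        If strLink = "" Then Exit Do
        Debug.Print strLink
    Loop
End Sub
```

Because the position is passed ByRef, each call resumes where the previous one stopped, which mirrors the role of rrlngPageIndex in Listing B. Note that an empty result is also what an empty href would produce, so a caller that needs to tell the two cases apart would have to compare the position before and after the call.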

The functions mcolGetAllUrlsInPage and mcolExtractAllEmailAddressesOnPage build on the previous function and return the links or email addresses (respectively) back to the calling routine via a collection. These functions are smart enough to remove links and email addresses that might appear valid to a less sophisticated Web-bot, but really wouldn't be applicable. For example, most email addresses to mailing lists are of the format subscribe@somedomain.com. The routine weeds these out. Other examples of screened email addresses include sales@somedomain.com and support@somedomain.com.
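
The screening step described above might look something like the following. This is only a sketch: IsScreenedEmail is a hypothetical name, and the list of screened prefixes is an assumption taken from the three examples in the text, not the actual list used by mcolExtractAllEmailAddressesOnPage.

```vb
Option Explicit

' Hypothetical helper (not part of the article's download).
' Returns True when the address should be skipped: either it has
' nothing before the "@", or the part before the "@" is one of the
' assumed mailing-list / role-account names.
Function IsScreenedEmail(ByVal strEmail As String) As Boolean
    Dim varPrefix As Variant
    Dim strLocal As String
    Dim lngAt As Long

    IsScreenedEmail = False

    lngAt = InStr(strEmail, "@")
    If lngAt < 2 Then
        'no "@", or nothing before it: not a usable address
        IsScreenedEmail = True
        Exit Function
    End If

    'compare the part before the "@" without regard to case
    strLocal = LCase$(Left$(strEmail, lngAt - 1))
    For Each varPrefix In Array("subscribe", "sales", "support")
        If strLocal = varPrefix Then
            IsScreenedEmail = True
            Exit Function
        End If
    Next varPrefix
End Function
```

With this sketch, IsScreenedEmail("Subscribe@somedomain.com") returns True, while an address such as jane@somedomain.com passes through. Keeping the prefixes in one place makes it easy to extend the list as you discover other addresses you don't want logged.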

Avoiding infinite loops

Some pages either link back to themselves or link to other pages that eventually loop back to the original page. If a Web-bot doesn't keep an eye out for such pages, it can easily fall into an infinite loop. To avoid this trap, our Web-bot does two things. First, it uses the function mSaveVisitedUrl to store every URL in the Access database. As you can see if you view the code in this month's download, this function uses standard ADO code for saving data to a database.

Second, before going to any new URL, it determines if it already visited the page. To do so, it calls mblnAlreadyVisiting, shown in Listing C. If the database contains the URL, then the Web-bot skips the page, thus short-circuiting the infinite loop.

Listing C: Code to detect duplicate URL

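Function mblnAlreadyVisiting(ByVal vstrURL As String) As Boolean
    Dim objConnection As ADODB.Connection
    Dim objRecordset As ADODB.Recordset
    Dim strSQL As String

    'default: treat the URL as not yet visited
    mblnAlreadyVisiting = False

    'connect to database
    ConnectToDatabase objConnection

    'double any single quotes so a URL can't break the SQL string
    '(Replace requires VB6)
    strSQL = "SELECT * FROM WebBot_Visited_Url " _
        & "WHERE url='" & Replace(vstrURL, "'", "''") & "'"

    Set objRecordset = New ADODB.Recordset

    On Error GoTo lblOpenError
    'the lookup only reads, so a read-only lock is enough
    objRecordset.Open strSQL, objConnection, _
        adOpenForwardOnly, adLockReadOnly
    On Error GoTo 0

    If objRecordset.EOF = False Then
        'found
        mblnAlreadyVisiting = True
    Else
        'not found
        mblnAlreadyVisiting = False
    End If

    'close recordset
    objRecordset.Close
    Set objRecordset = Nothing
    DisconnectFromDatabase objConnection
    Exit Function

lblOpenError:
    'the query failed: release the objects and report "not visited"
    Set objRecordset = Nothing
    DisconnectFromDatabase objConnection
End Function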
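
If a database round trip for every link proves slow, an in-memory check can sit in front of it. The sketch below is an assumption-laden illustration, not part of the download: AlreadyVisitedInMemory and mcolVisited are hypothetical names, and the technique relies on the fact that a VB Collection raises run-time error 457 when a key is added twice.

```vb
Option Explicit

'URLs seen during this run, keyed by the URL itself (hypothetical)
Private mcolVisited As New Collection

' Hypothetical helper (not part of the article's download).
' Records strURL and returns True if it was already recorded
' earlier in the same run.
Function AlreadyVisitedInMemory(ByVal strURL As String) As Boolean
    On Error GoTo lblDuplicate

    'Collection keys must be unique, so a repeat raises error 457
    mcolVisited.Add strURL, strURL
    AlreadyVisitedInMemory = False
    Exit Function

lblDuplicate:
    'only error 457 means "seen before"; anything else counts as new
    AlreadyVisitedInMemory = (Err.Number = 457)
End Function
```

Two caveats apply. Collection keys are compared without regard to case, so URLs that differ only in capitalization are treated as the same page. And because the collection lives in memory, it is lost when the program stops, so it can complement the database log but not replace it.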

Resuming operation after stopping

Should anything unforeseen happen during a Web-bot search, such as the operating system crashing or the computer getting switched off, the search would normally have to be completely rerun. However, this would not be a happy prospect for someone who was a few hours, or days, into a search, so the Web-bot code is built to handle this contingency.

To let you resume a search, the Web-bot relies on the same URL log that protects against infinite loops, because that log also keeps track of the URL currently being visited. If the application gets shut down prematurely, it simply picks up where it left off.

Conclusion

Web-bots make the Web far more useful because they let you pull in more information than a search engine alone, and gather that information into a useful format. The uses for a Web-bot are limited only by your imagination, and with this article, you now have the tools to build whatever you can dream up.

Other User Comments
5/3/2000 10:33:00 PM:Nova
Doh! This program looks really cool + totally useful, but the 'ADO cant find the specified provider' ! Help Please?!
 
5/3/2000 11:55:26 PM:Nova
Ahh, doesn't matter. Did a temp fix with a bit of error handling. By the way Ian, this Web bot is really excellent.
 
5/4/2000 9:19:16 PM:Ian Ippolito
Thanks Nova! If you care, you can probably download the latest version of ADO from Microsoft (search for MDAC on their site). Ian
 
5/7/2000 12:27:25 AM:Messiacle
Help me!! when i try to open the webbot, it says this:component comctl32.ocx or one of it dependencies not correctly registered: a file is missing or invalid. also, when i try to open the project, it says it is corrupt. i have vb4, 32 bit. thanks!
 
5/8/2000 6:03:15 PM:Ian Ippolito
Mesiacle, The project was written in vb6 and requires either that or vb5. Sorry...I'll update the compatibility of the submission. Ian
 
5/29/2000 1:04:51 AM:masika
Hmm your program sounds grate! But i get the ADO thing too. I would wonder do you need Excel or access installed? I updated my ADO stuff it still says that.. I have vb 6 and vb 5 neither make it work.. Please help.
 
5/29/2000 11:34:33 AM:Ian Ippolito
Masika, The program doesn't require Access or Excel...so I'm not sure what the problem is. I would suggest writing to Nova (39080517@pager.icq.com) who ran into the same thing and fixed it...hopefully he can post his solution here. Ian
 
5/30/2000 7:18:20 PM:Masika
There is one error the ADO error I updated my ADO recently and it still didn't work.
 
6/9/2000 12:21:55 AM:etrask
I get an error that says my msinet.ocx is out of date or i dont have it i know i have it and i updated it too but still no fix i want to use this program! looks cool! could use to...um...never mind
 
6/9/2000 12:27:42 AM:Nova
The only thing I did was put in some On Error Resume Next whereever the database code was bringup an error. Although using this method will mean you will have to forfit the database functionality. Note: This fix is only for run time errors.
 
6/9/2000 5:54:58 PM:Ian Ippolito
Etrask, I'll email you the latest version. Ian
 
6/11/2000 2:17:42 AM:etrask
Hey Ian, I can run the program now but I got a lot of errors in the actual code which is weird. Anyway, I'll do some error handling. A few suggestions: make it so that when you minimize it, it goes into the systray. And, make the border type a fixed single. Sorry I'm telling you how to make your program. anyway, really nice peice of code!
 
6/22/2000 1:38:37 PM:Manu
Great code. Thanx Ian!!
 
6/28/2000 11:08:29 PM:Lyndon
I ran the code and it ran picture perfect right from the start. Everyone Is always going to Config problems with the differnt Equipment. I recomend this as an Excellent Piece of Code. Thank You
 
7/11/2000 8:02:37 AM:Vince Agwada
This is a great piece of code. Thanks!
 
7/12/2000 2:00:33 PM:Detonate
excellent job, Ian
 
4/10/2001 7:33:13 PM:etrask
I've had this problem for quite a while so if someone could help me I would really appreciate it. I'm using VB 5.0 Pro edition. I can never get a label to stay above a frame. The frame always goes over it. In this program there are tons of labels that are above the frame yet I can't find out how. If you could help me, thank you. (Please send the solution to my email, listed above.)
 
6/13/2001 4:42:03 PM:Brian PEal
Ian, I came across a problem in my project which uses the OpenURL command of ITC. I noticed that your program does not account for this potential issue. When you use the OpenURL(destination) and the destination redirects you to another web site, how do you get the URL of the new destination? Example:
 
6/14/2001 12:08:20 AM:Ian Ippolito
Brian, You're right. When I was writing this code, I didn't know of any way to get this information from the URL control...it doesn't seem to expose any properties to allow you to detect this situation. Perhaps someone else on this site knows of a solution? Ian
 
6/16/2001 2:09:41 PM:FTWJFIA
I got this and it worked fine right off the start, but after searching through about 30 sites it poped up with a critical error about database overflow or something like that. is this the same error everyone else has been getting??
 
6/25/2001 5:49:47 PM:Janus
I would'nt make the code so it uses mblnAlreadyVisiting! An easy way to accomplish this task is to make a primary key in the db based on the url! That way no urls would be dubs, and the performance would be much better, cause no need for making a select-statement, connection and do the logic ! Regardz Janus
 
6/25/2001 6:10:12 PM:Ian Ippolito
Janus, There is always more than 1 way to skin a cat. That would certainly work. Ian
 
7/7/2001 2:43:14 AM:Dr. Frost
Is there a way, or should I say, How do you limit the number of hits (say, 300), and then rank them by relevance? BTW, sweet code!
 
7/7/2001 1:10:20 PM:Ian Ippolito
Currently no...I'll leave relevance ranking to someone else here who wants to take on this challenge? Ian
 
8/30/2001 2:09:31 PM:karen
Is it possible to change the webagent user-interface? And how???

Copyright© 1997 by Exhedra Solutions, Inc. All Rights Reserved.  By using this site you agree to its Terms and Conditions.  Planet Source Code (tm) and the phrase "Dream It. Code It" (tm) are trademarks of Exhedra Solutions, Inc.