HTMLContentParser ASP.NET Project using VB.NET

Submitted on: 6/18/2002 11:05:03 PM
By: SoftwareMaker
Level: Advanced
User Rating: (rated by 4 users)
Compatibility: VB.NET, ASP.NET

Users have accessed this article 16455 times.

This project is an HTML content parser. It retrieves the HTML content of a web page from a specified URL, goes through the whole stream, picks out the hyperlinks and images, and displays them in an HTML table as links that users can click on directly. This is particularly useful for users who are interested in some images on a website and find it tedious to look through the page source to extract the image URLs. Please do not forget to give me your votes. Thank you very much.

This article has accompanying files

Terms of Agreement:
By using this article, you agree to the following terms...
1) You may use this article in your own programs (and may compile it into a program and distribute it in compiled format for languages that allow it) freely and with no charge.
2) You MAY NOT redistribute this article (for example to a web site) without written permission from the original author. Failure to do so is a violation of copyright laws.
3) You may link to this article from another website, but ONLY if it is not wrapped in a frame.
4) You will abide by any additional copyright restrictions which the author may have placed in the article or article's description.

The project I am posting today does not really have much practical or functional value to anyone (although I think it is pretty cool for web designers and developers, I am sure there will be detractors out there). It is just meant to showcase some ASP.NET objects, how easy they are to use, and some simple string manipulation.

Please do check out the live version of this project on my website. Please feel free to post any comments or criticisms on my project and articles.

Let's make use of MS.NET's more intuitive OOP features to encapsulate and group the different functions into classes and assemblies for easier maintenance.

This code goes into a class file called HTMLContentParser.vb
'///////////////////////////
Imports System
Imports System.Collections 'ArrayList
Imports System.IO
Imports System.Net
Imports System.Text
Imports System.Text.RegularExpressions
Public Class HTMLContentParser
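    'Downloads the page at sURL and returns its HTML as a string.
    'If the request fails, the exception message is returned instead.
    Function Return_HTMLContent(ByVal sURL As String) As String
        Dim sStream As Stream
        Dim srReader As StreamReader
        Dim URLReq As HttpWebRequest
        Dim URLRes As HttpWebResponse
        Try
            URLReq = CType(WebRequest.Create(sURL), HttpWebRequest)
            URLRes = CType(URLReq.GetResponse(), HttpWebResponse)
            sStream = URLRes.GetResponseStream()
            srReader = New StreamReader(sStream)
            Try
                Return srReader.ReadToEnd()
            Finally
                'Release the stream and the connection
                srReader.Close()
                URLRes.Close()
            End Try
        Catch ex As Exception
            Return ex.Message
        End Try
    End Function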
    'Returns an ArrayList with the absolute URL of every hyperlink in sHTMLContent.
    Function ParseHTMLLinks(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
        Dim rRegEx As Regex
        Dim mMatch As Match
        Dim aMatch As New ArrayList()
        '[^>]*? keeps each match inside one <a> tag, even a tag split over several lines
        rRegEx = New Regex("<a\b[^>]*?\bhref\s*=\s*(?:""(?<1>[^""]*)""|(?<1>[^\s>]+))", _
            RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        mMatch = rRegEx.Match(sHTMLContent)
        While mMatch.Success
            Dim sMatch As String
            sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
            aMatch.Add(sMatch)
            mMatch = mMatch.NextMatch()
        End While
        Return aMatch
    End Function
    'Returns an ArrayList with the absolute URL of every image in sHTMLContent.
    Function ParseHTMLImages(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
        Dim rRegEx As Regex
        Dim mMatch As Match
        Dim aMatch As New ArrayList()
        '[^>]*? keeps each match inside one <img> tag, even a tag split over several lines
        rRegEx = New Regex("<img\b[^>]*?\bsrc\s*=\s*(?:""(?<1>[^""]*)""|(?<1>[^\s>]+))", _
            RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        mMatch = rRegEx.Match(sHTMLContent)
        While mMatch.Success
            Dim sMatch As String
            sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
            aMatch.Add(sMatch)
            mMatch = mMatch.NextMatch()
        End While
        Return aMatch
    End Function
    'Turns a link found in the page (sInput) into an absolute URL based on sURL.
    Private Function ProcessURL(ByVal sInput As String, ByVal sURL As String) As String
        'Find out if the sURL has a "/" after the Domain Name
        'If not, give a "/" at the end
        'First, check for any slash after the
        'double slashes of the http://
        'If there is NO slash, then end the sURL string with a SLASH
        If InStr(8, sURL, "/") = 0 Then
            sURL += "/"
        End If
        'FILTERING
        'Filter down to the Domain Name Directory from the Right
        Dim iCount As Integer
        For iCount = sURL.Length To 1 Step -1
            If Mid(sURL, iCount, 1) = "/" Then
                sURL = Left(sURL, iCount)
                Exit For
            End If
        Next
        'Filter out the "&gt;" from the Left
        For iCount = 1 To sInput.Length
            If Mid(sInput, iCount, 4) = "&gt;" Then
                sInput = Left(sInput, iCount - 1) 'Stop and Take the Char before
                Exit For
            End If
        Next
        'Filter out unnecessary Characters
        sInput = sInput.Replace("&lt;", Chr(39))
        sInput = sInput.Replace("&gt;", Chr(39))
        sInput = sInput.Replace("&quot;", "")
        sInput = sInput.Replace("'", "")
        'Links without "http://" are treated as relative to sURL
        If (sInput.IndexOf("http://") < 0) Then
            If (Not (sInput.StartsWith("/")) And Not (sURL.EndsWith("/"))) Then
                Return sURL & "/" & sInput
            Else
                If (sInput.StartsWith("/")) And (sURL.EndsWith("/")) Then
                    Return sURL.Substring(0, sURL.Length - 1) + sInput
                Else
                    Return sURL + sInput
                End If
            End If
        Else
            Return sInput
        End If
    End Function
End Class
'///////////////////////////////////////////////////////
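
As an aside, the relative-to-absolute step that ProcessURL performs by hand could probably also be handled by the System.Uri class. The following is only a minimal sketch of that alternative, not the code used in this project: the module name, the ResolveUrl helper and the example.com addresses are made up for illustration.

```vbnet
Imports System

Module UrlResolveSketch
    ' Hypothetical helper (not part of HTMLContentParser): resolves a link
    ' found on a page against the address of that page.
    Function ResolveUrl(ByVal sPageUrl As String, ByVal sLink As String) As String
        Dim baseUri As New Uri(sPageUrl)
        ' The Uri(Uri, String) constructor combines a base address with a
        ' relative one; a link that is already absolute is kept as it is.
        Dim fullUri As New Uri(baseUri, sLink)
        Return fullUri.AbsoluteUri
    End Function

    Sub Main()
        Dim sPage As String = "http://www.example.com/news/index.html"
        Console.WriteLine(ResolveUrl(sPage, "images/logo.gif"))
        Console.WriteLine(ResolveUrl(sPage, "/images/logo.gif"))
        Console.WriteLine(ResolveUrl(sPage, "http://other.example.org/a.gif"))
    End Sub
End Module
```

Unlike ProcessURL, this sketch does not strip stray quotes or entities from the link, so any such cleanup would still have to happen first.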

The function Return_HTMLContent takes a URL as a string parameter. It uses the HttpWebRequest and HttpWebResponse objects to send a request to the specified URL and read the HTML content of the response. Note the structured error handling (Try ... Catch) implemented here; structured error handling is a topic of its own and is not explained in this article. The returned value should be placed and displayed in an HTML text box so that it can be retrieved later.

The ParseHTMLLinks and ParseHTMLImages functions make use of the Regex object, which should be very familiar to Java and C# developers and may look alien to VB developers. Regex is a pattern-matching object that is used together with the Match object. Both belong to the System.Text.RegularExpressions namespace, which therefore MUST be imported into the VB.NET class. The functions run the Regex pattern against the HTML content (taken from the HTML text box we used earlier to display the retrieved HTML) and hold each result in a Match object. Each time the pattern is found, the function takes the matched URL, processes it with the ProcessURL function and adds it to an ArrayList. The ArrayList class is roughly the equivalent of the Collection class in classic VB. It lets you add and remove items, which is far more intuitive and easier to use than a plain array. ParseHTMLLinks returns an ArrayList of links and ParseHTMLImages returns an ArrayList of images.
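
To make the Regex/Match loop easier to follow, here is a minimal stand-alone sketch that runs it against a small in-memory string instead of a downloaded page. The module name, the FindHrefs helper and the sample HTML are assumptions made for this illustration, and the pattern is a simplified one that only handles double-quoted href values.

```vbnet
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module RegexLoopSketch
    ' Hypothetical helper: collects every double-quoted href value in sHtml.
    Function FindHrefs(ByVal sHtml As String) As ArrayList
        Dim aFound As New ArrayList()
        ' [^>]*? stays inside one tag and, unlike ".", also crosses line breaks.
        Dim rRegEx As New Regex("<a\b[^>]*?\bhref\s*=\s*""(?<1>[^""]*)""", _
            RegexOptions.IgnoreCase)
        Dim mMatch As Match = rRegEx.Match(sHtml)
        ' Match returns the first hit; NextMatch moves on until Success is False.
        While mMatch.Success
            aFound.Add(mMatch.Groups(1).Value)
            mMatch = mMatch.NextMatch()
        End While
        Return aFound
    End Function

    Sub Main()
        ' The second tag is deliberately split over two lines.
        Dim sHtml As String = "<a href=""one.html"">One</a> <a class=""x""" & _
            Environment.NewLine & " href=""two.html"">Two</a>"
        Dim sLink As String
        For Each sLink In FindHrefs(sHtml)
            Console.WriteLine(sLink)
        Next
    End Sub
End Module
```

Group 1 is the part of the pattern inside (?<1> ... ), so only the URL itself ends up in the ArrayList, not the whole tag.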

The ProcessURL function uses some intrinsic VB functions (which I have worked with as a developer for years and am therefore familiar with) and some new VB.NET ones. I also realize that some detractors out there will propose using the StringBuilder class to manipulate the strings in this function. The difference between the two is that the String class is immutable while the StringBuilder class is mutable: a StringBuilder changes its own contents in place, whereas a String cannot be changed once it is created, so every operation that modifies a string actually creates a new String instance. You can imagine the strain on resources: when the same string is manipulated 10 times, up to 10 new instances are created. Hardly efficient. I use the String class here because, although it is less efficient, it is much more familiar to VB developers who are in transition to VB.NET, and this article focuses mainly on the HttpWebRequest and HttpWebResponse objects. I will save the StringBuilder class for later articles, although I am sure other developers and authors here have already explained it.
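
For readers who want to see the difference right away, here is a small sketch, not taken from the project: String.Replace hands back a new string and leaves the original alone, while StringBuilder.Replace changes the builder itself. The module name and the StripChars helper are hypothetical, and the characters it removes are only loosely modelled on the filtering in ProcessURL.

```vbnet
Imports System
Imports System.Text

Module StringBuilderSketch
    ' Hypothetical helper: removes quote characters and angle brackets.
    Function StripChars(ByVal sInput As String) As String
        Dim sb As New StringBuilder(sInput)
        ' Each Replace call modifies this same StringBuilder instance.
        sb.Replace("""", "")
        sb.Replace("'", "")
        sb.Replace("<", "")
        sb.Replace(">", "")
        Return sb.ToString()
    End Function

    Sub Main()
        Dim sOriginal As String = "'page.html'>"
        ' String.Replace returns a new String; sOriginal keeps its value.
        Dim sCopy As String = sOriginal.Replace("'", "")
        Console.WriteLine(sOriginal)
        Console.WriteLine(sCopy)
        Console.WriteLine(StripChars(sOriginal))
    End Sub
End Module
```

For the handful of replacements done per link in this project the difference is unlikely to be noticeable; it matters more when a string is modified many times in a loop.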

This code goes into an ASP.NET ASPX page
'//////////////////////////////////////////////////////////////////////////////
'Create the parser here; declared without New it would still be Nothing when used
Private objParser As New HTMLContentParser()
Private Sub cmdGetHTML_ServerClick(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles cmdGetHTML.ServerClick
    Dim sURL As String = "http://" & txtURL.Value
    txtHTMLContent.EnableViewState = False
    txtHTMLContent.Value = objParser.Return_HTMLContent(sURL)
End Sub
Private Sub cmdParse_ServerClick(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles cmdParse.ServerClick
    Call PopulatetblParsedContent()
End Sub
Private Sub PopulatetblParsedContent()
    'Populate the table with the parsed links and images
    Dim sURL As String = "http://" & txtURL.Value
    Dim myAnchor As HtmlAnchor
    Dim objRow As HtmlTableRow
    Dim objCell As HtmlTableCell
    Dim sLinks As String
    Dim sImage As String
    Dim lstLinks As ArrayList = objParser.ParseHTMLLinks(txtHTMLContent.Value, sURL)
    Dim lstImages As ArrayList = objParser.ParseHTMLImages(txtHTMLContent.Value, sURL)
    tblParsedContent.EnableViewState = False
    'One row per link: anchor -> cell -> row -> table
    For Each sLinks In lstLinks
        objRow = New HtmlTableRow()
        objCell = New HtmlTableCell()
        myAnchor = New HtmlAnchor()
        myAnchor.Target = "_blank"
        myAnchor.InnerText = "Link: " & sLinks
        myAnchor.HRef = sLinks
        objCell.NoWrap = False
        objCell.Controls.Add(myAnchor)
        objRow.Cells.Add(objCell)
        tblParsedContent.Rows.Add(objRow)
    Next
    'One row per image, built the same way
    For Each sImage In lstImages
        objRow = New HtmlTableRow()
        objCell = New HtmlTableCell()
        myAnchor = New HtmlAnchor()
        myAnchor.Target = "_blank"
        myAnchor.InnerText = "Img: " & sImage
        myAnchor.HRef = sImage
        objCell.NoWrap = False
        objCell.Controls.Add(myAnchor)
        objRow.Cells.Add(objCell)
        tblParsedContent.Rows.Add(objRow)
    Next
End Sub
'/////////////////////////////////////////////////////////////////////////////////

Now let us focus on how the page extracts the HTML content and parses it. Design your ASPX page so that it has
1) A HTMLTextBox called txtURL for users to specify the URL for processing
2) A HTMLButton called cmdGetHTML with a ServerClick event handler to handle the Click event. The handler uses the HTMLContentParser class that we coded earlier and calls its Return_HTMLContent function to get a string of HTML content, which it displays in the txtHTMLContent HTMLTextBox.
3) A HTMLTextBox called txtHTMLContent to hold the returned HTML content
4) A HTMLButton called cmdParse with a ServerClick event handler that calls PopulatetblParsedContent
5) A HTMLTable called tblParsedContent

Personally, I think the HTMLTable server control is amazing. I have used this example to show how intuitive it is to add cells and rows to it. Again, my detractors out there may question the routine used to populate the table. I say that this article just explains one of the ways to populate a HTMLTable. It is very intuitive and I am sure most developers out there can understand it without much explanation. It uses HTMLTableCell and HTMLTableRow to add cells to rows and rows to the HTMLTable. Note the use of For Each ... In on the ArrayList collection to extract each link and image, assign it to another server control, HTMLAnchor, add this HTMLAnchor to a HTMLTableCell, add the HTMLTableCell to a HTMLTableRow and finally add the HTMLTableRow to the HTMLTable. Very intuitive to program and code!

In the world of software development, the Holy Grail is seldom achieved, as there is No One Right Way to do things; however, there are Many Wrong Ways. This article is more or less a tutorial on certain ASP.NET objects and on how intuitive it is to program the web server controls. Feel free to modify the code at your own pace and leisure to suit your learning curve in the transition from VB to VB.NET/ASP.NET. Use the StringBuilder class instead of the String class in the ProcessURL function of the HTMLContentParser.vb class, and once you are familiar with the program structure of the HTMLTable, move on to the DataSource and DataBind techniques of the ASP DataList and DataGrid server controls. There is only one way to learn properly and that is from SCRATCH. That way, you can fully appreciate why and how you do things. After all, isn't the whole world of MS.NET developed from SCRATCH, which is much better than patches and add-ons to the imperfections of yesterday?

Other User Comments

6/19/2002 1:10:11 AM:my name is jonas
i think your code is just great man! Thanx alot.

7/4/2002 2:46:40 AM:sy
well done, greate code

11/17/2002 11:47:13 AM:Simon Johnson
Great Code - some copy paste issues but hey GREAT CODE. Just need to adapt to extract plain text and I'll be singing and dancing too ;-)

5/23/2003 6:16:20 PM:Jerome Howard
Great code, however it doesn't handle tags thats are split over multiple lines! Have you managed to find a solution to this problem
