Easy .NET Screen Scraping
ASPAlliance.com: The #1 ASP.NET Community

By Steven Smith


I named this article's file 'netscrape', but don't confuse it with the little-used browser with a similar name. This article is all about how to use .NET's built-in libraries to "screen scrape", that is, to have .NET send a web request and return the resulting page to you as a string. Under classic ASP this required a third-party component, such as the popular AspHTTP component. With .NET it's incredibly easy.

Everything you need to do screen scraping in .NET is in the System.Net namespace. In particular, you will want to become familiar with the WebRequest and WebResponse classes, which send a request over HTTP and return the response, respectively. Since remembering any of this when I want to grab the contents of a page can be a real strain on the brain, I've created a super simple function that removes any requirement of thought on my part. Feel free to use it in your applications; donations and/or credit appreciated. Maybe I could pull an Amazon and patent it, like one-click ordering, because it's not immediately obvious... nah, it's pretty darned obvious, as you can see. Almost trivial. But it might save you half an hour figuring out the new .NET object model, and it surely saves me time remembering it, so it was worth writing this article about.

The key point to notice in the code below is the readHtmlPage function. The rest is just the code needed to make the example run. Also note that you need to import the System.IO namespace to support the StreamReader class.

/stevesmith/articles/examples/cs/scrape.aspx

<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.IO" %>
<script language="C#" runat="server">
   void Page_Load(Object Src, EventArgs E) {
      myPage.Text = readHtmlPage("http://aspadvice.com/blogs/ssmith/");
   }

   // Fetches the given URL and returns the response body as a string.
   private String readHtmlPage(string url)
   {
      String result;
      WebRequest objRequest = WebRequest.Create(url);
      // The using blocks close the response and the reader, even on error
      using (WebResponse objResponse = objRequest.GetResponse())
      using (StreamReader sr =
         new StreamReader(objResponse.GetResponseStream()) )
      {
         result = sr.ReadToEnd();
      }
      return result;
   }
</script>
<html>
<body>
<b>This content is being populated from a separate HTTP request to
<a href="http://aspadvice.com/blogs/ssmith/">http://aspadvice.com/blogs/ssmith/</a>:</b><hr/>
<asp:literal id="myPage" runat="server"/>
</body>
</html>
Addendum: 25 April 2003

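I've had a few people write to tell me that this fails to work with international character sets. The reason given is that the sample decodes the response as ASCII text. If you need to scrape pages that include non-ASCII characters, pass an encoding to the StreamReader explicitly. The fix that was sent in uses UTF-7, as this forums post describes; in general the encoding should match the one the page is actually served in (a StreamReader constructed without an encoding argument, as in the listing above, assumes UTF-8). Something like this: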

Dim sr As New StreamReader(objResponse.GetResponseStream(), System.Text.Encoding.UTF7)
Thanks to Nicholas Wanderer for sending in this fix.
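
The effect of the encoding choice can be seen without making a web request. The following is a minimal sketch, not part of the original sample: the EncodingDemo class and its Decode and PickEncoding helpers are hypothetical names, and the byte array stands in for a page body that is assumed to be served as UTF-8. It decodes the same bytes once as ASCII and once with an encoding looked up from a charset name.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: decodes raw response bytes with the given
    // encoding, the same way readHtmlPage would if an encoding were
    // passed to its StreamReader.
    public static string Decode(byte[] body, Encoding encoding)
    {
        using (StreamReader sr = new StreamReader(new MemoryStream(body), encoding))
        {
            return sr.ReadToEnd();
        }
    }

    // Hypothetical helper: maps a charset name (for example, one taken
    // from a Content-Type header) to an Encoding, falling back to UTF-8
    // when the name is missing or not recognized.
    public static Encoding PickEncoding(string charset)
    {
        if (String.IsNullOrEmpty(charset)) return Encoding.UTF8;
        try { return Encoding.GetEncoding(charset); }
        catch (ArgumentException) { return Encoding.UTF8; }
    }

    public static void Main()
    {
        // Stand-in for a page body served as UTF-8 (assumption).
        string page = "<p>caf\u00E9</p>";
        byte[] body = Encoding.UTF8.GetBytes(page);

        string asAscii = Decode(body, Encoding.ASCII);
        string asUtf8 = Decode(body, PickEncoding("utf-8"));

        // The ASCII decoder cannot represent the bytes of the accented
        // character, so only the UTF-8 result matches the original text.
        Console.WriteLine("ASCII round trip ok: " + (asAscii == page));
        Console.WriteLine("UTF-8 round trip ok: " + (asUtf8 == page));
    }
}
```

In readHtmlPage, the same idea would mean passing the chosen Encoding as the second argument to the StreamReader constructor, as the VB line above does.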


Steven Smith, MCSE + Internet (4.0)
Last Modified: 5/12/2006 9:00:08 AM
History: 1/25/2004 7:10:06 PM