Go MAD for ASP.NET with Graymad!
By G. Andrew Duthie

WebClient & RegEx - Two Great Classes that Work Great Together!

Pulling data from the Web

Have you ever wanted to get one specific piece of data from a Web site? AspAlliance President Steve Smith has posted articles on screen scraping an entire page and on retrieving a full page that results from a POST request (both using the WebClient class), but what if you don't want an entire page? In this article, we'll scale the screen scraping technique down by adding the RegEx class to locate and parse out a specific piece of data from a page. To make the results of this technique as widely available as possible, we'll implement the idea as an ASP.NET Web service.

As an example, we'll create a service that pulls sales rank data from the Amazon.com Web site. We'll pass in an ISBN and retrieve the sales rank for the book associated with that ISBN.

To start out, we'll create the basic structure for the Web service and its method:

SalesRank.asmx
<%@ WebService Language="VB" Class="SalesRank" %>

Imports System.Net
Imports System.Text
Imports System.Text.RegularExpressions
Imports System.Web.Services

<WebService(Namespace:="http://www.graymad.com/webservices/")> _
Public Class SalesRank
    Inherits WebService

    <WebMethod()> _
    Public Function GetSalesRankAmazon(Isbn As String) As Integer
        ' implementation
    End Function
End Class

We use the @ WebService directive to specify that this file contains a Web service, and set the class for the Web service to SalesRank. We then import several namespaces to reduce the need for fully qualifying type names. For example, importing the System.Text.RegularExpressions namespace allows us to use the RegEx class without prefacing it with System.Text.RegularExpressions each time (importing System.Net does the same for the WebClient class).

Next, we define a public class and a public method that returns an Integer. We add the <WebService()> attribute to the class definition to specify that ASP.NET should treat it as a Web service, and we pass a namespace parameter to help avoid namespace collisions with other Web services. For this reason, you should always use a unique value, such as the URL for a domain you control, as the namespace value. Finally, we add a <WebMethod()> attribute to the public method to expose it via the Web service.

Note that in Visual Basic .NET, both the <WebService()> and <WebMethod()> attributes must precede the class or method definition to which they're applied, on the same logical line. For readability, we've used the VB.NET line-continuation character _ to split the line into two parts.

That's all the code we need for the Web service aspect of the project. Now, let's look at the actual implementation:
<WebMethod()> _

First, we declare a new instance of the WebClient class, and declare string variables for the URL to scrape data from, the HTML returned from the scraped page, and the text from which we'll build our RegEx instance. Next, we set the URL to the base URL for books on Amazon.com, append the ISBN passed in and a trailing slash, and then set the regex variable to "Rank:". I came up with this value by doing a View Source on several pages at Amazon, which revealed that "Rank:" is the last non-HTML text that precedes the Sales Rank value. As such, it gives us a nice starting point for the parsing we'll do after retrieving the HTML.

We then create an instance of the UTF8Encoding class as a helper for the data we get from the WebClient class, and retrieve the full HTML of the target page by calling the GetString method of UTF8Encoding on the byte array returned by the DownloadData method of the WebClient instance.

Now that we have the HTML, we switch into regular expression / parsing mode. We begin by creating a new instance of the RegEx class, using the variable we defined earlier ("Rank:", remember?) as the expression to search for. Next, we create an instance of the Match class, set to the first match of the RegEx instance, and then declare several Integer variables to hold the locations of various points in the text.

Finally, we grab the numeric location of the beginning of our target text, then use it to find the location of the end of the last HTML tag before the sales rank (again, this admittedly cobbled-together algorithm was designed by examining the HTML source of pages returned from Amazon.com), then use that value to find the location of the beginning of the first HTML tag after the sales rank data. Last, but not least, we use the RankBegin and RankEnd values to grab the string representing the sales rank data, convert it to an Integer, and return it from the Web service.
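The walkthrough above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the article's original listing: the module name SalesRankSketch, the helper ExtractSalesRank, the placeholder base URL, and the sample HTML in Main are all hypothetical, and the parsing assumes the rank is the first text after the tag that follows "Rank:".

```vb
Imports System
Imports System.Globalization
Imports System.Net
Imports System.Text
Imports System.Text.RegularExpressions

Public Module SalesRankSketch

    ' Placeholder base URL: an assumption, since the real one is not given in the text.
    Private Const BaseUrl As String = "http://www.example.com/books/"

    ' Downloads the page for an ISBN and returns the sales rank found in it.
    Public Function GetSalesRank(isbn As String) As Integer
        Dim url As String = BaseUrl & isbn & "/"
        Dim utf8 As New UTF8Encoding()
        Using client As New WebClient()
            ' DownloadData returns raw bytes; decode them into a string.
            Dim html As String = utf8.GetString(client.DownloadData(url))
            Return ExtractSalesRank(html)
        End Using
    End Function

    ' Locates "Rank:" with a regular expression, then uses plain string
    ' searches to cut out the text between the next ">" and the next "<".
    Public Function ExtractSalesRank(html As String) As Integer
        Dim rankRegex As New Regex("Rank:")
        Dim rankMatch As Match = rankRegex.Match(html)
        If Not rankMatch.Success Then
            Throw New FormatException("Marker text 'Rank:' was not found.")
        End If

        ' End of the first tag after the marker (assumed to be the last tag before the rank).
        Dim tagEnd As Integer = html.IndexOf(">"c, rankMatch.Index)
        If tagEnd < 0 Then
            Throw New FormatException("No closing '>' found after the marker.")
        End If
        Dim rankBegin As Integer = tagEnd + 1

        ' Start of the first tag after the rank text.
        Dim rankEnd As Integer = html.IndexOf("<"c, rankBegin)
        If rankEnd < 0 Then
            Throw New FormatException("No tag found after the sales rank.")
        End If

        Dim rankText As String = html.Substring(rankBegin, rankEnd - rankBegin).Trim()
        ' AllowThousands accepts values such as "1,234".
        Return Integer.Parse(rankText, NumberStyles.AllowThousands, CultureInfo.InvariantCulture)
    End Function

    Public Sub Main()
        ' Hypothetical sample markup, used only to exercise the parser offline.
        Console.WriteLine(ExtractSalesRank("<b>Sales Rank:</b> 1,234<br>")) ' prints 1234
    End Sub

End Module
```

Keeping the parsing in its own function means it can be checked against saved HTML without hitting the live site, which matters for a technique that breaks whenever the target page's layout changes.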
Consuming the Web Service

Once the above code has been saved to a file with the .asmx extension and placed in an IIS application directory, you can browse to the page directly to view and invoke its methods. ASP.NET will automatically create documentation pages that describe the Web service and allow you to invoke its methods, as well as show you how to create appropriate SOAP messages to call the Web service.

Nice as this is, you probably want to be able to call the service from your own page in order to display the value as you choose. Fortunately, Microsoft has supplied a nifty utility for this called wsdl.exe, which will automatically create a VB.NET or C# proxy class that you can use to call the Web service. All you need to do is call wsdl.exe and pass it the URL to a WSDL (Web Services Description Language) file, which you can conveniently get from an ASP.NET Web service simply by appending ?WSDL to the Web service's URL, as shown below:

wsdl http://www.graymad.com/salesrank.asmx?WSDL

If you want, you can also pass command-line parameters to wsdl.exe to specify the language and namespace for the proxy class (run wsdl.exe /? for a list of all the command-line options). Note that wsdl.exe generates C# by default, so pass /language:VB if you want a VB.NET proxy class.

Once you've generated your proxy class, you'll need to compile it using the command-line compiler for the language you chose, passing it the generated source file (SalesRank.vb by default for a VB.NET proxy), as shown below (note that the compilation command should consist of a single line, even if the text below wraps):

vbc /t:library /r:System.Web.dll /r:System.dll /r:System.Web.Services.dll /r:System.Xml.dll SalesRank.vb

Once you've compiled the proxy class, simply copy the resulting assembly (salesrank.dll) to the bin subdirectory of the Web application from which you wish to use it, and invoke it with the following code:
Dim SR As New SalesRank

Note that you could just as easily provide a textbox or other means for users to enter an ISBN to look up dynamically.

Live Code

You can see this code in action on my Web site at http://www.graymad.com/salesrank.asmx, and on the main page at http://www.graymad.com/ where I use it for a live feed of the current sales rank for my book, Microsoft ASP.NET Step By Step.

Caveats and Warnings

Several important things have been omitted in the interest of simplifying the code examples. The two most important are input validation and error handling.

Input Validation

When accepting input from unknown sources (which includes any Internet user), it is imperative that you treat that input as suspect until proven otherwise. Therefore, you should use some technique (such as a RegularExpressionValidator control, the RegEx class, or others) to validate that the input conforms to your expectations.

Error Handling

Because screen scraping techniques rely on the format of the target page remaining consistent, it is especially important that you provide error handling that will allow you to either recover gracefully from failure, or at least provide users with an informative message. This won't prevent a change in the target page from breaking your code, but if you set up an error handler to email you when an error occurs, you'll know as soon as it happens so you can correct your code.

Gotchas

To use the wsdl.exe command-line tool and/or the command-line compilers without typing the full path to the executable, you may need to add the path to these utilities to Windows' PATH environment variable.

And if you're using C# rather than VB.NET, the main thing to be aware of (other than the semicolons and curly braces <g>) is that the syntax for metadata attributes in C# uses [ ] rather than < >, so the WebService and WebMethod attributes in C# would look like the following:
[WebService(Namespace="http://www.graymad.com/webservices/")]

Efficiency

I'll freely admit that the algorithm used to locate the desired sales rank data may not be the most efficient, since it uses both regular expression matching and string parsing. It would probably be more efficient to use a single regular expression to both locate and extract the desired data, as I suspect is possible. As I'm not a regular expressions expert, I took the easy way out and combined regular expression matching and parsing. If you think you know a more efficient way to retrieve the data, feel free to let me know at graymad@aspalliance.com, and I'll post your suggestion here, along with your name.

Acknowledgements

My thanks to Ronda Pederson and Brad Kingsley for their assistance in reviewing this article.
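Returning to the Efficiency note above, one possible way to locate and extract the rank with a single regular expression is sketched below. It is only a sketch under assumptions: the module name RegexOnlySketch, the pattern, and the sample HTML are hypothetical, and the pattern assumes that only tags and whitespace sit between "Rank:" and a number written with optional comma separators.

```vb
Imports System
Imports System.Globalization
Imports System.Text.RegularExpressions

Public Module RegexOnlySketch

    ' Assumed page layout: the literal "Rank:", then any run of tags and
    ' whitespace, then digits with optional commas (captured as "rank").
    Private ReadOnly RankPattern As New Regex("Rank:(?:\s|<[^>]*>)*(?<rank>\d[\d,]*)")

    ' Finds and extracts the sales rank in one regular expression match.
    Public Function ExtractSalesRank(html As String) As Integer
        Dim rankMatch As Match = RankPattern.Match(html)
        If Not rankMatch.Success Then
            Throw New FormatException("No sales rank found after 'Rank:'.")
        End If
        ' AllowThousands accepts values such as "1,234".
        Return Integer.Parse(rankMatch.Groups("rank").Value, _
                             NumberStyles.AllowThousands, _
                             CultureInfo.InvariantCulture)
    End Function

    Public Sub Main()
        ' Hypothetical sample markup with more than one tag before the number.
        Console.WriteLine(ExtractSalesRank("<b>Sales Rank:</b> <font>1,234</font><br>")) ' prints 1234
    End Sub

End Module
```

Because the pattern skips any number of intervening tags, it does not depend on exactly one tag separating the marker from the number, though it is still tied to the target page's layout.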
Copyright © 2000-2003 ASPAlliance.com