Author: Michael Gonzalez

HTML Text Extraction using innerText


So, what is innerText? It's a DOM (Document Object Model) property, available in Internet Explorer 4 and later, that returns the text of an HTML document's BODY (whether the page is served as .php, .asp, .htm, etc.) with all HTML tags and scripting code stripped out. innerText is not limited to the entire BODY section; you can also read it from individual elements inside the BODY (such as FONT, A, DIV, etc.).
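To illustrate reading innerText from a single element instead of the whole BODY, here is a minimal sketch. The helper name getElementText and the id "greeting" are assumptions made up for this example; they are not part of the original page.

```javascript
// Hypothetical helper: returns the text of ONE element rather than the whole BODY.
// `element` is any object that exposes innerText, for example
// document.all["greeting"] in Internet Explorer 4 or later.
function getElementText(element) {
  // Guard against a missing element or a browser without innerText.
  if (!element || typeof element.innerText != "string") {
    return "";
  }
  return element.innerText;
}

// Browser usage (the id "greeting" is an assumption for illustration):
//   <div id="greeting">Hello, <b>world</b>!</div>
//   var text = getElementText(document.all["greeting"]);
```

Returning an empty string when innerText is unavailable keeps the calling code simple, at the cost of not distinguishing "no text" from "not supported".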


So, how did we get this innerText into that text box above? Via a hidden input form field. Let me show you how it's done:

    <html>
    <head></head>
    <body onLoad="LoadTextIntoField();">
    <script language="JScript">
    function LoadTextIntoField() {
        // Copy the page's innerText into the hidden field named "text"
        window.document.innerTextForm.text.value = window.document.body.innerText;
    }
    </script>
    <form name="innerTextForm" action="htmltext.asp" method="post">
    <input type="hidden" name="text">
    <input type="submit" value="Click here to view the innerText of this ASP Page">
    <textarea cols="60" rows="10" name="textsamp"></textarea>
    </form>
    </body>
    </html>

innerTextForm is the name of the form that contains the hidden field text. In order to gain access to the value in text, we must first know the name of its parent, innerTextForm. So, we can place data into the text field simply by setting its value property: window.document.innerTextForm.text.value.

As you can see, we make use of the BODY's onLoad event handler, which calls the LoadTextIntoField function. This function contains the commands necessary to get the data obtained by innerText into the form field text. Why do we need to use the BODY's onLoad event? Because we need to allow the entire HTML document to load before we can use innerText to extract its data. If we use innerText early in the document loading process, we risk missing some of the data in that document.

Therefore, we make use of the onLoad event. This way, after all of the HTML document is loaded, and ONLY after, do we call the LoadTextIntoField function to populate our hidden form field with innerText data.

Once the text input field is populated with innerText data, we can use a submit button to send this data to an ASP page. After that, we can use code like that below to catch the form data and use it however we'd like:

    <!-- In ASP VBScript -->
    If Request.Form("text") <> "" Then
        'Do something with the innerText data here...
    End If

    <!-- In ASP JScript -->
    if (Request.Form("text") > "") {
        //Do something with the innerText data here...
    }

There are downsides to using innerText:
innerText is NOT a server-side VBScript, PerlScript, or JScript property. It can only be used with Internet Explorer 4 or later. That's it! There is a way for you to check the client's browser type and version at the client-side:

    <!-- Using JavaScript (Client-Side-Only) -->
    <script language="JavaScript">
    if (window.navigator.appName == "Microsoft Internet Explorer") {
        var appver = window.navigator.appVersion;
        // IE reports something like "4.0 (compatible; MSIE 5.0; Windows 98)",
        // so test the "MSIE" token; a bare "4." sits at position 0 and fails a "> 0" test.
        if (appver.indexOf("MSIE 4.") != -1 || appver.indexOf("MSIE 5.") != -1) {
            // Go ahead and populate the hidden input field with the innerText data here...
            window.document.innerTextForm.text.value = window.document.body.innerText;
        } else {
            alert("This innerText tutorial will only work with Internet Explorer 4 or 5!");
        }
    } else {
        alert("This innerText tutorial will only work with Internet Explorer 4 or 5!");
    }
    </script>

Note: Please notice that I set the language attribute of the SCRIPT tag to JavaScript. Why? Because JavaScript is supported by both Internet Explorer and Netscape (as well as other browsers). If you want to incorporate version-checking code into your HTML documents (including ASP pages), make sure to use a compatible scripting language such as JavaScript. We use the JScript language attribute with the rest of our code because only Internet Explorer supports JScript, not Netscape or any other browser. And, since we are dealing with innerText, only Internet Explorer should be involved!

As you can see, we assign the variable appver the version string that the browser's scripting engine provides. The first if statement has already established that the browser is Microsoft Internet Explorer, so all we have to do is use indexOf to determine whether the version text ('4.' or '5.') appears in that variable. If it does, then the browser is most likely running IE 4 or 5.

The problem with detecting browser version compatibility in this way arises when a browser maker releases a newer version. For example, if Microsoft created Internet Explorer 6.0, we would have to add code, in much the same way we did above, to allow our innerText code to work with that version as well. We don't want ONLY version 4 or 5 of IE to be able to utilize our innerText capabilities!
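One way around that maintenance problem is to parse the version number and compare it numerically rather than listing each version. The following is only a sketch under assumptions: the helper names getIEMajorVersion and isIE4OrLater are made up for this example, and it assumes the browser reports its version in an "MSIE n." token inside appVersion.

```javascript
// Hypothetical helper: pulls the major version out of an IE-style
// appVersion string such as "4.0 (compatible; MSIE 5.01; Windows 98)".
// Returns 0 when no "MSIE" token is present.
function getIEMajorVersion(appVersion) {
  var match = /MSIE (\d+)\./.exec(appVersion);
  return match ? parseInt(match[1], 10) : 0;
}

// Hypothetical helper: true for IE 4 and anything newer,
// so a future release needs no code change.
function isIE4OrLater(appName, appVersion) {
  return appName == "Microsoft Internet Explorer" &&
         getIEMajorVersion(appVersion) >= 4;
}

// In a browser this would be called as:
//   isIE4OrLater(window.navigator.appName, window.navigator.appVersion)
console.log(isIE4OrLater("Microsoft Internet Explorer",
                         "4.0 (compatible; MSIE 6.0; Windows NT 5.1)")); // true
console.log(isIE4OrLater("Netscape", "4.7 [en] (Win98; I)"));            // false
```

The numeric comparison accepts any later release automatically, which is the behavior the paragraph above asks for; it still depends on the browser describing itself honestly.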
There are also upsides to using innerText:
Unlike VBScript's or JScript's Replace() function and Regular Expressions, innerText efficiently and quickly eliminates ANY HTML and DOM objects from the HTML document's BODY section. When using the Replace() function and Regular Expressions, you normally have to specify what characters, or groups of characters, to look for. Once the characters are found, you can use the Replace() function to replace those characters with something else.

For example, we could use a Regular Expression that looks for the opening (<) and closing (>) angle brackets of HTML tags in an HTML document and, using Replace(), replace those tags with something else, such as an empty string. This is one way to eliminate HTML tags from a text string. This would ONLY remove the tags themselves, however, not any scripting code between them.

So, in using this method, the following HTML data:

    <font size="2" face="arial" color="black">Hello! I'm just here for the tutorial...</font>

...would be changed to:

    Hello! I'm just here for the tutorial...

If we only had an HTML document that contained pure tags, no problem, this method of searching for and replacing text would work just fine. But, hah, what if we had some script in our page? Take a look at the following code:

    <script language="JavaScript">
    // Hi, I'm just here to prove Mike's point...duhh....
    alert("Hello! I'm a pop-up message box - just to annoy you!");
    </script>

Using the Replace() function and Regular Expressions, the result would look like:

    // Hi, I'm just here to prove Mike's point...duhh....
    alert("Hello! I'm a pop-up message box - just to annoy you!");

I don't think any Web programmer would want leftover script code showing up as text in whatever project he or she is working on. So, using Replace() and Regular Expressions has its downsides. The only real advantage is that you can use it at the server-side - which means you don't have to worry about whether or not a specific browser supports innerText!
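The tag-stripping approach described above can be sketched as follows. This is a minimal illustration, not the code used on this site: the function name stripTags and the pattern /<[^>]*>/g are assumptions chosen for the example, and the script sample is shortened.

```javascript
// Hypothetical helper: removes anything that looks like an HTML tag,
// i.e. a "<", any run of characters other than ">", then a ">".
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "");
}

var fontSample = '<font size="2" face="arial" color="black">' +
                 "Hello! I'm just here for the tutorial...</font>";
var scriptSample = '<script language="JavaScript">alert("Hello!");</script>';

// Pure markup: only the text survives.
console.log(stripTags(fontSample));   // Hello! I'm just here for the tutorial...

// Script block: the tags go away, but the script source is left behind as text.
console.log(stripTags(scriptSample)); // alert("Hello!");
```

The second call shows the downside discussed above: the pattern only knows about angle brackets, so whatever sits between the SCRIPT tags passes straight through.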