In the course of improving this website's search engine, I wrote a routine that would
extract the text from an article given a URL, strip out the HTML, and then convert all of the
white space and carriage returns into single spaces. This was done to reduce the size of the
text involved, which was then stored in the database and used for full-text searches.
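The whitespace step described above can be sketched in a few lines. This is a minimal illustration in plain JavaScript, not the routine the site actually runs; the helper name collapseWhitespace is hypothetical, and the final trim of leading and trailing spaces is an added assumption.

```javascript
// Hypothetical helper: collapse every run of whitespace (spaces, tabs,
// carriage returns, line feeds) into a single space.
function collapseWhitespace(text) {
  // \s+ matches one or more whitespace characters; the "g" flag
  // replaces every run, not just the first one.
  // trim() drops leading/trailing spaces (an assumption, not in the article).
  return text.replace(/\s+/g, " ").trim();
}

const sample = "First line\r\n\r\n   second\tline  ";
console.log(collapseWhitespace(sample)); // prints "First line second line"
```

Collapsing runs rather than replacing characters one by one is what keeps the stored text small: a blank line followed by indentation becomes one space instead of several.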
In order to strip out all of the HTML tags from the document, I used regular
expressions (with some help from Remas).
My code was written using ASP and VBScript (version 5.5 for RegExp support), but I'll
show how it can easily be done in ASP.NET.
First, let's look at the source code of the ASP function:
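Function RemoveHTML( strText )
    Dim RegEx
    Set RegEx = New RegExp
    ' Match any HTML tag: "<", any run of characters other than ">", then ">"
    RegEx.Pattern = "<[^>]*>"
    ' Replace every match, not just the first one
    RegEx.Global = True
    ' Lower-case the text, then turn <br> tags into line feeds
    strText = Replace(LCase(strText), "<br>", chr(10))
    ' Remove all remaining tags
    RemoveHTML = RegEx.Replace(strText, "")
End Function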
Note: This function returns all lower-case output. If you want to maintain the case of
your content, remove the LCase call and use four separate replaces, one each for
<br>, <Br>, <bR>, and <BR>.
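A case-insensitive match is another way to keep the original case without four separate replaces. The sketch below is plain JavaScript, not the ASP function itself; the name removeHtmlKeepCase is hypothetical, and accepting the <br/> and <br /> spellings is an addition the original function does not make. (VBScript's RegExp object also has an IgnoreCase property, so a similar approach should be possible there.)

```javascript
// Hypothetical equivalent of RemoveHTML that preserves letter case.
function removeHtmlKeepCase(text) {
  // The "i" flag matches <br>, <Br>, <bR>, and <BR> in one pass;
  // \s*\/? also accepts <br/> and <br /> (an addition, not in the article).
  const withBreaks = text.replace(/<br\s*\/?>/gi, "\n");
  // Same tag pattern as the article: "<", any run of non-">" characters, ">".
  return withBreaks.replace(/<[^>]*>/g, "");
}

// Prints "Hello", then "World" on the next line, with case intact.
console.log(removeHtmlKeepCase("<B>Hello</B><BR>World"));
```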
OK, now let's see how it would be done in ASP.NET. Just to make this article
more interesting, I'll list the code in all three standard languages
of .NET: VB, C#, and JScript.
C# version:

<%@ Import Namespace="System.IO" %>
<%@ Import Namespace="System.Text.RegularExpressions" %>
<script language="C#" runat="server">

    void SubmitBtn_Click(Object sender, EventArgs e) {
        String strInput;
        String strOutput;
        strInput = Text1.Text;
        // Replace every HTML tag with a single space.
        strOutput = Regex.Replace(strInput, "<[^>]*>", " ");
        output.Text = strOutput;
        // Show the original markup, HTML-encoded so the tags stay visible.
        output_raw.Text = Server.HtmlEncode(Text1.Text);
    }

</script>
<html>
<body>
<a href="/stevesmith/articles/removehtml.asp">Return To Article</a>
<form runat="server">
<table width="100%">
    <tr>
        <td valign="top" rowspan="2">
            Add HTML Formatted Text<br>
            <asp:TextBox TextMode="multiline" id="Text1" width="200px"
                height="80px" runat="server" /><br>
            <asp:Button OnClick="SubmitBtn_Click" Text="Format Text" Runat="server"/>
        </td>
        <td valign="top">
            Unformatted Text:
        </td>
        <td valign="top">
            <pre><asp:label id="output_raw" runat="server" /></pre>
        </td>
    </tr>
    <tr>
        <td valign="top">
            HTML-stripped Output:
        </td>
        <td valign="top">
            <pre><asp:label id="output" runat="server" /></pre>
        </td>
    </tr>
</table>
</form>
</body>
</html>
VB version:

<%@ Import Namespace="System.IO" %>
<%@ Import Namespace="System.Text.RegularExpressions" %>
<script language="VB" runat="server">

    Sub SubmitBtn_Click(sender As Object, e As EventArgs)
        Dim strInput As String
        Dim strOutput As String
        strInput = Text1.Text
        ' Replace every HTML tag with a single space.
        strOutput = Regex.Replace(strInput, "<[^>]*>", " ")
        output.Text = strOutput
        ' Show the original markup, HTML-encoded so the tags stay visible.
        output_raw.Text = Server.HtmlEncode(Text1.Text)
    End Sub

</script>
<html>
<body>
<a href="/stevesmith/articles/removehtml.asp">Return To Article</a>
<form runat="server">
<table width="100%">
    <tr>
        <td valign="top" rowspan="2">
            Add HTML Formatted Text<br>
            <asp:TextBox TextMode="multiline" id="Text1" width="200px"
                height="80px" runat="server" /><br>
            <asp:Button OnClick="SubmitBtn_Click" Text="Format Text" Runat="server"/>
        </td>
        <td valign="top">
            Unformatted Text:
        </td>
        <td valign="top">
            <pre><asp:label id="output_raw" runat="server" /></pre>
        </td>
    </tr>
    <tr>
        <td valign="top">
            HTML-stripped Output:
        </td>
        <td valign="top">
            <pre><asp:label id="output" runat="server" /></pre>
        </td>
    </tr>
</table>
</form>
</body>
</html>
JScript version: coming soon.
The full source of the example is shown above. You can run the example to see how it
works.
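One detail worth noticing: the ASP function replaces each tag with an empty string, while the ASP.NET pages replace it with a space. The sketch below, in plain JavaScript (not the JScript .NET page the article promises), shows what that difference does to adjacent words. The names TAG_PATTERN and stripTags are hypothetical, and following the tag removal with a whitespace collapse is an assumption based on the routine described at the start of the article.

```javascript
// Same tag pattern the article uses in both the ASP and ASP.NET code.
const TAG_PATTERN = /<[^>]*>/g;

// Hypothetical helper: replace every tag with the given string.
function stripTags(html, replacement) {
  return html.replace(TAG_PATTERN, replacement);
}

const html = "<p>First</p><p>Second</p>";

// Empty replacement (ASP version): words from adjacent elements run together.
console.log(stripTags(html, ""));  // prints "FirstSecond"

// Space replacement (ASP.NET version): words stay separate,
// at the cost of extra spaces around them.
console.log(stripTags(html, " ")); // prints " First  Second "

// Collapsing whitespace afterwards removes the extra spaces.
const forIndex = stripTags(html, " ").replace(/\s+/g, " ").trim();
console.log(forIndex);             // prints "First Second"
```

For full-text search the space replacement is probably the safer choice, since "FirstSecond" would be indexed as a single word that appears nowhere in the article.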
Other useful links on regular expressions: