Big Damn Heroes (MXBlogspace)


December 29, 2004

Manipulating HTML with CFMX and jTidy

Someone on the CFDJ list recently had a problem that went something like this: he had a bunch of user-created HTML locked up in his database. When rendering the HTML, he needed to pull out all the heading elements (<h1> through <h6>) and use each one's contents to populate its own id attribute. He then wanted to add an ordered list to the top of the page, containing links to those IDs.

So he had this:

...      
<h1>This is a heading</h1>      
<p>This is a paragraph</p>      
<h2>This is another heading</h2>      
<p>This is another paragraph</p>      
...      

...and wanted to turn it into this:

...      
<ol>      
  <li>      
    <a href="#This-is-a-heading">      
      Heading 1      
    </a>      
  </li>      
  <li>      
    <a href="#This-is-another-heading">      
      Heading 2      
    </a>      
  </li>      
</ol>      
<h1 id="This-is-a-heading">      
  This is a heading      
</h1>      
<p>This is a paragraph</p>      
<h2 id="This-is-another-heading">      
  This is another heading      
</h2>      
<p>This is another paragraph</p>      
...      

A number of folks suggested a series of ever-more-complicated regular expressions to address the problem, but I've found it much easier to just put jTidy and CFMX's native XML tools to work.

  1. Right up front, go download a copy of jTidy and drop it into your classpath.
  2. Borrow Greg's makexHtmlValid() function and add it to a component called jtidy.cfc.
  3. For parsing insurance, add jTidy.setXmlOut(true); to the function, right after the other jTidy.set* statements, so that jTidy emits well-formed XML that XmlParse() can read.
  4. In the same directory with jtidy.cfc, create this file:

<!--- Sample user-created HTML; in practice this would come from the database --->
<cfsavecontent variable="page">
<h1>Heading Number <i>1</i></h1>
<p>this is a paragraph</p>
<h2>Heading Number 2</h2>
<p>Another paragraph goes here.</p>
</cfsavecontent>

<!--- Run the markup through jTidy so that XmlParse() gets well-formed XML --->
<cfinvoke component="jtidy"
  method="makexHtmlValid"
  strtoparse="#page#"
  returnvariable="content" />

<cfset myxml = XmlParse(content) />

<!--- Match h1 through h6 only; the translate() test keeps <hr> out of the results --->
<cfset myheadings = XmlSearch(myxml,
  "//*[starts-with(name(),'h') and string-length(name()) = 2"
  & " and translate(substring(name(),2),'123456','') = '']") />

<cfdump var="#myheadings#">

<cfloop index="i" from="1" to="#ArrayLen(myheadings)#">
  <!--- Serialize the heading, then strip every tag, including any XML declaration --->
  <cfset headingText = Trim(REReplace(ToString(myheadings[i]), "<[^>]*>", "", "ALL")) />
  <!--- Replace each run of whitespace with a hyphen to form the id --->
  <cfset myheadings[i].XmlAttributes.id = REReplace(headingText, "\s+", "-", "ALL") />
</cfloop>

<cfdump var="#myxml#">
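The listing above sets the ids but stops short of building the ordered list of links. Here is a minimal sketch of that last step, reusing myxml and myheadings from the listing. It rests on a few assumptions: the jTidy output has a root element with a body child, that body has at least one child element, and the variable names toc, item and link are made up for illustration. If the tidied document declares the XHTML namespace, the new elements may serialize with an empty xmlns attribute, so treat this as a starting point, not a finished solution.

```cfml
<!--- Assumption: myxml and myheadings exist, and every heading already has an id --->
<cfset toc = XmlElemNew(myxml, "ol") />

<cfloop index="i" from="1" to="#ArrayLen(myheadings)#">
  <cfset item = XmlElemNew(myxml, "li") />
  <cfset link = XmlElemNew(myxml, "a") />
  <!--- "##" is an escaped pound sign, so href becomes "#" plus the heading's id --->
  <cfset link.XmlAttributes.href = "##" & myheadings[i].XmlAttributes.id />
  <cfset link.XmlText = "Heading #i#" />
  <cfset ArrayAppend(item.XmlChildren, link) />
  <cfset ArrayAppend(toc.XmlChildren, item) />
</cfloop>

<!--- Assumption: the root element has a body child; put the list first inside it --->
<cfset ArrayInsertAt(myxml.XmlRoot.body.XmlChildren, 1, toc) />
```

The link text follows the "Heading 1", "Heading 2" pattern from the target markup shown earlier. Building the list as XML nodes keeps it inside the parsed document, so it is serialized along with everything else.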

To output the results as HTML, just use:

<cfoutput>#ToString(myxml)#</cfoutput>  

In my opinion, the result is easier to understand, and a whole lot more flexible than using a regex or three. How about you?

UPDATE: I was missing a Trim() in the id-setting code, which caused unwanted artifacts to show up in the final HTML.

12-29-2004 05:53:44PM
category: XML
related topics: (CFMX) (XHTML) (jTidy) (CFDJ) (list)