2009-03-26

NetTool, a great HTTP request handling tool

After getting annoyed by the OS X font bug in http4e, I searched for a free HTTP request analysis tool that lets you freely manipulate the request. I needed this because I'm currently developing a RESTful web service that delivers XML and JSON to clients. In some browsers the clients had problems completing their Ajax requests, so I wanted to check whether the right content type and charset were returned when Accept headers had been set on the request.
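
The same check can also be scripted. What follows is only a minimal sketch using the JDK's HttpURLConnection, not what NetTool does internally. The class name, the helper names and the URL in main are made up for illustration, so point it at your own service.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

public class AcceptHeaderCheck {

    // Extracts the charset parameter from a Content-Type header value,
    // e.g. "application/json; charset=UTF-8" -> "UTF-8". Returns null if absent.
    public static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                return p.substring("charset=".length()).replace("\"", "").trim();
            }
        }
        return null;
    }

    // Sends a GET request with the given Accept header and returns the
    // Content-Type header of the response (may be null).
    public static String fetchContentType(String url, String accept) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("GET");
        con.setRequestProperty("Accept", accept);
        try {
            con.getResponseCode(); // forces the request to be sent
            return con.getContentType();
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical endpoint, pass your own URL as the first argument
        String url = args.length > 0 ? args[0] : "http://localhost:8080/resource";
        for (String accept : new String[] { "application/xml", "application/json" }) {
            String contentType = fetchContentType(url, accept);
            System.out.println(accept + " -> " + contentType
                    + " (charset: " + charsetOf(contentType) + ")");
        }
    }
}
```

Running it once per Accept value shows quickly whether the service answers with the content type and charset the client asked for.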

I stumbled upon NetTool, and it worked like a breeze. There are only two caveats:

  • The keyboard shortcuts don't match their OS X counterparts, so I have to use CTRL+C instead of COMMAND+C to copy values from the text fields

  • Tunnelling from port 80 only works when NetTool is started with sudo, due to user access rights (of course that's the right thing to do, but coming from Windows originally I fell into that trap at first)

2009-03-07

Using JTidy to clean up non-valid HTML pages

A couple of days ago I tried to modify an HTML page using a DOM. When I tried to convert the page into a Document instance, the parser threw SAXExceptions complaining about the structure of the document, along the lines of "this tag needs to be closed".
The source was HTML output generated from DocBook. I had neither the time nor the inclination to clean up the generated HTML by hand, but I remembered there was something called Tidy.
So I searched for a Java library, and there it was: JTidy. It looks like an unmaintained project, but it is the right tool to clean up an HTML page and transform it into valid XHTML.
The API is pretty straightforward.

This is the implementation for converting (non-valid) html to a Document instance:

// Create instance
final Tidy tidy = new Tidy();

// Remove presentational clutter: tags such as <font>
// are replaced by style rules and structural markup
tidy.setMakeClean( true );

// Use XHTML output
tidy.setXHTML( true );

// Make document readable by indenting the elements
tidy.setSmartIndent( true );

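// The HTML document received by a GET request
final String s = ...;

// Read and write UTF-8 instead of relying on defaults
// (these setters exist in JTidy r938; older builds use setCharEncoding)
tidy.setInputEncoding( "UTF-8" );
tidy.setOutputEncoding( "UTF-8" );

// Converting the page into a Document instance
final Document document = tidy.parseDOM( new ByteArrayInputStream( s.getBytes( "UTF-8" ) ) , null );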

That's it. You now have your HTML as a Document instance that you can freely manipulate.
The only thing I noticed was that the method node.setTextContent() does not work. But you can use node.appendChild( document.createTextNode( ... ) ) instead, which does what you want.
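
If you need that workaround in more than one place, it can be wrapped in a small helper. This is only a sketch: the class and method names are made up, and the demo in main runs against the JDK's own DOM implementation rather than JTidy, so it assumes JTidy's nodes support removeChild() in the same way. Unlike a bare appendChild(), the helper removes the existing children first, which is closer to what setTextContent() would do.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DomText {

    // Replaces all children of the given node with a single text node.
    // The node must belong to a document (do not pass the Document itself).
    public static void replaceText(Node node, String text) {
        while (node.hasChildNodes()) {
            node.removeChild(node.getFirstChild());
        }
        node.appendChild(node.getOwnerDocument().createTextNode(text));
    }

    public static void main(String[] args) throws Exception {
        // Plain JDK document, used here only to demonstrate the helper
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element p = document.createElement("p");
        document.appendChild(p);
        p.appendChild(document.createTextNode("old"));

        replaceText(p, "new");
        System.out.println(p.getFirstChild().getNodeValue()); // prints "new"
    }
}
```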

The second part is about writing your Document to a string:

// Create a stream to write the output to
final ByteArrayOutputStream outStr2 = new ByteArrayOutputStream();

// Write modified Document to the output stream
tidy.pprint( document , outStr2 );

// Decode the bytes into a String (assumes Tidy wrote UTF-8)
final String validXHTML = new String( outStr2.toByteArray() , "UTF-8" );

At the end of the block you have your valid XHTML in a String.
