February 27, 2011

To sanitize user content, use an HTML parser

If you allow any HTML at all in user-submitted content, it is essential to sanitize that content by actually parsing the HTML and filtering out any tags or attributes you do not want to allow. If you fail to do so, your site may be vulnerable to XSS (cross-site scripting) attacks.

Q: “But isn’t it overkill to parse the HTML?  Can’t I use other techniques, such as regular expressions or simple string replacement, to filter out dangerous tags and attributes?”  A: No, and I’ll explain why.

You might invest considerable time creating a seemingly bullet-proof algorithm for sanitizing user content, but have you accounted for:

  • every vulnerable attribute? 
  • even ones you may have never heard of, like dynsrc?
  • malicious URLs included in CSS markup?
  • malicious code inside CSS expressions?
  • malicious code inside HTML comments?
  • malicious code inside a CDATA block?
  • the vbscript namespace?
  • base64-encoded content?
  • unicode-escaped content?
  • malformed tags?
  • malformed attributes?
  • exploits that involve XML namespaces?
  • exploits that involve server-side includes?
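To make a couple of these concrete, here is a minimal sketch of how a simple string-replacement filter can be defeated. The class `NaiveFilterDemo` and its `stripScriptTags` method are hypothetical names invented for this illustration; they stand in for the kind of homegrown filter described above and are not taken from any real library.

```java
import java.util.regex.Pattern;

class NaiveFilterDemo {
    // Hypothetical naive filter: delete every literal <script> or </script>
    // tag, ignoring case. It makes a single pass over the input.
    private static final Pattern SCRIPT_TAG = Pattern.compile("(?i)</?script>");

    public static String stripScriptTags(String input) {
        return SCRIPT_TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Malformed, nested tags: deleting the inner tags splices the
        // outer fragments together into a working script element.
        String nested = "<scr<script>ipt>alert(1)</scr</script>ipt>";
        System.out.println(stripScriptTags(nested));

        // An event-handler attribute: there is no script tag to match,
        // so the input passes through unchanged.
        String attribute = "<img src=x onerror=alert(1)>";
        System.out.println(stripScriptTags(attribute));
    }
}
```

Looping the replacement until nothing changes would close the first hole but not the second, and every other item on the list needs its own special case.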

Your safest bet is to put untrusted content through an HTML parser and apply a whitelist of allowed tags and attributes. Because a whitelist rejects everything you have not explicitly approved, you do not have to anticipate every attack in advance, which makes this the most reliable way I know to defeat cross-site scripting attacks. As a side benefit, the parser will also correct for missing closing tags, ensuring your page’s layout will not become FUBAR’d by user content.

After comparing the various open-source HTML parsers available for Java, I’ve decided I like Jsoup the best. In addition to parsing, fully customizable whitelisting, and correction of missing closing tags, Jsoup also provides a jQuery-like selector syntax for navigating HTML documents.
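As a rough sketch of what whitelist-based cleaning can look like with Jsoup, the example below passes untrusted markup through `Jsoup.clean`. It assumes a recent Jsoup release on the classpath, where the whitelist class is named `Safelist`; older releases call the same class `Whitelist`. The class name `JsoupSanitizeDemo`, the `sanitize` helper, and the sample input are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class JsoupSanitizeDemo {
    // Parse the untrusted fragment and keep only the tags and attributes
    // permitted by Jsoup's built-in "basic" list (simple text formatting
    // and links). Anything not on the list is dropped.
    public static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Sample input with an event-handler attribute, a script element,
        // and an unclosed <b> tag.
        String dirty =
            "<p onclick=\"steal()\">Hello <b>world<script>alert(1)</script></p>";
        System.out.println(sanitize(dirty));
    }
}
```

The built-in list is only a starting point. Methods such as `addTags` and `addAttributes` on `Safelist` let you extend it to match what your site actually needs to allow.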

So in conclusion, sanitize your untrusted content against cross-site scripting attacks by using an HTML parser. The extra overhead of parsing is worth the peace of mind.

In the coming days I will write about a small project involving Jsoup, so watch for that if you are interested.
