HTML Cleaning Parser
See: Cleaning Parser, HTML File, NekoHTML, HtmlCleaner, TagSoup, jTidy.
References
2008
- http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
- I had to do some HTML parsing today, but unfortunately most HTML on the web is not well-formed like any markup I’d create. Missing end tags and other broken syntax throws a wrench into the situation. … Once you parse HTML, you can do some cool stuff with it like transform it or extract some information. For that reason it is sometimes used for screen scraping. So, to test the parsing libraries, I decided to do exactly that and see if I could parse the HTML well enough to extract links from it using an XQuery. The contenders were NekoHTML, HtmlCleaner, TagSoup, and jTidy. … However, the clear winner was HtmlCleaner. It was the only library to successfully clean 10/10 documents. … One drawback to HtmlCleaner is that it’s not available in a Maven repository. Sometimes NekoHTML may be easier to use for this reason.
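The link-extraction task described in the quote can be illustrated with a minimal sketch. It assumes a swapped-in technique: Python's standard-library `html.parser` rather than any of the four Java libraries compared above, and a plain tag callback rather than an XQuery. The names `LinkExtractor` and `extract_links` and the sample markup are hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports each start tag as it is seen, so links are
        # found even when the matching end tags are missing.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values of all <a> start tags, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Sample markup with unclosed <p>, <a>, and <div> tags (assumed input).
broken = (
    '<html><body><p>First <a href="http://example.com/a">A'
    '<p>Second <a href="b.html">B</a>'
    '<div><a href="c.html">C</body>'
)

if __name__ == "__main__":
    for link in extract_links(broken):
        print(link)
```

Note that this sketch only tolerates malformed input while scanning it; unlike a cleaning parser such as HtmlCleaner, it does not produce a repaired, well-formed document tree that could then be queried or transformed.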