Friday, March 6, 2009

HTML Parsing in Java

In this article, I will walk you through how Java can be used to parse HTML pages. The API I am using ships with the JDK and is described in the JDK 6 documentation. The link is as follows: http://java.sun.com/javase/6/docs/api/index.html?javax/swing/text/html/parser/Parser.html

Before going into the details of the code, I will explain how the parsing works.

The HTMLParser class extends the HTMLEditorKit class of the javax.swing.text.html package. HTMLEditorKit.getParser() is protected, so HTMLParser overrides it as a public method and returns the parser to the Main class, which parses the content of a URL. The URL content is read through an InputStreamReader wrapped around the input stream of an HttpURLConnection.
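One detail worth a sketch here: an InputStreamReader created without a charset decodes bytes with the platform default encoding, which may not match the page. The helper below is a minimal sketch, not part of the original code. The class name ReaderFactory, its methods, and the UTF-8 fallback are all assumptions made for illustration. It picks the charset out of a Content-Type header value, which URLConnection.getContentType() returns.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: chooses a charset for decoding an HTTP response body.
class ReaderFactory {

    // Extracts the charset from a header value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is missing or names an unknown charset.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Malformed or unsupported name: keep looking, then fall back.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Wraps the stream in a Reader that decodes with the chosen charset.
    static Reader open(InputStream in, String contentType) {
        return new InputStreamReader(in, charsetOf(contentType));
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

With an HttpURLConnection named connection, the call would look like ReaderFactory.open(connection.getInputStream(), connection.getContentType()).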

The Main class calls the parser's parse() method, passing it the reader and a callback object. As the parser works through the document, it calls the callback's handleStartTag(), handleEndTag(), handleSimpleTag() and related methods.
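To make the call flow above concrete, here is a small sketch that parses an HTML string instead of a URL and records each callback as it happens. The class name CallbackTrace, its nested Kit class, and the sample markup are made up for this illustration. The parser may also report tags it inserts itself (for example a missing head), so the printed list can be longer than the markup suggests.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical demo class: records the callbacks in the order the parser makes them.
class CallbackTrace {

    // Same trick as HTMLParser in this article: expose the protected getParser().
    static class Kit extends HTMLEditorKit {
        @Override
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    static List<String> trace(String html) {
        final List<String> events = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                events.add("start:" + tag);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                events.add("end:" + tag);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                events.add("simple:" + tag);
            }

            @Override
            public void handleText(char[] data, int pos) {
                events.add("text:" + new String(data));
            }
        };
        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document itself.
            new Kit().getParser().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"x.html\">Link</a>"
                + "<img src=\"a.png\" alt=\"pic\"></body></html>";
        for (String event : trace(html)) {
            System.out.println(event);
        }
    }
}
```

The anchor element arrives as a start callback, then a text callback, then an end callback, while the img element, which has no end tag, arrives through handleSimpleTag().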

There are three classes I am using to parse the HTML page.

------------------------------------------------------------------------
package htmlparsing;

import javax.swing.text.html.HTMLEditorKit;

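// HTMLEditorKit.getParser() is protected; this subclass makes it public
// so that other classes can obtain the parser.
public class HTMLParser extends HTMLEditorKit {

    @Override
    public HTMLEditorKit.Parser getParser() {
        return super.getParser();
    }

}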

------------------------------------------------------------------------
package htmlparsing;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

// Receives one callback per element as the parser walks the document.
public class HTMLParserCallback extends HTMLEditorKit.ParserCallback {

    // Called for every opening tag that has a matching end tag, e.g. <a>.
    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.A) {
            Object attribute = attributes.getAttribute(HTML.Attribute.HREF);
            if (attribute != null) {
                // Do Anything
            }
        }
    }

    // Called for every closing tag, e.g. </a>.
    @Override
    public void handleEndTag(HTML.Tag tag, int position) {
        if (tag == HTML.Tag.A) {
            // Do Anything
        }
    }

    // Called for tags that have no end tag, e.g. <img> or <br>.
    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.IMG) {
            Object attribute = attributes.getAttribute(HTML.Attribute.ALT);
            if (attribute != null) {
                // Do Anything
            }
        }
    }
}
------------------------------------------------------------------------

package htmlparsing;

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.swing.text.html.HTMLEditorKit;

public class Main {

    public static void main(String[] args) {
        HTMLParser kit = new HTMLParser();
        HTMLEditorKit.Parser parser = kit.getParser();
        try {
            URL url = new URL("http://www.javaworld.com");
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            // No charset is given, so the platform default encoding is used.
            Reader reader = new InputStreamReader(connection.getInputStream());
            try {
                HTMLEditorKit.ParserCallback callback = new HTMLParserCallback();
                // true = ignore any charset declared inside the document itself.
                parser.parse(reader, callback, true);
            } finally {
                // Close the reader even if parsing fails.
                reader.close();
            }
        } catch (Exception e) {
            System.err.println(e);
        }
    }
}
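
As an illustration of what could go in place of the "Do Anything" comments, the sketch below collects every link target and every image alt text it sees. The class name LinkCollector, its fields, and the collect() helper are hypothetical and not part of the three classes above. It parses a string so that it can be run without a network connection.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical callback: gathers href values of <a> tags and alt texts of <img> tags.
class LinkCollector extends HTMLEditorKit.ParserCallback {

    final List<String> links = new ArrayList<String>();
    final List<String> imageAlts = new ArrayList<String>();

    // Exposes the protected getParser(), as HTMLParser does in this article.
    static class Kit extends HTMLEditorKit {
        @Override
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.A) {
            Object href = attributes.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.IMG) {
            Object alt = attributes.getAttribute(HTML.Attribute.ALT);
            if (alt != null) {
                imageAlts.add(alt.toString());
            }
        }
    }

    // Parses the given markup and returns the filled collector.
    static LinkCollector collect(String html) {
        LinkCollector collector = new LinkCollector();
        try (Reader reader = new StringReader(html)) {
            new Kit().getParser().parse(reader, collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector;
    }

    public static void main(String[] args) {
        LinkCollector result = collect("<html><body><a href=\"http://example.com/a\">A</a>"
                + "<img src=\"p.png\" alt=\"A picture\"></body></html>");
        System.out.println("links: " + result.links);
        System.out.println("alts:  " + result.imageAlts);
    }
}
```

To use it against a live page, the same callback could be passed to parser.parse() in the Main class instead of HTMLParserCallback.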


I hope you all find it useful. Any suggestions or feedback would be much appreciated.
