It provides the following three groups of callback methods, one per kind of event, in which we can write custom logic to take a specific action on each event:
- startDocument() and endDocument() – Methods called at the start and end of an XML document.
- startElement() and endElement() – Methods called at the start and end of a document element.
- characters() – Method called with the text contents in between the start and end tags of an XML document element.
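To see the order in which these callbacks fire, here is a minimal sketch that logs each event while parsing a small in-memory document. The class names EventLogDemo and LoggingHandler are made up for this illustration and are not part of the project.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: EventLogDemo and LoggingHandler are hypothetical names.
class EventLogDemo {

    /** Records every SAX callback it receives, in order. */
    static class LoggingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startDocument() { events.add("startDocument"); }

        @Override
        public void endDocument() { events.add("endDocument"); }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            // The default factory is not namespace-aware, so qName holds the tag name.
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Skip whitespace-only chunks between tags.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("text:" + text);
            }
        }
    }

    /** Parses the given XML string and returns the logged events. */
    static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        LoggingHandler handler = new LoggingHandler();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        parse("<books><book>Java</book></books>").forEach(System.out::println);
    }
}
```

Note that a parser is allowed to deliver the text of one element in several characters() calls, so real handlers usually append to a buffer instead of assuming a single call.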
I am going to reuse the code from my old blog post xml-parsing-using-saxparser and update it for this purpose. The final code is available in the GitHub project java-read-big-xml-to-csv.
Figure: Java HUGE XML to CSV - project structure
How to Import/Run:
It's a simple Maven project (with no dependencies). You can import it into your IDE or use the command line to compile and run it. If you plan on using the command line, go to the root of the project and run mvnw clean package to compile and create a runnable jar file.
Then you can run the executable as follows:
java -jar target\xmltocsv-FINAL.jar C:\folder\input.xml C:\folder\output.csv
The code:
SaxParseEventHandler:
The SaxParseEventHandler class takes the RecordWriter as a constructor parameter.
public SaxParseEventHandler(RecordWriter<Book> writer) {
We create a new book record on the startElement() event:
public void startElement(String s, String s1, String elementName, Attributes attributes) {
    /* handle start of a new Book tag and attributes of an element */
    if (elementName.equalsIgnoreCase("book")) { // start
        bookTmp = new Book();
And we write the parsed book data to the file on the endElement() event:
public void endElement(String s, String s1, String element) {
if (element.equals("book")) { //end
writer.write(bookTmp, counter);
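The excerpts above show only the start and end of a book element. As a rough, self-contained sketch of how the three callbacks can work together, the example below buffers text in characters() and assigns it to the record when the matching end tag arrives. The Book fields (id, title, author), the element names, and the use of a plain Writer in place of RecordWriter are assumptions made for illustration; the real project may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: names and fields below are hypothetical.
class BookHandlerSketch {

    /** Hypothetical record type; the real Book class may have other fields. */
    static class Book {
        String id, title, author;

        @Override
        public String toString() {
            // Naive CSV line: no quoting or escaping of commas inside values.
            return id + "," + title + "," + author + "\n";
        }
    }

    static class Handler extends DefaultHandler {
        private final Writer out;
        private final StringBuilder text = new StringBuilder();
        private Book bookTmp;

        Handler(Writer out) { this.out = out; }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            text.setLength(0); // start collecting text for this element
            if (qName.equalsIgnoreCase("book")) {
                bookTmp = new Book();
                bookTmp.id = attrs.getValue("id");
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (bookTmp == null) {
                return; // not inside a <book>
            }
            switch (qName.toLowerCase(Locale.ROOT)) {
                case "title":
                    bookTmp.title = text.toString().trim();
                    break;
                case "author":
                    bookTmp.author = text.toString().trim();
                    break;
                case "book":
                    try {
                        out.write(bookTmp.toString()); // one CSV line per book
                    } catch (IOException e) {
                        throw new SAXException(e);
                    }
                    bookTmp = null; // the record is no longer held in memory
                    break;
                default:
                    break;
            }
        }
    }

    /** Converts a small XML string to CSV text using the handler above. */
    static String toCsv(String xml) throws Exception {
        StringWriter out = new StringWriter();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), new Handler(out));
        return out.toString();
    }
}
```

Because each record is written and released as soon as its end tag is seen, memory use stays roughly constant regardless of the input size.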
RecordWriter:
It's a simple wrapper around FileWriter that writes content to a file. We currently write T.toString() to the file and flush every 10,000 records.
public void write(T t, int n) throws IOException {
    fw.write(t.toString());
    if (n % 10000 == 0) {
        fw.flush();
    }
}
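For context, a complete wrapper along these lines could look like the sketch below. It is an approximation, not the project's actual class: the name RecordWriterSketch and the Path constructor are made up, and it uses Files.newBufferedWriter with an explicit UTF-8 charset instead of a bare FileWriter. Implementing AutoCloseable is what lets such a class be used in a try-with-resources block.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: a hypothetical stand-in for the project's RecordWriter.
class RecordWriterSketch<T> implements AutoCloseable {
    private final Writer fw;

    RecordWriterSketch(Path output) throws IOException {
        // Assumption: buffered UTF-8 output instead of a plain FileWriter.
        this.fw = Files.newBufferedWriter(output, StandardCharsets.UTF_8);
    }

    /** Writes one record; n is the running record count. */
    public void write(T t, int n) throws IOException {
        fw.write(t.toString());
        if (n % 10000 == 0) {
            fw.flush(); // push buffered data to disk periodically
        }
    }

    @Override
    public void close() throws IOException {
        fw.close(); // also flushes any remaining buffered records
    }
}
```

A caller would then write `try (RecordWriterSketch<Book> w = new RecordWriterSketch<>(path)) { ... }` so the file is closed even if parsing fails.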
Main:
It's the main 'launcher' class:
SAXParserFactory factory = SAXParserFactory.newInstance();
try (RecordWriter<Book> w = new RecordWriter<>(outputCSV)) {
SAXParser parser = factory.newSAXParser();
parser.parse(inputXml, new SaxParseEventHandler(w));
}
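If you want to time a run yourself, a launcher can wrap the parse call with System.nanoTime(). The sketch below is a simplified stand-in: LauncherSketch and CountingHandler are hypothetical names, and the handler only counts book elements instead of writing CSV, so that the example stays self-contained.

```java
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: not the project's Main class.
class LauncherSketch {

    /** Stand-in handler that only counts <book> elements. */
    static class CountingHandler extends DefaultHandler {
        long books;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (qName.equalsIgnoreCase("book")) {
                books++;
            }
        }
    }

    /** Streams the file through SAX and returns the number of book elements. */
    static long countBooks(File inputXml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CountingHandler handler = new CountingHandler();
        parser.parse(inputXml, handler);
        return handler.books;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java LauncherSketch <input.xml>");
            System.exit(1);
        }
        long start = System.nanoTime();
        long n = countBooks(new File(args[0]));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(n + " book elements parsed in " + elapsedMs + " ms");
    }
}
```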
Results on a Windows machine with 16GB RAM, Core i5, 6MB L3 cache, and an SSD:
Max RAM usage: 190MB
Time Taken:
For the file big2.xml with size 118MB
- JDK 8 - 8-9 sec
- JDK 11 - 6-7 sec
- JDK 14 - 5 sec
big3.xml with size 6.58GB takes about 2 minutes