Parsing chunks of XML documents with JAXB


JAXB is a great Java library for mapping XML documents to Java objects and vice versa. But how can JAXB be used to parse large XML documents?
The Unofficial JAXB Guide contains a small section which provides some useful information about this topic.

Assume we have an XML document similar to the following:

<Example id="10" date="1970-01-01" version="1.0">
   <Properties>...</Properties>
   <Summary>...</Summary>
   <Document id="1">...</Document>
   <Document id="2">...</Document>
   <Document id="3">...</Document>
</Example>

Now I want to unmarshal the Example element into the corresponding Example object. Doing so unmarshals the whole XML document, and if the document contains hundreds of thousands of Document elements, that consumes a huge amount of memory. At this point, however, I'm only interested in the Example element with its Properties and Summary elements. The Document elements can be parsed afterwards, in chunks.

To reach that goal I use virtual infosets, as described in the JAXB Guide. I created a simple class named PartialXmlEventReader, which implements XMLEventReader and delegates all method calls to a parent reader. Its constructor takes the QName of an element. When the reader encounters the first start element with that name, it closes the parent element by returning an EndElement event instead. To the caller of the reader, the XML document above therefore looks like this:

<Example id="10" date="1970-01-01" version="1.0">
  <Properties>...</Properties>
  <Summary>...</Summary>
</Example>


Because the parent reader is still positioned at the first Document start element, we can use the same parent reader to parse the Document elements one by one.
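The reader class itself is only linked from this post, not shown, so here is a minimal sketch of how a reader with the behaviour described above could work. It is not the original PartialXmlEventReader: the class name PartialReaderSketch, its helper methods, and the demo method are assumptions made for illustration. It uses only the StAX API that ships with the JDK and looks one event ahead with peek(), so the parent reader never consumes the stop element. Note that getElementText() and nextTag() are still forwarded to the parent unchanged, so the sketch only covers callers that read through hasNext(), peek() and nextEvent().

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NoSuchElementException;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import javax.xml.stream.util.EventReaderDelegate;

// Hypothetical sketch; not the PartialXmlEventReader referenced in the post.
class PartialReaderSketch extends EventReaderDelegate {
    private final QName stopAt;                           // element that must stay unread
    private final Deque<QName> open = new ArrayDeque<>(); // currently open elements
    private final XMLEventFactory factory = XMLEventFactory.newInstance();
    private boolean cut;      // true once the stop element has been reached
    private boolean finished; // true once the synthetic EndDocument was returned

    PartialReaderSketch(XMLEventReader parent, QName stopAt) {
        super(parent);
        this.stopAt = stopAt;
    }

    // True if the parent's next event is the start element we must not consume.
    private boolean parentAtStopElement() throws XMLStreamException {
        XMLEvent ahead = getParent().peek();
        return ahead != null && ahead.isStartElement()
                && ahead.asStartElement().getName().equals(stopAt);
    }

    // Synthetic closing events: innermost open element first, then EndDocument.
    private XMLEvent closingEvent(boolean consume) {
        if (open.isEmpty()) {
            if (consume) finished = true;
            return factory.createEndDocument();
        }
        QName n = consume ? open.pop() : open.peek();
        return factory.createEndElement(n.getPrefix(), n.getNamespaceURI(), n.getLocalPart());
    }

    @Override
    public boolean hasNext() {
        return !finished && (cut || getParent().hasNext());
    }

    @Override
    public XMLEvent peek() throws XMLStreamException {
        if (finished) return null;
        if (cut || parentAtStopElement()) return closingEvent(false);
        return getParent().peek();
    }

    @Override
    public XMLEvent nextEvent() throws XMLStreamException {
        if (finished) throw new NoSuchElementException();
        if (!cut && parentAtStopElement()) cut = true;
        if (cut) return closingEvent(true);
        XMLEvent e = getParent().nextEvent();
        if (e.isStartElement()) open.push(e.asStartElement().getName());
        else if (e.isEndElement()) open.pop();
        return e;
    }

    @Override
    public Object next() {
        try {
            return nextEvent();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Lists the element events the partial reader exposes, then reports
    // which element the parent reader is positioned at afterwards.
    static String demo(String xml, String stopAt) {
        try {
            XMLEventReader parent =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            XMLEventReader partial = new PartialReaderSketch(parent, new QName(stopAt));
            StringBuilder seen = new StringBuilder();
            while (partial.hasNext()) {
                XMLEvent e = partial.nextEvent();
                if (e.isStartElement()) {
                    seen.append('<').append(e.asStartElement().getName().getLocalPart()).append('>');
                } else if (e.isEndElement()) {
                    seen.append("</").append(e.asEndElement().getName().getLocalPart()).append('>');
                }
            }
            XMLEvent next = parent.peek();
            String at = next != null && next.isStartElement()
                    ? next.asStartElement().getName().getLocalPart() : "end";
            return seen + " | parent at: " + at;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(
                "<Example><Summary>s</Summary><Document id=\"1\"/><Document id=\"2\"/></Example>",
                "Document"));
    }
}
```

The sketch tracks the open elements on a stack so that it can close every ancestor of the stop element, not just its direct parent. That is a design choice of this sketch; the post only states that the parent element is closed.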

The following code demonstrates the use of the PartialXmlEventReader:

@Test
public void testChunks() throws JAXBException, XMLStreamException {

  final QName qName = new QName("Document");

  InputStream in = getClass().getResourceAsStream("example.xml");
  if (in == null)
    throw new NullPointerException("example.xml not found on the classpath");

  // create an XML event reader for the input stream
  XMLInputFactory xif = XMLInputFactory.newInstance();
  XMLEventReader reader = xif.createXMLEventReader(in);

  // initialize JAXB
  JAXBContext jaxbCtx = JAXBContext.newInstance(Example.class, Document.class);
  Unmarshaller um = jaxbCtx.createUnmarshaller();

  // unmarshal the Example element without parsing the Document elements
  Example example = um.unmarshal(new PartialXmlEventReader(reader, qName), Example.class).getValue();

  assertNotNull(example);
  assertEquals("My Properties", example.getProperties());
  assertEquals("My Summary", example.getSummary());
  assertNull(example.getDocument());

  Long docId = 0L;
  XMLEvent e = null;

  // loop through the rest of the XML stream using the parent reader
  while ((e = reader.peek()) != null) {

    // check whether the next event is a Document start element
    if (e.isStartElement() && ((StartElement) e).getName().equals(qName)) {

      // unmarshal a single Document element
      Document document = um.unmarshal(reader, Document.class).getValue();

      assertNotNull(document);
      assertEquals(++docId, document.getId());

    } else {
      // skip everything that is not a Document start element
      reader.next();
    }
  }
  // this assertion expects the test resource example.xml to contain ten Document elements
  assertEquals(Long.valueOf(10), docId);
}

You can find the source code of the PartialXmlEventReader here.