Java PDF to Image Conversion using PDFBox

Here are few snippets of code that you can use to convert a PDF file to images eg: tiff, png, multi-page tiff. It uses a open source library PDFBox version 3

The Apache PDFBox® library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents.

PDFBox comes with following features:

  • Extract Unicode text from PDF files.
  • Split a single PDF into many files or merge multiple PDF files.
  • Extract data from PDF forms or fill a PDF form.
  • Validate PDF files against the PDF/A-1b standard.
  • Print a PDF file using the standard Java printing API.
  • Save PDFs as image files, such as PNG or JPEG.
  • Create a PDF from scratch, with embedded fonts and images.

Convert PDF to PNG/JPG/BMP etc:

git reset/revert to previous commit

Sometimes we "accidentally" merge something that's not ready for release/qa  and need to rollback/reset/revert the commits to a cleaner state.

Here's how you can safely reset your git to previous commit:

1) find the older commit id to where you want to reset.

2) do git status to confirm there are no uncommitted changes on the current branch  

3) reset to a commit,

git reset --hard COMMIT_ID
git reset HEAD@{1}
git add .
git commit -m "reverting to commit COMMIT_ID"

4) cherry pick or patch any other commit that you like to add


How to make integration tests faster without @DirtiesContext

If your only excuse to use @DirtiesContext is to re-initialize database between tests, then this blog post is for you.

Spring's @DirtiesContext is very useful to make your integration tests faster by indicating the underlying Spring ApplicationContext is modified and forcing it to reload the context (not the whole application) between tests.

This is particularly useful when we want to reset the state of database between tests. This allows us to group many tests together in a same class so that we can easily share some test data, assertions etc. So your test code would look this this:

@SpringBootTest
@DirtiesContext(classMode = AFTER_EACH_TEST_METHOD)
class SomeTest {

@Autowired SomeService serv;
@Autowired DataCreator dc;

@Test
void someTest1() {
//prepare data
dc.initData();
//run test
serv.modifySomeData();
//verify
}

@Test
void someTest2() {
//prepare data
dc.initData(2);
//run test
serv.modifyAnotherData();
serv.doSomethingElse();
//verify
}
}

Problem with @DirtiesContext

This has one major flaw though. By resetting the ApplicationContext, it would also drop all the databases and need to re-create them after every test. So, if the only reason to use @DirtiesContext is to reset the database, then there's a better way of doing this.

Is deleting records from all table not a good solution?

Correct. Deleting records is still a slow operation. When you have a good amount to test data, the @DirtiesContext can sometimes becomes faster than deleting the records one by one. Also, there's referential integrity between tables that enforces you to follow a series of deletes to first delete records from child tables and go up. 

Table Truncate to the rescue

Truncate (truncate table X) is faster way to delete records from database than deleting records. If we already know the list of table and join tables then we can loop through them and execute the TRUNCATE TABLE command. One advantage of this approach is that we can skip truncating some lookup tables that will always have constant records.

How to Truncate H2 Tables:

Note that before we execute the TRUNCATE TABLE we should disable the referential integrity.
em.createNativeQuery("SET REFERENTIAL_INTEGRITY FALSE").executeUpdate();
for (String t : tableNames) {
em.createNativeQuery("TRUNCATE TABLE " + t ).executeUpdate();
}
em.createNativeQuery("SET REFERENTIAL_INTEGRITY TRUE").executeUpdate();

Integrating everything together (Sample App @ GitHub):

TestDataManager to grab list of tables, truncate and populate the test data:

import gt.app.DataCreator;
import gt.app.config.MetadataExtractorIntegrator;
import lombok.RequiredArgsConstructor;
import org.hibernate.boot.Metadata;
import org.hibernate.mapping.Collection;
import org.hibernate.mapping.PersistentClass;
import org.springframework.beans.factory.InitializingBean;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import javax.persistence.EntityManager;
import java.util.ArrayList;
import java.util.List;

@Service
@RequiredArgsConstructor
public class TestDataManager implements InitializingBean {
final EntityManager em;
final DataCreator dataCreator;

private final List<String> tableNames = new ArrayList<>(); //shared

@Override
public void afterPropertiesSet() {
Metadata metadata = MetadataExtractorIntegrator.INSTANCE.getMetadata();

for (Collection persistentClass : metadata.getCollectionBindings()) {
tableNames.add(persistentClass.getCollectionTable().getExportIdentifier());
}

for (PersistentClass persistentClass : metadata.getEntityBindings()) {
tableNames.add(persistentClass.getTable().getExportIdentifier());
}
}

@Transactional
public void truncateTablesAndRecreate() {
truncateTables();
dataCreator.initData();
}

void truncateTables() {
em.createNativeQuery("SET REFERENTIAL_INTEGRITY FALSE").executeUpdate();
for (String tableName : tableNames) {
em.createNativeQuery("TRUNCATE TABLE " + tableName).executeUpdate();
}
em.createNativeQuery("SET REFERENTIAL_INTEGRITY TRUE").executeUpdate();
}
}

Call the truncateTablesAndRecreate() @BeforeEach test

@SpringBootTest
class WebAppIT {

@Autowired
TestDataManager tdm;

@BeforeEach
void resetDB(){
tdm.truncateTablesAndRecreate();
}

 

This is awesome, how about other databases?

To get a List of tables: You can rely on Hibernate's Metadata to grab list of table/join tables

Truncate command: Each database has different command to set the referential integrity and truncate. This above example is for H2.

You can do the following for MySQL


//for MySQL:
em.createNativeQuery("SET @@foreign_key_checks = 0").executeUpdate();
for (String tableName : tableNames) {
em.createNativeQuery("TRUNCATE TABLE " + tableName).executeUpdate();
}
em.createNativeQuery("SET @@foreign_key_checks = 1").executeUpdate();

For other populate databases, you can take reference from Hibernate's Test code itself. They are available at: Hibernate GitHub. The *Cleaner classes have code snippet that describes how to perform truncate on each database.

How fast the tests ran after this update?

Very fast! In a big application (Spring Boot, H2) with 62 tables and 315 tests that were using @DirtiesContext, we reduced our test execution time from 18 minutes to 4minutes.

Full working example is available at this sample app.

JPA/Hibernate get find all Table and Column metadata

How to use Hibernate Metadata to find All columns and tables

Getting the Hibernate's Metadata object into the Spring application is tricky. Luckily, Hibernate Provides an Integrator API (org.hibernate.integrator.spi.Integrator) that we can use to customize/interact with Hibernate. Its the same API that Caching, Bean Validation etc library uses to integrate with Hibernate. 

Also, Spring Boot provides HibernatePropertiesCustomizer to link the 'hibernate.integrator_provider' property to Integrator implementation.

Here's how we can configure the Hibernate Integrator to read metadata.

Step 1) create a extractor implementation

This class is a singleton class and does absolutely nothing other than exposing the Metadata and Database. Since this is singleton this class and the database, metadata objects can be statically accessed using MetadataExtractorIntegrator.INSTANCE

import lombok.Data;
import org.hibernate.boot.Metadata;
import org.hibernate.boot.model.relational.Database;
import org.hibernate.engine.spi.SessionFactoryImplementor;
import org.hibernate.integrator.spi.Integrator;
import org.hibernate.service.spi.SessionFactoryServiceRegistry;

@Data
public class MetadataExtractorIntegrator implements Integrator {

public static final MetadataExtractorIntegrator INSTANCE =
new MetadataExtractorIntegrator();
private Database database;
private Metadata metadata;

@Override
public void integrate(Metadata metadata, SessionFactoryImplementor sf,
SessionFactoryServiceRegistry sr) {
this.database = metadata.getDatabase();
this.metadata = metadata;
}

@Override
public void disintegrate(SessionFactoryImplementor sf,
SessionFactoryServiceRegistry sr) {
}
}

Step 2) Register the Spring Hibernate Customizer


import org.hibernate.jpa.boot.spi.IntegratorProvider;
import org.springframework.boot.autoconfigure.orm.jpa.HibernatePropertiesCustomizer;
import org.springframework.context.annotation.Configuration;

import java.util.List;
import java.util.Map;

@Configuration
public class HibernateConfig implements HibernatePropertiesCustomizer {
@Override
public void customize(Map<String, Object> hibernateProps) {
hibernateProps.put("hibernate.integrator_provider",
(IntegratorProvider) () -> List.of(MetadataExtractorIntegrator.INSTANCE));
}
}

Step 3) Use MetadataExtractorIntegrator.metadata to extract the metadata

This is a simple Spring Component that uses Metadata.getCollectionBindings and Metadata.getEntityBindings to extract the tables, columns, PK and type

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.hibernate.boot.Metadata;
import org.hibernate.mapping.*;
import org.springframework.beans.factory.InitializingBean;
import org.springframework.stereotype.Component;

import javax.persistence.EntityManager;
import java.util.Iterator;

@Component
@RequiredArgsConstructor
@Slf4j
public class DBMetadataReader implements InitializingBean {
final EntityManager em;

@Override
public void afterPropertiesSet() {

Metadata metadata = MetadataExtractorIntegrator.INSTANCE.getMetadata();

//Collection tables
for (Collection c : metadata.getCollectionBindings()) {
log.info("Collection table: {}", c.getCollectionTable().getQualifiedTableName());
for (Iterator<Column> it = c.getCollectionTable().getColumnIterator();
it.hasNext(); ) {
Column property = it.next();
log.info(" {} {} ", property.getName(), property.getSqlType());
}
}

//all entities
for (PersistentClass pc : metadata.getEntityBindings()) {
Table table = pc.getTable();

log.info("Entity: {} - {}", pc.getClassName(), table.getName());

KeyValue identifier = pc.getIdentifier();

//PK
for (Iterator<Selectable> it = identifier.getColumnIterator();
it.hasNext(); ) {
Column column = (Column) it.next();
log.info(" PK: {} {}", column.getName(), column.getSqlType());
}

//property/columns
for (Iterator it = pc.getPropertyIterator();
it.hasNext(); ) {
Property property = (Property) it.next();

for (Iterator columnIterator = property.getColumnIterator();
columnIterator.hasNext(); ) {
Column column = (Column) columnIterator.next();
log.info(" {} {}", column.getName(), column.getSqlType());
}
}
}
}
}

Example Project:

Checkout my sample github project and the source for working example



SpringFox Swagger with Groovy - fixing slow startup / heap space error

Story of how we reduced app startup time by 20x and heap usage by 7x

Few months ago, we started developing a Spring Boot app in Groovy. It is a typical app with ton of CRUD logic and few REST endpoints. It also had a Swagger UI configured with SpringFox Swagger. Though it was taking ~35 seconds to start and taking ~700MB heap space, no-one in the team really cared about the slowness and memory usage of the application.

As our application grew bigger, we added more DB tables and endpoints which resulted in further slowness and required more heap space. It was ~3 minutes startup with almost 1.4GB heap at its worst state. Again no-one really bothered to fix it... until the application randomly failed to start and started crashing randomly with OutOfMemoryError.


Problem Identification: Thread Dump using IntelliJ

The IntelliJ thread-dump showing the main thread always stuck at springfox

I started the application in debug mode in IntelliJ and took thread dump several times to see what's the main thread is doing. I found that in all thread dumps, the main thread was stuck around springfox.documentation.spring.web.readers.parameter.ModelAttributeParameterExpander.expand() method.

So, I dig into the ModelAttributeParameterExpander#expand() method was able to quickly confirm that it was trying to expand the fields from the @ModelAttribute to render into the Swagger UI - which is exposed in a JSON endpoint host:port/v3/api-docs.  

Once the application started (after about 3 mins), I opened up the host:port/v3/api-docs endpoint locally and amazed by the size of JSON response. It WAS 1240MB! That explained why Swagger UI used to take longer. 


Problem Isolation:

To isolate the problem, I commented out all the endpoints that's using @ModelAttribute and just like I expected, the app started within 9 seconds, took about 200MB heap, the swagger UI loaded within one second, the host:port/v3/api-docs endpoint response was about 20KB.

@GetMapping("/test")
void test(@ModelAttribute A a) { //commented out all the endpoints to isolate problem

Then I dig through the code inside ModelAttributeParameterExpander#expand() and I found that its collecting all the fields and properties inside the @ModelAttribute class and recursively calls the expand() method for all the sub-fields and properties.

public List<Compatibility<Parameter, RequestParameter>> expand(ExpansionContext context) {
List<Compatibility<Parameter, RequestParameter>> parameters = new ArrayList();
Set<PropertyDescriptor> propertyDescriptors = this.propertyDescriptors(context.getParamType().getErasedType());
Map<Method, PropertyDescriptor> propertyLookupByGetter = this.propertyDescriptorsByMethod(context.getParamType().getErasedType(), propertyDescriptors);
Iterable<ResolvedMethod> getters = (Iterable)this.accessors.in(context.getParamType()).stream().filter(this.onlyValidGetters(propertyLookupByGetter.keySet())).collect(Collectors.toList());
...
Stream<ModelAttributeField> collectionTypes = attributes.stream().filter(this.isCollection().and(this.recursiveCollectionItemType(context.getParamType()).negate()));
collectionTypes.forEachOrdered((each) -> {
        ....
parameters.addAll(this.expand(childContext)); //nested

I dig further and found that it was grabing the getMetaClass method generated by Groovy and recursively looking at all the get* methods and the parameters inside MetaClass class.

public static class MyClass implements GroovyObject {
//..fields

@Generated
@Internal
@Transient
public MetaClass getMetaClass() { //this was treated as class property

MetaClass: it contains several get* method. All of them were processed by Swagger

public interface MetaClass extends MetaObjectProtocol {
Object invokeMethod(Class var1, Object var2, String var3, Object[] var4, boolean var5, boolean var6);

Object getProperty(Class var1, Object var2, String var3, boolean var4, boolean var5);

void setProperty(Class var1, Object var2, String var3, Object var4, boolean var5, boolean var6);

Object invokeMissingMethod(Object var1, String var2, Object[] var3);

Object invokeMissingProperty(Object var1, String var2, Object var3, boolean var4);

Object getAttribute(Class var1, Object var2, String var3, boolean var4);

void setAttribute(Class var1, Object var2, String var3, Object var4, boolean var5, boolean var6);

void initialize();

List<MetaProperty> getProperties();

List<MetaMethod> getMethods();

ClassNode getClassNode();

List<MetaMethod> getMetaMethods();

int selectConstructorAndTransformArguments(int var1, Object[] var2);

MetaMethod pickMethod(String var1, Class[] var2);
}


Solution:

I found that there were few discussions already in springfox GitHub project and solution was also discussed. The solution is to configure the Docket object to ignore MetaClass class using ignoredParameterTypes() method:

   @Bean
Docket docket(){
new Docket(DocumentationType.OAS_30)
.ignoredParameterTypes(MetaClass)
}

}

Now the application takes about 9 seconds to start and takes about 200MB heap space. That was a great reduction from 3min/1.4GB.


Reproducible example:

The following simple app was taking 10 seconds to start, the http://localhost:8080/v3/api-docs returns 58.5MB JSON response. Once I comment out the test method and rerun it just takes 1.7 seconds. Also, the http://localhost:8080/v3/api-docs returns 5KB JSON response.

package gt.swagger.demo

import org.springframework.boot.SpringApplication
import org.springframework.boot.autoconfigure.SpringBootApplication
import org.springframework.web.bind.annotation.*
import springfox.documentation.swagger2.annotations.EnableSwagger2

@SpringBootApplication
@EnableSwagger2
class SwaggerTestApp {
static void main(String[] args) {
SpringApplication.run(SwaggerTestApp, args)
}
}

@RestController
class Controller {

@GetMapping("/test")
void test(@ModelAttribute A a) {
}

static class A {
String a
}
}

When I add more endpoints with @ModelAttribute or when I nest the 'A' class with class B, the application starts taking a lot of time to start and takes up more heap space

static class A {
String a
B b1, b2, b3, b4, b5 ,b6, b7, b8, b9
}

static class B {
String b
}

The problem can be solved by configuring the following bean:

@Configuration
class SwaggerConfig {

@Bean
Docket docket(){
new Docket(DocumentationType.OAS_30)
.ignoredParameterTypes(MetaClass)
}

}

POM.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.5.2</version>
<relativePath/> <!-- lookup parent from repository -->
</parent>
<groupId>gt.swagg</groupId>
<artifactId>sb25-swagger3</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>

<properties>
<java.version>11</java.version>
</properties>

<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>io.springfox</groupId>
<artifactId>springfox-boot-starter</artifactId>
<version>3.0.0</version>
</dependency>
<dependency>
<groupId>org.codehaus.groovy</groupId>
<artifactId>groovy</artifactId>
</dependency>
</dependencies>

<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.codehaus.gmavenplus</groupId>
<artifactId>gmavenplus-plugin</artifactId>
<version>1.11.0</version>
<executions>
<execution>
<goals>
<goal>addSources</goal>
<goal>addTestSources</goal>
<goal>generateStubs</goal>
<goal>compile</goal>
<goal>generateTestStubs</goal>
<goal>compileTests</goal>
<goal>removeStubs</goal>
<goal>removeTestStubs</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>

 

Lesson learned:

- Pay attention on why your application is starting slow. Even a small mis-configuration like this might use up lots of resources

- Developer hour cost $$$. Make sure developers are not burning $$$ on non-trivial things like the slow application start

- Focus on how to increase developer productivity. I have seen devs compiling the whole project after every change to test it. Watch/educate developers to see if they are using the tools correctly

- Java/JVM is fast. Like every other programming language, developers can misuse/misconfigure it to make it slow

- etc