This article is part of the JPA & Spring pitfalls series.


Preface

  • What can go wrong with repository.findAll()?
  • Is there anything special about using pagination within a transaction?
  • Can JPA’s first-level cache visibly influence the behavior of my code?

If you want to know the answers, read this article. 


We’ll start with the initial task scenario and gradually extend it, showing what can go wrong and how to fix it, until we finally arrive at the right solution.

The code is based on Spring 5 and JPA 2.1.

Disclaimer: In some parts of the code I decided against the concise Java 8 streaming notation, because I think that the “traditional” notation makes the described pitfalls easier to spot and understand for a wider audience.

Task no. 1

Our application stores a vast number of articles and scientific papers in a database. Each article that we store consists of the original PDF file (which may be up to 10 MB, because some articles are not text but scanned documents), its text content (as a string), information about the language, and the date of the most recent review.

Our task is to implement a method that accepts one parameter, a filtering predicate, and updates the review date of all articles matching that predicate. We’re assuming that the filtering operation is non-trivial and must take place on the Java side.

Initial approach

Our initial solution could be as easy as the task sounds:
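A minimal sketch of what this naive version could look like (the names ArticleRepository, Article, updateReviewDate and setReviewDate are my assumptions, not the original code):

    import java.time.LocalDate;
    import java.util.List;
    import java.util.function.Predicate;

    import org.springframework.stereotype.Service;
    import org.springframework.transaction.annotation.Transactional;

    @Service
    public class ArticleService {

        private final ArticleRepository articleRepository; // Spring Data JPA repository

        public ArticleService(ArticleRepository articleRepository) {
            this.articleRepository = articleRepository;
        }

        @Transactional
        public void updateReviewDate(Predicate<Article> filter) {
            // Fetch ALL articles into memory at once
            List<Article> articles = articleRepository.findAll();
            for (Article article : articles) {
                if (filter.test(article)) {
                    // The entity is managed, so the change will be flushed on commit
                    article.setReviewDate(LocalDate.now());
                }
            }
        }
    }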

After running a few unit and integration tests, everything works as expected. On the staging environment it also looks good.

Deploying to production

However, after the first run on the production environment, we’re getting a very unpleasant OutOfMemoryError.

The first problem is that the article data sets used on the local testing and staging environments were apparently quite different from (significantly smaller than) the production database. That is why we were able to retrieve all the articles at once and keep them in memory there. We can’t do this with the articles from the production database, because their total size already exceeds our memory resources and will keep growing over time.

Batching

Alright, we learnt the lesson and came up with a basic JPA pagination solution:
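A sketch of the batched variant (Page, Pageable and PageRequest come from org.springframework.data.domain; BATCH_SIZE and the remaining names are, as before, my assumptions):

    private static final int BATCH_SIZE = 50; // a page size considered safe, picked for illustration

    @Transactional
    public void updateReviewDate(Predicate<Article> filter) {
        Pageable pageable = PageRequest.of(0, BATCH_SIZE);
        Page<Article> page;
        do {
            page = articleRepository.findAll(pageable);
            for (Article article : page.getContent()) {
                if (filter.test(article)) {
                    article.setReviewDate(LocalDate.now());
                }
            }
            pageable = pageable.next(); // advance to the next batch
        } while (page.hasNext());
    }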

We are now fetching and processing articles in batches of a safe size, and after each iteration the entity objects may safely be garbage collected when memory usage is high. Is that right? Not really.

First-level caching

When testing the above code against the production database, we would see that the problem remains exactly the same. This is because all entities retrieved during a single transaction are in the managed state and are stored together in the JPA first-level cache until that transaction ends. Therefore, even though the code suggests otherwise, we’re still trying to keep all the articles in memory, because we’re retrieving them within the boundaries of the same transaction.

One option could be splitting the fetching and processing into separate transactions, but then we would have transactional guarantees only per retrieved page, which usually isn’t acceptable.

The solution here is to add entityManager.flush() and entityManager.clear() before fetching the next batch of entities. This is one of the very rare cases where calling these methods manually is justified and makes complete sense.
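Applied to our loop, it could look like this (still a sketch; the EntityManager is injected with @PersistenceContext):

    @PersistenceContext
    private EntityManager entityManager;

    @Transactional
    public void updateReviewDate(Predicate<Article> filter) {
        Pageable pageable = PageRequest.of(0, BATCH_SIZE);
        Page<Article> page;
        do {
            page = articleRepository.findAll(pageable);
            for (Article article : page.getContent()) {
                if (filter.test(article)) {
                    article.setReviewDate(LocalDate.now());
                }
            }
            entityManager.flush(); // send pending UPDATEs to the DB (still uncommitted)
            entityManager.clear(); // evict the first-level cache; entities become detached
            pageable = pageable.next();
        } while (page.hasNext());
    }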

By calling entityManager.flush(), we flush all the registered changes made to the entities that are currently in the managed state. However, flushing the changes to the database doesn’t mean that they are automatically committed. They are sent to the database, but remain bound to the ongoing transaction, so they wait (now on the DB side) until the transaction is committed.

On the other hand, entityManager.clear() clears the persistence context, which means that the first-level cache is evicted and all previously managed entities become detached.

As a result, references to the retrieved entities are no longer held in the JPA first-level cache between iterations, so the entities may be garbage collected when we’re in danger of an OutOfMemoryError.

Task no. 2

Let’s suppose that the next thing we want this method to do is not only update the review date of all matching articles, but also translate them. For this purpose we can leverage an existing method of TranslationService (public String translate(String text, Language languageFrom, Language languageTo)), which translates the given text in varying time, usually between 30 and 180 seconds.

The first thought is to just add a line with translationService.translate(), update the article’s text field and move on:
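The modified loop body could then look roughly like this (translationService and translationLanguage are assumed to be available in ArticleService):

    for (Article article : page.getContent()) {
        if (filter.test(article)) {
            article.setReviewDate(LocalDate.now());
            // Each call may take 30-180 seconds, and we are still inside a transaction
            String translated = translationService.translate(
                    article.getText(), article.getLanguage(), translationLanguage);
            article.setText(translated);
        }
    }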

Timeouts & splitting transactions

Executing such a time-consuming operation will at the very least force the user to wait ages for a response (if we’re in a REST context), but here it will also cause an exception and a rollback, because we’re in a transactional context, which is always limited by a timeout.

By default, the timeout setting depends on the underlying transaction system. We can control it manually using the @Transactional(timeout = …) attribute.
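For example (the value is expressed in seconds and is purely illustrative):

    @Transactional(timeout = 120) // roll back if the transaction takes longer than 120 s
    public void updateReviewDate(Predicate<Article> filter) { /* ... */ }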

In this case, however, it wouldn’t work for us: we can’t predict how much time the processing may take, and since it could easily be a matter of minutes or hours, it would be silly to let this single transaction lock the database for so long.

Another way could be splitting this transaction into small, independent transactions with a reasonable timeout. But this way we would lose the “all or none” transactional behaviour for updating the review date, and the user would still need to wait ages for the response.

A solution which may turn out to be the most suitable here is asynchronous execution. The “all or none” behaviour for the review date would be sustained, the user of the method would get a response quickly, and the database wouldn’t be blocked by excessively extended timeouts. However, this way we don’t have “all or none” behaviour for the translation operations (the articles get translated gradually), and getting a response no longer means that all the articles have been processed, only that this work has been scheduled.

Usually, there is no single perfect solution and each has its pros & cons, so you need to choose depending on the situation. In our current scenario asynchronous processing sounds best.

There are plenty of ways and advanced tools in Java to implement asynchronous processing. One that is easy to use in Spring is the @Async annotation. However, we need to be careful, because there are certain rules about how and when a method with @Async will work (quite similar to the rules that apply to @Transactional; see Pitfall #2 of this article). In particular, we still need to remember that:

  • Transactional context is not propagated to methods annotated with @Async
  • Methods with @Async should be public and placed in a different bean than the caller bean
  • It’s better to pass IDs or DTOs across transaction boundaries than to pass entities

So in the end, we create a new translateAsynchronously() method in the TranslationService and replace the call to translationService.translate() in the ArticleService with translationService.translateAsynchronously(article.getId(), translationLanguage).
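A sketch of how this could look; the body of translateAsynchronously() is my assumption, and note that @Async additionally requires @EnableAsync on a configuration class:

    @Service
    public class TranslationService {

        @PersistenceContext
        private EntityManager entityManager;

        @Async
        @Transactional // starts its own, new transaction; the caller's context is NOT propagated
        public void translateAsynchronously(Long articleId, Language translationLanguage) {
            // We passed the ID across the transaction boundary, so we re-fetch a managed entity here
            Article article = entityManager.find(Article.class, articleId);
            String translated = translate(
                    article.getText(), article.getLanguage(), translationLanguage);
            article.setText(translated); // flushed when this new transaction commits
        }

        public String translate(String text, Language languageFrom, Language languageTo) {
            // the existing long-running (30-180 s) translation; implementation not shown here
            throw new UnsupportedOperationException("omitted");
        }
    }

Note that TranslationService is a different bean than the calling ArticleService, and that we pass the article’s ID rather than the entity itself, in line with the rules listed above.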

Essence in a nutshell

  • Before deploying to production, always test your code against a realistic, production-sized data set
  • When possible, avoid fetching all table records at once and filter rows on the database side
  • When retrieving a potentially large number of rows, use pagination
  • Remember about the JPA first-level cache when using pagination within a transaction
  • Avoid executing time-consuming operations within transactions, to prevent timeouts and unnecessary DB locking
  • Transactional context isn’t propagated to methods with @Async

Thanks for reading and stay tuned!
