This article is part of the JPA & Spring pitfalls series.


Preface

  • What can go wrong with repository.findAll()?
  • Is there anything special about using pagination within a transaction?
  • Can JPA’s first-level cache visibly influence the behavior of my code?

If you want to know the answers, read this article. 


We’ll start with the initial task scenario and gradually extend it, showing what can go wrong and how to fix it, until we finally arrive at the right solution.

The code is based on Spring 5 and JPA 2.1.

Disclaimer: In some parts of the code I decided against the concise Java 8 streaming notation, because I think that the “traditional” notation makes the described pitfalls easier to spot and understand for a wider audience.

Task no. 1

Our application stores a vast number of articles and scientific papers in a database. Each article that we store consists of the original PDF file (which may be up to 10 MB, because some articles are not text but scanned documents), its text content (as a string), information about the language, and the date of the most recent review.

Our task is to implement a method that accepts one parameter, a filtering predicate, and updates the review date of all articles matching that predicate. We’re assuming that the filtering operation is non-trivial and must take place on the Java side.

Initial approach

Our initial solution could be as easy as the task sounds:
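A minimal sketch of what this naive version could look like (the names ArticleRepository, Article, updateReviewDate and setReviewDate are my assumptions, not the original code):

    import java.time.LocalDate;
    import java.util.List;
    import java.util.function.Predicate;

    import org.springframework.stereotype.Service;
    import org.springframework.transaction.annotation.Transactional;

    @Service
    public class ArticleService {

        private final ArticleRepository articleRepository; // Spring Data JPA repository

        public ArticleService(ArticleRepository articleRepository) {
            this.articleRepository = articleRepository;
        }

        @Transactional
        public void updateReviewDate(Predicate<Article> filter) {
            // Fetch ALL articles into memory at once
            List<Article> articles = articleRepository.findAll();
            for (Article article : articles) {
                if (filter.test(article)) {
                    // The entity is managed, so the change will be flushed on commit
                    article.setReviewDate(LocalDate.now());
                }
            }
        }
    }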

After running a few unit and integration tests, everything works as expected. On the staging environment it also looks good.

Deploying to production

However, after the first run on the production environment, we’re getting a very unpleasant OutOfMemoryError.

The first problem is that the article data sets used on the local testing and staging environments were apparently quite different from (significantly smaller than) the production database. That is why we were able to retrieve all the articles at once and keep them in memory there. We can’t do this with the articles from the production database, because their total size already exceeds our memory resources and will keep growing over time.

Batching

Alright, we learnt the lesson and came up with a basic JPA pagination solution:
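A sketch of the batched variant (Page, Pageable and PageRequest come from org.springframework.data.domain; BATCH_SIZE and the remaining names are, as before, my assumptions):

    private static final int BATCH_SIZE = 50; // a page size considered safe, picked for illustration

    @Transactional
    public void updateReviewDate(Predicate<Article> filter) {
        Pageable pageable = PageRequest.of(0, BATCH_SIZE);
        Page<Article> page;
        do {
            page = articleRepository.findAll(pageable);
            for (Article article : page.getContent()) {
                if (filter.test(article)) {
                    article.setReviewDate(LocalDate.now());
                }
            }
            pageable = pageable.next(); // advance to the next batch
        } while (page.hasNext());
    }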

We are now fetching and processing articles in batches of a safe size, and after each iteration the entity objects may safely be garbage collected when memory usage is high. Is that right? Not really.

First-level caching

When testing the above code against the production database, we would see that the problem remains exactly the same. This is because all entities retrieved during a single transaction are in the managed state and are stored together in the JPA first-level cache until that transaction ends. Therefore, even though the code suggests otherwise, we’re still trying to keep all the articles in memory, because we’re retrieving them within the boundaries of the same transaction.

One option could be splitting the fetching and processing into separate transactions, but then we would have transactional guarantees only per retrieved page, which usually isn’t acceptable.

The solution here is to add entityManager.flush() and entityManager.clear() before fetching the next batch of entities. This is one of the very rare cases where calling these methods manually is justified and makes complete sense.
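Applied to our loop, it could look like this (still a sketch; the EntityManager is injected with @PersistenceContext):

    @PersistenceContext
    private EntityManager entityManager;

    @Transactional
    public void updateReviewDate(Predicate<Article> filter) {
        Pageable pageable = PageRequest.of(0, BATCH_SIZE);
        Page<Article> page;
        do {
            page = articleRepository.findAll(pageable);
            for (Article article : page.getContent()) {
                if (filter.test(article)) {
                    article.setReviewDate(LocalDate.now());
                }
            }
            entityManager.flush(); // send pending UPDATEs to the DB (still uncommitted)
            entityManager.clear(); // evict the first-level cache; entities become detached
            pageable = pageable.next();
        } while (page.hasNext());
    }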

By calling entityManager.flush(), we flush all the registered changes made to the entities that are currently in the managed state. However, flushing the changes to the database doesn’t mean that they are automatically committed. They are sent to the database, but remain bound to the ongoing transaction, so they wait (now on the DB side) until the transaction is committed.

On the other hand, entityManager.clear() clears the persistence context, which means that the first-level cache is evicted and all previously managed entities become detached.

As a result, references to the retrieved entities are no longer held in the JPA first-level cache between iterations, so the entities may be garbage collected when we’re in danger of an OutOfMemoryError.

Task no. 2

Let’s suppose that the next thing we want this method to do is not only update the review date of all matching articles, but also translate them. For this purpose we can leverage an existing method of TranslationService (public String translate(String text, Language languageFrom, Language languageTo)), which translates the given text in varying time, usually between 30 and 180 seconds.

The first thought is to just add a line with translationService.translate(), update the article’s text field and move on:
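The modified loop body could then look roughly like this (translationService and translationLanguage are assumed to be available in ArticleService):

    for (Article article : page.getContent()) {
        if (filter.test(article)) {
            article.setReviewDate(LocalDate.now());
            // Each call may take 30-180 seconds, and we are still inside a transaction
            String translated = translationService.translate(
                    article.getText(), article.getLanguage(), translationLanguage);
            article.setText(translated);
        }
    }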

Timeouts & splitting transactions

Executing such a time-consuming operation will at the very least force the user to wait ages for a response (if we’re in a REST context), but here it will also cause an exception and a rollback, because we’re in a transactional context, which is always limited by a timeout.

By default, the timeout setting depends on the underlying transaction system. We can control it manually using the @Transactional(timeout = …) attribute.
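For example (the value is expressed in seconds and is purely illustrative):

    @Transactional(timeout = 120) // roll back if the transaction takes longer than 120 s
    public void updateReviewDate(Predicate<Article> filter) { /* ... */ }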

In this case, however, it wouldn’t work for us: we can’t predict how much time the processing may take, and since it could easily be a matter of minutes or hours, it would be silly to let this single transaction lock the database for so long.

Another way could be splitting this transaction into small, independent transactions with a reasonable timeout. But this way we would lose the “all or none” transactional behaviour for updating the review date, and the user would still need to wait ages for the response.

A solution which may turn out to be the most suitable here is asynchronous execution. The “all or none” behaviour for the review date would be sustained, the user of the method would get a response quickly, and the database wouldn’t be blocked by excessively extended timeouts. However, this way we don’t have “all or none” behaviour for the translation operations (the articles get translated gradually), and getting a response no longer means that all the articles have been processed, only that this work has been scheduled.

Usually, there is no single perfect solution and each has its pros & cons, so you need to choose depending on the situation. In our current scenario asynchronous processing sounds best.

There are plenty of ways and advanced tools in Java to implement asynchronous processing. One that is easy to use in Spring is the @Async annotation. However, we need to be careful, because there are certain rules about how and when a method with @Async will work (quite similar to the rules that apply to @Transactional; see Pitfall #2 of this article). In particular, we still need to remember that:

  • Transactional context is not propagated to methods annotated with @Async
  • Methods with @Async should be public and placed in a different bean than the caller bean
  • It’s better to pass IDs or DTOs across transaction boundaries than to pass entities

So in the end, we create a new translateAsynchronously() method in the TranslationService and replace the call to translationService.translate() in the ArticleService with translationService.translateAsynchronously(article.getId(), translationLanguage).
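A sketch of how this could look; the body of translateAsynchronously() is my assumption, and note that @Async additionally requires @EnableAsync on a configuration class:

    @Service
    public class TranslationService {

        @PersistenceContext
        private EntityManager entityManager;

        @Async
        @Transactional // starts its own, new transaction; the caller's context is NOT propagated
        public void translateAsynchronously(Long articleId, Language translationLanguage) {
            // We passed the ID across the transaction boundary, so we re-fetch a managed entity here
            Article article = entityManager.find(Article.class, articleId);
            String translated = translate(
                    article.getText(), article.getLanguage(), translationLanguage);
            article.setText(translated); // flushed when this new transaction commits
        }

        public String translate(String text, Language languageFrom, Language languageTo) {
            // the existing long-running (30-180 s) translation; implementation not shown here
            throw new UnsupportedOperationException("omitted");
        }
    }

Note that TranslationService is a different bean than the calling ArticleService, and that we pass the article’s ID rather than the entity itself, in line with the rules listed above.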

Essence in a nutshell

  • Before deploying to production, always test your code against a realistic, production-sized data set
  • When possible, avoid fetching all table records at once and filter rows on the database side
  • When retrieving a potentially large number of rows, use pagination
  • Remember about the JPA first-level cache when using pagination within a transaction
  • Avoid executing time-consuming operations within transactions, to prevent timeouts and unnecessary DB locking
  • Transactional context isn’t propagated to methods with @Async

Thanks for reading and stay tuned!
