Elasticsearch is a search and analytics engine for textual, numerical, geospatial and all other sorts of data, both structured and unstructured. It comes with simple REST APIs, it’s fast and scalable, and the best thing - it’s open and free. But to make the most of it in your Python project, you should follow a handful of best practices. You’ll find them in this article.
- Elasticsearch & Python - getting started
- Best tips for Elasticsearch & Python users
- Make the most of Elasticsearch in your Python project
Elasticsearch & Python - getting started
When it comes to integrating Elasticsearch with Python, it's best to compare the results achieved when working with the two most popular frameworks:
- Django is a great framework that excels in having richer libraries that allow you to do more with less effort. However, it is still relatively rigid. A good guide to configuring Elasticsearch with Django can be found here. If you're interested in this topic, I also recommend downloading Django Haystack, a fantastic wrapper for a variety of search engines, including Elastic.
- Flask is another popular micro-framework that compensates for higher overhead with greater flexibility. In this tutorial, you'll learn how to integrate Elasticsearch with Flask.
Regardless of the language or framework used in the application, the Elasticsearch engine runs as an independent service that has to be installed and configured first. Elasticsearch is based on Java, so older releases also need a separately installed JDK, while recent releases bundle their own. After that, integration with Django or Flask is fairly straightforward, and the easiest approach is to use the libraries available for the given framework.
Elasticsearch as a search engine
Elasticsearch is designed to provide a robust full-text search that is fast and easy to use (a keyword search). Embedding techniques are a great tool to capture linguistic information from a text. By indexing embeddings and scoring them based on vector distance, we can compare documents using a similarity concept that goes beyond word-level overlap.
In short, we convert text (words, phrases, paragraphs, or whole texts) into a vector of numbers using neural network-based algorithms. This representation (an embedding, also called a distributed representation) is almost impossible for a human to read. Depending on the goal for which the neural network has been trained, semantically similar texts end up with similar vector representations. At this stage, we can apply mathematical methods to measure the similarity of the vectors representing the texts, and finally build an engine that finds the most similar sentences.
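The vector comparison described above can be sketched in a few lines of plain Python. This is only an illustration: the three-dimensional "embeddings" are made up, and the helper names `cosine_similarity` and `most_similar` are hypothetical. In a real project the vectors would come from a trained model, and Elasticsearch can store them in a `dense_vector` field and score them on the server side.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def most_similar(query_vec, corpus):
    """Return (text, score) pairs sorted from most to least similar."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, invented for illustration only
corpus = {
    "cheap flights to Rome": [0.9, 0.1, 0.0],
    "budget airline tickets": [0.8, 0.2, 0.1],
    "how to bake bread": [0.0, 0.1, 0.9],
}
ranking = most_similar([1.0, 0.0, 0.0], corpus)
```

Cosine similarity is used here because it ignores vector length and compares direction only; other distance measures would fit the same structure.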
This is especially handy if you want to use Elastic to search for comparable documents - by developing several quasi ML methods (you can add your own vector representation - which is a big plus) or introducing metrics to compare vector representations.
GOOD TO KNOW: Traditionally, the vector for a text is based on the number of times each word of the vocabulary occurs in it (an approach usually called Bag of Words). The main difference is density: embedding vectors have between 100 and 1,000 dimensions, compared to the 50,000 or more dimensions of Bag of Words vectors. Be careful, though: since sentence embeddings don't account for linguistic elasticity (synonymy, word order shifts), they are most effective for relatively short paragraphs.
Elasticsearch as a database
NoSQL databases (non-relational stores, often holding loosely structured sets of documents) are designed to retrieve unstructured data, which is useful when, for example, a user searches for information on a web page. Crucially, for this kind of lookup they can be faster than traditional SQL-based databases, which require a full structure/tree of data and its relationships.
Elasticsearch, as a textbook example of a NoSQL database, shines in terms of speed due to its extremely fast algorithms and data management approach. Feed it with a JSON document, and it will make a highly intelligent guess about its type. It does a spectacular job of handling numeric values, Boolean values, and timestamps. Careful though, as you need to modify your schemas to create excellent search and/or analytics. Elasticsearch has an extensive set of powerful built-in tools to help you do this, such as dynamic templates, multi-field objects, and so on - but you need to tweak them to benefit from them.
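As a rough sketch of the tweaking mentioned above, the function below builds a mappings body that combines a dynamic template with a multi-field. The field names (`title`, `published_at`, `views`) and the function name are invented for illustration; the resulting dictionary is what you would pass as the mappings when creating an index with the Python client.

```python
def build_article_mapping():
    """Mappings body with one dynamic template and one multi-field."""
    return {
        # Any string field not listed below is indexed as an exact-match keyword
        "dynamic_templates": [
            {
                "strings_as_keywords": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"},
                }
            }
        ],
        "properties": {
            # Multi-field: full-text search on "title", sorting/aggregating on "title.raw"
            "title": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            },
            "published_at": {"type": "date"},
            "views": {"type": "integer"},
        },
    }


mapping = build_article_mapping()
```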
Although Elasticsearch can function as the only database in a system, this setup has some drawbacks and is generally not recommended. It's mostly used as a search tool that works in tandem with another database. Ideally, that primary database should put a stronger emphasis on constraints, consistency, and resilience, and should be easy to update transactionally. Its master dataset is then sent asynchronously to Elasticsearch, which takes over the search workload.
Also, remember that the entire object graph you want to search must be indexed, so denormalize your documents before indexing. Generally, it’s a good practice to build our mappings and store our documents in document-oriented databases like Elasticsearch so they are optimal for search and retrieval.
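Denormalizing before indexing could look like the sketch below. The data model (orders, customers, products) and the function name `denormalize_order` are assumptions made up for this example; the point is only that every value you want to search on is copied into the document itself.

```python
def denormalize_order(order, customers, products):
    """Flatten an order and the rows it references into one searchable document."""
    customer = customers[order["customer_id"]]
    return {
        "order_id": order["id"],
        # Copy the related customer fields into the document
        "customer": {"name": customer["name"], "city": customer["city"]},
        # Copy each referenced product instead of storing only its id
        "items": [
            {"name": products[pid]["name"], "price": products[pid]["price"]}
            for pid in order["product_ids"]
        ],
    }


customers = {7: {"name": "Ada", "city": "London"}}
products = {1: {"name": "Keyboard", "price": 49.0}, 2: {"name": "Mouse", "price": 19.0}}
order = {"id": 1001, "customer_id": 7, "product_ids": [1, 2]}
document = denormalize_order(order, customers, products)
```

The price of this approach is duplication: when a customer or product changes, every document that embeds it has to be reindexed.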
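Still, you must be aware that older Elasticsearch versions run with no authentication or authorization enabled out of the box (the security features became free in versions 6.8 and 7.1 and are switched on by default from 8.0). On a cluster without security enabled, anyone who can connect should be considered a "superuser" with full freedom to modify and delete the data.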
USE CASE: ELASTICSEARCH FOR MANAGING GEODATA
If you want to work with geographic data, Elasticsearch lets you index it and perform various operations on it. In general, it offers fewer features than PostGIS (a geo extension for the PostgreSQL database), but it is still a fascinating approach.
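A minimal sketch of what indexing and querying geodata involves: a mapping with a `geo_point` field and a `geo_distance` filter that keeps only documents within a radius. The field names (`name`, `location`) and the helper `nearby_query` are assumptions for this example; the dictionaries are plain request bodies that could be sent with the Python client.

```python
# Mapping for a hypothetical "places" index
GEO_MAPPING = {
    "properties": {
        "name": {"type": "text"},
        "location": {"type": "geo_point"},
    }
}


def nearby_query(lat, lon, distance="5km"):
    """Search body matching documents within `distance` of the given point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        "location": {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


body = nearby_query(52.23, 21.01, distance="2km")
```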
Best tips for Elasticsearch & Python users
1. Auto-generated identifiers
Let Elasticsearch auto-generate document identifiers where you can. When it indexes a document with an explicit identifier, it has to check whether a document with the same identifier already exists in the shard. With auto-generated identifiers it skips that check, which makes indexing faster.
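In the Python client, bulk actions are plain dictionaries, and leaving out the `_id` key is what lets Elasticsearch generate the identifier itself. The generator below is a sketch under that assumption; `to_bulk_actions` and its `id_field` parameter are hypothetical names, and the yielded dictionaries follow the action format accepted by `elasticsearch.helpers.bulk`.

```python
def to_bulk_actions(index_name, docs, id_field=None):
    """Yield bulk actions; without id_field, Elasticsearch assigns the ids."""
    for doc in docs:
        action = {"_index": index_name, "_source": doc}
        if id_field is not None:
            # Explicit id: forces an existence check on the shard for every document
            action["_id"] = doc[id_field]
        yield action


docs = [{"sku": "a1", "title": "Keyboard"}, {"sku": "b2", "title": "Mouse"}]
auto_id_actions = list(to_bulk_actions("products", docs))
explicit_id_actions = list(to_bulk_actions("products", docs, id_field="sku"))
```

Auto-generated identifiers fit append-only data such as logs best; if you need to update or delete specific documents later, explicit identifiers are usually worth the extra cost.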
2. Maintain your database
If your main database uses Elasticsearch only as a search engine, you need to guarantee that the data stays synchronized, so that Elasticsearch always reflects the latest state. You don't want to display search results for records that have already been deleted from the database; it's crucial to show only the most up-to-date data.
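One way to keep the two stores in sync is to turn every change in the main database into a matching bulk action. The sketch below assumes a hypothetical change-event format (`op`, `id`, `doc`) that your application would have to produce; the yielded dictionaries use the `_op_type` convention of `elasticsearch.helpers.bulk`.

```python
def sync_actions(index_name, changes):
    """Translate database change events into bulk index/delete actions."""
    for change in changes:
        if change["op"] == "delete":
            # Remove documents whose source rows are gone
            yield {"_op_type": "delete", "_index": index_name, "_id": change["id"]}
        else:
            # Inserts and updates both overwrite the document with the same id
            yield {
                "_op_type": "index",
                "_index": index_name,
                "_id": change["id"],
                "_source": change["doc"],
            }


changes = [
    {"op": "upsert", "id": 1, "doc": {"title": "Keyboard"}},
    {"op": "delete", "id": 2},
]
actions = list(sync_actions("products", changes))
```

Synchronization of this kind needs explicit identifiers taken from the main database, otherwise a later update or delete cannot find the document it refers to.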
3. Use cross-cluster replication
By using two clusters, setting up cross-cluster replication to copy data from one cluster to the other, and directing all searches to the cluster hosting the follower indexes, search activity no longer consumes resources from indexing on the cluster hosting the leader indexes.
4. Use bulk requests
Bulk requests perform far better than index requests that carry a single document each. To determine the appropriate size of a bulk request, benchmark on a single node with a single shard. Try indexing 100 documents at a time first, then 200, 400, and so on, doubling the number of documents in a bulk request in each benchmark run. When the indexing performance reaches a plateau, you have found the appropriate bulk request size for your data.
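The doubling benchmark can be automated along these lines. This is a sketch: `measure_throughput` is a hypothetical callable that you would write yourself to index one bulk request of the given size and return documents per second, and the 5% `min_gain` threshold for detecting the plateau is an arbitrary assumption.

```python
def find_bulk_size(measure_throughput, start=100, max_size=51200, min_gain=0.05):
    """Double the bulk size until throughput stops improving by at least min_gain."""
    best_size = start
    best_rate = measure_throughput(start)
    size = start * 2
    while size <= max_size:
        rate = measure_throughput(size)
        if rate < best_rate * (1 + min_gain):
            break  # plateau reached: the bigger request did not pay off
        best_size, best_rate = size, rate
        size *= 2
    return best_size


# Simulated cluster whose throughput stops growing at 800 documents per request
tried = []


def fake_throughput(size):
    tried.append(size)
    return min(size, 800) * 10


chosen = find_bulk_size(fake_throughput)
```

With a real cluster, it is worth repeating each measurement a few times, since a single run can be noisy.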
5. Disable replicas during the initial load (and swapping)
To speed up indexing, set index.number_of_replicas to 0 if you have a large amount of data to import. Since there are no replicas, the failure of a single node can result in data loss, so the data must also be stored elsewhere so that the initial load can be rerun if there is a problem.
When the initial load is complete, you can reset index.number_of_replicas to its original value. If index.refresh_interval is specified in the index settings, it may help to disable refreshes (by setting it to -1) during the initial load and to set it back to the original value once the load is complete. Also, disabling swapping ensures that the operating system does not swap the memory of the Elasticsearch Java process out to disk, which would badly hurt performance.
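Changing the settings and restoring them afterwards is easy to forget, so it can be wrapped in a context manager. In the sketch below, `put_settings` is a hypothetical callable that applies a settings body to your index (in practice a thin wrapper around the client's index settings API), and `original` holds the values you want back after the load.

```python
from contextlib import contextmanager

# Settings used only while the initial load is running
LOAD_SETTINGS = {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}


@contextmanager
def initial_load(put_settings, original):
    """Switch off replicas and refreshes for the load, then restore the originals."""
    put_settings(LOAD_SETTINGS)
    try:
        yield
    finally:
        # Restore even if the load fails, so the index is not left unprotected
        put_settings({"index": original})


applied = []
with initial_load(applied.append, {"number_of_replicas": 1, "refresh_interval": "1s"}):
    applied.append("bulk load runs here")
```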
6. Set the value for the refresh interval
By default, Elasticsearch refreshes indexes every second, but only those indexes that have received one or more search requests in the last 30 seconds. If you have no or very little search traffic (say, less than one search request every 5 minutes) and want to optimize indexing performance, this default works well for you.
On the other hand, if your index receives many search requests, this default means that Elasticsearch refreshes your index every second. If you can afford a longer delay between the moment a document is indexed and the moment it becomes visible in search, setting index.refresh_interval to a higher value, such as 30s, can help improve indexing performance.
7. Use multiple workers/threads to send data to Elasticsearch
It is unlikely that a single thread sending bulk requests can exhaust the indexing capacity of an Elasticsearch cluster. To take full advantage of the cluster's resources, you should send data from multiple threads or processes. This should also help minimize the cost of each fsync.
Watch for TOO_MANY_REQUESTS (429) response codes (EsRejectedExecutionException in the Java client), which indicate that Elasticsearch cannot keep up with the current indexing rate. If this occurs, pause indexing for a while before trying again, preferably with a randomized exponential backoff.
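Both pieces of advice, several workers and a randomized exponential backoff on 429 responses, can be combined as in the sketch below. It is deliberately independent of any client library: `send` is a hypothetical callable that submits one bulk request, and `TooManyRequests` is a stand-in exception that your `send` would raise when the cluster answers with status 429.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


class TooManyRequests(Exception):
    """Stand-in for the client's HTTP 429 error (hypothetical name)."""


def send_with_backoff(send, batch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Send one batch, retrying on 429 with randomized exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return send(batch)
        except TooManyRequests:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Wait a random time between 0 and base_delay * 2^attempt seconds
            sleep(random.uniform(0, base_delay * 2 ** attempt))


def index_in_parallel(send, batches, workers=4, **backoff_kwargs):
    """Send batches from several threads; results keep the order of `batches`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(send_with_backoff, send, batch, **backoff_kwargs)
            for batch in batches
        ]
        return [future.result() for future in futures]
```

The random component matters: without it, all workers that were rejected at the same moment would retry at the same moment too.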
8. Arrange fields in the same order in documents
Since multiple documents are compressed together in blocks, longer duplicate strings are more likely to be found in the _source of these documents if the fields always occur in the same order.
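A simple way to get a consistent field order from Python is to rebuild each document with sorted keys before sending it, as in this sketch. The helper name `with_stable_field_order` is made up; it relies on Python dictionaries keeping insertion order and on the JSON serializer writing keys in that order.

```python
import json


def with_stable_field_order(doc):
    """Recursively rebuild a document so that its keys are in sorted order."""
    if isinstance(doc, dict):
        return {key: with_stable_field_order(doc[key]) for key in sorted(doc)}
    if isinstance(doc, list):
        return [with_stable_field_order(item) for item in doc]
    return doc


first = {"title": "Keyboard", "price": 49, "meta": {"b": 1, "a": 2}}
second = {"meta": {"a": 2, "b": 1}, "price": 49, "title": "Keyboard"}
same_bytes = json.dumps(with_stable_field_order(first)) == json.dumps(
    with_stable_field_order(second)
)
```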
9. Use best_compression
The _source field and the stored fields can easily take up a significant amount of disk space. With the best_compression codec, they are compressed more aggressively, at the cost of somewhat slower access to stored fields.
10. Use index sorting to colocate similar documents
When Elasticsearch stores _source, it compresses multiple documents at once to improve the overall compression ratio. For example, it is very common for documents to share the same field names and certain field values, especially if the fields have low cardinality or a Zipfian distribution. By default, documents are compressed together in the order in which they are added to the index. If you enable index sorting, they are compressed in sorted order instead. Sorting documents with similar structures, fields, and values next to each other improves the compression ratio.
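Index sorting and the compression codec from the previous tip are both index settings, and index sorting can only be defined when the index is created. The sketch below builds such a settings body; the function name and the `category` sort field are assumptions for illustration, and the sort field needs a type that supports sorting, such as keyword, date, or a numeric type.

```python
def build_index_settings(sort_field, sort_order="asc"):
    """Settings body enabling best_compression and index sorting."""
    return {
        "index": {
            # Compress stored fields more aggressively
            "codec": "best_compression",
            # Store documents sorted by this field so similar ones sit together
            "sort.field": sort_field,
            "sort.order": sort_order,
        }
    }


settings = build_index_settings("category")
```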
11. Use the smallest numeric type that is sufficient
The type of numeric data you use can have a significant impact on disk utilization. In particular, integers should be stored in an integer type (byte, short, integer, or long), and floating-point numbers should be stored in a scaled_float if possible, or in the smallest type that fits the use case: choosing float over double or half_float over float helps save disk space.
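To make the choice concrete, the helper below picks the smallest Elasticsearch integer type for a known value range, and a second helper builds a `scaled_float` mapping for values with a fixed number of decimals. Both function names are hypothetical, and you would need to know the real range of your data up front.

```python
# Elasticsearch integer types as (name, minimum, maximum), smallest first
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]


def smallest_integer_type(min_value, max_value):
    """Return the smallest integer type that can hold the whole range."""
    for name, low, high in INTEGER_TYPES:
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit in a 64-bit integer")


def decimal_field(decimals=2):
    """Mapping for a scaled_float that keeps `decimals` digits after the point."""
    return {"type": "scaled_float", "scaling_factor": 10 ** decimals}


age_type = smallest_integer_type(0, 120)
views_type = smallest_integer_type(0, 5_000_000_000)
price_mapping = decimal_field(2)
```

A price stored with two decimals, for example, is kept internally as a whole number of cents, which compresses better than a floating-point value.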
Make the most of Elasticsearch in your Python project
Elasticsearch is beneficial for many projects, especially large ones with huge amounts of data and documents. However, there is little reason to use it if you need to configure and develop your online application as quickly as possible, or if your main database already has sufficient search capabilities. And if you do choose to use it, remember to follow the best practices for Elasticsearch to improve its speed and performance.
Always consider whether you really need an advanced search engine, and whether a better option is currently available on the market. Remember, a simple database-driven search may be all you need, and it will most likely be the first and most important search component of your program. And as always, keep in mind that "overengineering" the system is the worst thing you can do.
If you can't decide, it's always good to browse the documentation, look for help on StackOverflow or Elastic Discuss, or just ask your question in the comment section below.