How to implement complex full-text search with Hibernate Search
- by 7wData
This is the second part of the Full-Text Search with Hibernate Search series. In the first part, I showed you how to add Hibernate Search to your project and how to perform a very basic full-text query that returned all entities containing a set of words. That query already returned a much better result than the typical SQL or JPQL query with a WHERE message LIKE :searchTerm clause.
But you can do a lot more than that with Hibernate Search. It gives you an easy way to use Lucene’s analyzers to process the indexed Strings and to find texts that use different word forms or even synonyms of your search terms.
Let’s have a quick look at the general structure of an analyzer before I show you how to create one with Hibernate Search. It consists of 3 phases, and each of them can perform multiple steps. The CharFilter adds, removes or replaces certain characters in the text before it gets split. This is often used to normalize special characters like ñ or ß. The Tokenizer splits the text into multiple words, the so-called tokens. The Filter adds, removes or replaces specific tokens.
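To make the three phases more concrete, here is a small plain-Java sketch of such a pipeline. It does not use Lucene or Hibernate Search at all; the class and method names are made up for this illustration, and each phase is reduced to a single, simple step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: hypothetical names, no Lucene classes involved.
class AnalyzerPipelineSketch {

    // Phase 1, character filter: replaces special characters before tokenizing.
    static String charFilter(String text) {
        return text.replace("\u00DF", "ss")   // sharp s
                   .replace("\u00F1", "n");   // n with tilde
    }

    // Phase 2, tokenizer: splits on everything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Phase 3, token filter: here it only lower-cases each token.
    static List<String> tokenFilter(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(token.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    // Runs the three phases in order.
    static List<String> analyze(String text) {
        return tokenFilter(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        // prints [senor, garcia, strasse, 5]
        System.out.println(analyze("Se\u00F1or Garcia, Stra\u00DFe 5"));
    }
}
```

A real analyzer chains Lucene components in the same order, and each phase can contain several of them.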
The separation into 3 phases and multiple steps allows you to create very complex analyzers based on a set of small, reusable components. I will use it in this post to extend the example from the previous post so that I get the same results when I search for “validate Hibernate”, “Hibernate validation” and “HIBERNATE VALIDATION”.
That requires the search to handle words in upper and lower case in the same way and to recognize that “validate” and “validation” are two different forms of the same word. The first part is simple, and you could achieve it with a plain SQL query. But the second one is something you can’t do easily in SQL. It is a common full-text search requirement which you can achieve with a technique called stemming. It reduces the words in the index and in the search query to their basic form.
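The effect of lower-casing plus stemming can be sketched in a few lines of plain Java. This is a toy suffix stripper, not the stemming algorithm Lucene uses; the suffix list and all names are assumptions chosen only so that the three example queries end up with the same terms.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy stemmer, not a real stemming algorithm.
class StemmingSketch {

    // Hypothetical suffix list, picked just for this example.
    private static final List<String> SUFFIXES = List.of("ation", "ate", "ing", "s");

    // Strips the first matching suffix but keeps at least three characters.
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Lower-cases and stems every word; the sorted set ignores word order.
    static Set<String> normalize(String query) {
        Set<String> terms = new TreeSet<>();
        for (String word : query.trim().split("\\s+")) {
            terms.add(stem(word.toLowerCase(Locale.ROOT)));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Each call prints [hibern, valid]
        System.out.println(normalize("validate Hibernate"));
        System.out.println(normalize("Hibernate validation"));
        System.out.println(normalize("HIBERNATE VALIDATION"));
    }
}
```

The stems don’t need to be real words. It only matters that the index and the query are reduced in the same way, so that their terms match.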
OK, let’s define an analyzer that ignores the difference between upper and lower case and that uses stemming.
As you can see in the following code snippet, you can do that with an @AnalyzerDef annotation, and it’s not too complicated.
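A definition along these lines would match that description. Treat it as a sketch under assumptions: it targets the annotation-based API of Hibernate Search 5.x with Lucene’s analysis factories on the classpath, the choice of the standard tokenizer and the Snowball stemmer for English is mine, and the Tweet entity with its message attribute is a hypothetical stand-in for the entity from the previous post.

```java
import javax.persistence.Entity;
import javax.persistence.Id;

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.snowball.SnowballPorterFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

@AnalyzerDef(
    name = "textanalyzer",
    // Assumption: split the text into words with Lucene's standard tokenizer.
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        // Ignore the difference between upper and lower case.
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        // Assumption: stem English words with the Snowball stemmer.
        @TokenFilterDef(factory = SnowballPorterFilterFactory.class,
            params = { @Parameter(name = "language", value = "English") })
    }
)
@Entity
@Indexed
public class Tweet { // hypothetical entity

    @Id
    private Long id;

    // References the analyzer definition by its name.
    @Field(analyzer = @Analyzer(definition = "textanalyzer"))
    private String message;
}
```

Because the lower-case filter runs before the stemmer, the stemmer only ever sees lower-case tokens.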
The analyzer definition is global, and you can reference it by its name. So make sure to use an expressive name that you can easily remember. I chose the name textanalyzer in this example because I define a generic analyzer for text messages. It’s a good fit for most simple text attributes.
This example doesn’t require any character normalization or any other form of character filtering. The analyzer, therefore, doesn’t need any CharFilter.