Wednesday 10 October 2018

Lucene and Hyphen

Based on Lucene 4.7.

Lucene is a high-performance text search engine built around an inverted index, so I used it for full-text search in my application. After the integration, searches became faster, results became more accurate, and everything worked very smoothly. Well, that lasted until a problem showed up in the search results. Oh, that doesn't mean Lucene is buggy!

Let me explain the scenario first. One day a client raised an issue against the Lucene search implementation: "a search string with a hyphen (-) is not giving relevant results". The correct interpretation of hyphens in the context of word boundaries is challenging, though.


Suppose I have the string "sample-text" to be indexed. StandardTokenizer (used by StandardAnalyzer for tokenizing) splits "sample-text" into "sample" and "text" and indexes the two words separately instead of indexing it as a single whole word. It is quite common for separate words to be connected with a hyphen: “up-to-date,” “sugar-free,” “good-looking,” and so on. A significant number are hyphenated names, such as “John-Paul.” When doing a whole-word search or query, users expect to find the words within those hyphenated forms. There are some cases where the parts are not separate words, but overall it is better to keep the hyphen out of the default word definition, and that is why Lucene does so.

After reading about the behaviour of StandardAnalyzer, I realized that this is its expected behaviour. But since Lucene is an open source library, I decided to modify the tokenizer logic so that it does not split on the hyphen.

The source of the StandardTokenizer class that had to be modified is generated by JFlex, and I was completely unaware of that. After reading some interesting material on JFlex, I was confident enough to do some R&D on StandardTokenizerImpl.jflex.

First of all, download JFlex and install it according to your operating system.

If you are on Windows, do the following after downloading:

1. Unzip the file into the directory you want, then go to the bin folder and set JAVA_HOME in jflex.bat:
set JAVA_HOME=C:\Program Files\Java\jdk1.7.0_67
(Set the path according to your Java version.)

2. Also set JFLEX_HOME to the directory into which you unzipped the file.

3. Then run jflex.bat, where you can give your input file and generate the Java file.

OK, let's come to the main topic. To support the hyphen in StandardAnalyzer, we have to make a change in SUPPLEMENTARY.jflex-macro.

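MidLetter is the class of characters that do not break a word when they appear between two letters. According to the Unicode word boundary rules (Unicode Standard Annex #29), some or all of the following characters may be tailored to be in MidLetter, depending on the environment: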

  • U+002D ( - ) HYPHEN-MINUS
  • U+055A ( ՚ ) ARMENIAN APOSTROPHE
  • U+058A ( ֊ ) ARMENIAN HYPHEN
  • U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
  • U+1806 ( ᠆ ) MONGOLIAN TODO SOFT HYPHEN
  • U+2010 ( ‐ ) HYPHEN
  • U+2011 ( ‑ ) NON-BREAKING HYPHEN
  • U+201B ( ‛ ) SINGLE HIGH-REVERSED-9 QUOTATION MARK
  • U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
  • U+30FB ( ・ ) KATAKANA MIDDLE DOT
  • U+FE63 ( ﹣ ) SMALL HYPHEN-MINUS
  • U+FF0D ( － ) FULLWIDTH HYPHEN-MINUS


In UnicodeSet notation, this is: [\u002D\uFF0D\uFE63\u058A\u1806\u2010\u2011\u30A0\u30FB\u201B\u055A\u0F0B]

You can look up the Unicode code point of a hyphen character with an online Unicode converter.
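
If you would rather check locally which hyphen a string really contains, a few lines of plain Java can print the code point and its Unicode name. This is only a small sketch: the class name CodePointInfo and the method describe are made up for this example and are not part of Lucene or JFlex. It relies on Character.getName, which is available from Java 7.

```java
class CodePointInfo {

    /** Formats a code point as "U+XXXX NAME" (hypothetical helper for this post). */
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        String sample = "sample-text";
        // Walk the string by code point and report every character
        // that is neither a letter nor a digit.
        for (int i = 0; i < sample.length(); ) {
            int cp = sample.codePointAt(i);
            if (!Character.isLetterOrDigit(cp)) {
                System.out.println(describe(cp)); // prints "U+002D HYPHEN-MINUS"
            }
            i += Character.charCount(cp);
        }
    }
}
```

Pasting a string copied from a document or a web page into sample is a quick way to find out whether it holds a plain HYPHEN-MINUS or one of the other hyphens from the list.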

In my case I wanted to avoid tokenization on HYPHEN-MINUS, so I added the line below to SUPPLEMENTARY.jflex-macro.


    MidLetterSupp = ( [\u002D] )
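
To get a feel for what this tailoring changes, here is a heavily simplified model of the MidLetter idea in plain Java. It is not Lucene's generated scanner and it ignores most of the real word boundary rules. The class name MidLetterSketch and the method tokenize are hypothetical. The only rule it models is that a character from the MidLetter set stays inside a token when there is a letter directly on both sides of it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class MidLetterSketch {

    /**
     * Splits text into runs of letters and digits. A code point from
     * midLetters is kept inside a token only when a letter sits directly
     * before it and directly after it; otherwise it ends the token.
     */
    static List<String> tokenize(String text, Set<Integer> midLetters) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int next = i + Character.charCount(cp);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (midLetters.contains(cp)
                    && current.length() > 0
                    && Character.isLetter(current.codePointBefore(current.length()))
                    && next < text.length()
                    && Character.isLetter(text.codePointAt(next))) {
                // MidLetter between two letters: no word break here.
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // Any other character closes the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i = next;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<Integer> none = Collections.emptySet();            // default: hyphen breaks words
        Set<Integer> hyphen = Collections.singleton(0x002D);   // tailored: hyphen is MidLetter
        System.out.println(tokenize("sample-text up-to-date", none));   // [sample, text, up, to, date]
        System.out.println(tokenize("sample-text up-to-date", hyphen)); // [sample-text, up-to-date]
    }
}
```

In this model a trailing hyphen ("sample-") or a doubled one ("a--b") still ends the token, because the hyphen is only kept when letters surround it.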


The StandardTokenizerImpl.jflex file internally includes the SUPPLEMENTARY.jflex-macro file.
To generate StandardTokenizerImpl.java, give StandardTokenizerImpl.jflex to the JFlex engine as the lexical specification and click Generate.

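Once the generated StandardTokenizerImpl.java is in use, rebuild the index. The existing index entries were tokenized with the old rules, so they have to be re-indexed before hyphenated terms can match.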

And it's a miracle! It worked as I wanted. Exactly as I wanted. My test cases passed with flying colors and the clients were satisfied with the new results. Great!