How to get a Token from a Lucene TokenStream?
Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:
    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
    TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = termAttribute.term();
    }
Edit: The new way
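According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is more desirable than getAttribute: addAttribute registers the attribute if the stream does not already have it, whereas getAttribute throws an IllegalArgumentException when the attribute is missing.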
    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = charTermAttribute.toString();
    }
This is how it should be (a clean version of Adam's answer):
    TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
    CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
        System.out.println(cattr.toString());
    }
    stream.end();
    stream.close();
For Lucene 7.3.1 (the latest version at the time of writing):
    // Test the tokenizer
    Analyzer testAnalyzer = new CJKAnalyzer();
    String testText = "Test Tokenizer";
    TokenStream ts = testAnalyzer.tokenStream("context", new StringReader(testText));
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    try {
        ts.reset(); // Resets this stream to the beginning. (Required)
        while (ts.incrementToken()) {
            // Use AttributeSource.reflectAsString(boolean)
            // for token stream debugging.
            System.out.println("token: " + ts.reflectAsString(true));
            System.out.println("token start offset: " + offsetAtt.startOffset());
            System.out.println("  token end offset: " + offsetAtt.endOffset());
        }
        ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
    } finally {
        ts.close(); // Release resources associated with this stream.
    }
Reference: https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/package-summary.html
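If you need the terms as a list more than once, the reset / incrementToken / end / close sequence from the answers above can be wrapped in a small helper. The following is only a sketch, assuming Lucene 7.x on the classpath (where StandardAnalyzer ships in lucene-core). The class name TokenCollector, the method name tokenize, and the field name "body" are made up for illustration and are not part of Lucene. It relies on TokenStream being Closeable, so try-with-resources takes care of close().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper, not part of Lucene.
final class TokenCollector {

    // Runs the analyzer over the text and returns the term of every token, in order.
    static List<String> tokenize(Analyzer analyzer, String fieldName, String text) {
        List<String> terms = new ArrayList<>();
        // try-with-resources calls stream.close() even if analysis fails.
        try (TokenStream stream = analyzer.tokenStream(fieldName, text)) {
            CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                     // required before the first incrementToken()
            while (stream.incrementToken()) {
                terms.add(termAttr.toString()); // copy: the attribute is reused per token
            }
            stream.end();                       // end-of-stream operations (final offset)
        } catch (IOException e) {
            // Assumption for this sketch: callers prefer an unchecked exception.
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    public static void main(String[] args) {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            System.out.println(tokenize(analyzer, "body", "Hello Lucene World"));
        }
    }
}
```

The term is copied with toString() inside the loop because the stream reuses the same attribute instance for every token, so holding on to the attribute itself would only ever show the last term.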