How to get a Token from a Lucene TokenStream?


Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:

    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
    TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = termAttribute.term();
    }

Edit: The new way

According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is more desirable than getAttribute.

    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = charTermAttribute.toString();
    }
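
On why addAttribute is the safer call: as far as I understand the Lucene javadocs, addAttribute returns the already-registered instance or creates one if it is missing, while getAttribute is a pure lookup that comes back empty-handed when the stream never registered that attribute (an exception in older releases, null in newer ones, if I recall correctly). The sketch below is a toy model of that contract in plain Java, not Lucene's code: `ToyAttributeSource` and the `factory` parameter are made up for illustration (the real addAttribute takes only the class).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy model only: NOT Lucene's AttributeSource, just the add-vs-get contract.
class ToyAttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    /** Returns the registered instance, creating and registering one if absent. */
    <T> T addAttribute(Class<T> type, Supplier<T> factory) {
        Object existing = attributes.get(type);
        if (existing == null) {
            existing = factory.get();
            attributes.put(type, existing);
        }
        return type.cast(existing);
    }

    /** Pure lookup: returns null when nothing was registered for this type. */
    <T> T getAttribute(Class<T> type) {
        return type.cast(attributes.get(type));
    }

    public static void main(String[] args) {
        ToyAttributeSource source = new ToyAttributeSource();
        // Nothing registered yet, so the lookup finds nothing.
        System.out.println(source.getAttribute(StringBuilder.class) == null);
        // addAttribute creates the instance once and then keeps returning it.
        StringBuilder first = source.addAttribute(StringBuilder.class, StringBuilder::new);
        StringBuilder second = source.addAttribute(StringBuilder.class, StringBuilder::new);
        System.out.println(first == second);
    }
}
```

So a consumer that always calls addAttribute never has to care whether the analyzer chain happens to produce that attribute.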


This is how it should be (a clean version of Adam's answer):

    TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
    CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
        System.out.println(cattr.toString());
    }
    stream.end();
    stream.close();


For Lucene 7.3.1 (the latest version when this was written):

    // Test the tokenizer
    Analyzer testAnalyzer = new CJKAnalyzer();
    String testText = "Test Tokenizer";
    TokenStream ts = testAnalyzer.tokenStream("context", new StringReader(testText));
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    try {
        ts.reset(); // Resets this stream to the beginning. (Required)
        while (ts.incrementToken()) {
            // Use AttributeSource.reflectAsString(boolean)
            // for token stream debugging.
            System.out.println("token: " + ts.reflectAsString(true));
            System.out.println("token start offset: " + offsetAtt.startOffset());
            System.out.println("  token end offset: " + offsetAtt.endOffset());
        }
        ts.end();   // Perform end-of-stream operations, e.g. set the final offset.
    } finally {
        ts.close(); // Release resources associated with this stream.
    }

Reference: https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/package-summary.html
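
A note on what the offsets mean, since they come up in several of the snippets: startOffset is the index of the token's first character in the original text and endOffset is one past its last character, so `text.substring(start, end)` gives back the source characters. Here is a minimal sketch of that in plain Java. `OffsetSketch` and `whitespaceOffsets` are hypothetical names, and the whitespace splitter is only a stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an analyzer: splits on whitespace only and
// records offsets using the start-inclusive, end-exclusive convention.
class OffsetSketch {

    /** Returns one {startOffset, endOffset} pair per whitespace-separated token. */
    static List<int[]> whitespaceOffsets(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                spans.add(new int[] {start, i}); // end offset is exclusive
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Test Tokenizer";
        for (int[] span : whitespaceOffsets(text)) {
            // substring(start, end) recovers the token's original characters.
            System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
        }
    }
}
```

With a real analyzer the term text may differ from that slice (lowercased, stemmed and so on), but the offsets still point into the original input, which is what you want for things like highlighting.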