This package contains a library, TokensRegex, for matching regular expressions over tokens. TokensRegex is incorporated into the {@link edu.stanford.nlp.pipeline.TokensRegexAnnotator} and {@link edu.stanford.nlp.pipeline.TokensRegexNERAnnotator}.
{@link edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor} and {@link edu.stanford.nlp.ling.tokensregex.SequenceMatchRules} describe the rule language and how extraction rules are created.
At the core of TokensRegex are the
{@link edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher} and
{@link edu.stanford.nlp.ling.tokensregex.TokenSequencePattern} classes, which
can be used to match patterns over sequences of tokens.
The usage is designed to follow the paradigm of the Java regular expression library java.util.regex. It is similar, except that matching is done over a List<CoreMap> instead of over a String.
List<CoreLabel> tokens = ...;
TokenSequencePattern pattern = TokenSequencePattern.compile(...);
TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
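For comparison, the java.util.regex idiom that the snippet above mirrors can be sketched as follows. The class name RegexParadigmExample and the helper findAll are hypothetical and exist only for this illustration. In TokensRegex, TokenSequencePattern stands in for Pattern and TokenSequenceMatcher for Matcher, with a list of tokens taking the place of the String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of TokensRegex.
class RegexParadigmExample {

  /** Returns every substring of text matched by regex, in left-to-right order. */
  static List<String> findAll(String regex, String text) {
    Pattern pattern = Pattern.compile(regex);  // compile the expression once
    Matcher matcher = pattern.matcher(text);   // bind the pattern to one input
    List<String> matches = new ArrayList<>();
    while (matcher.find()) {                   // advance to the next match
      matches.add(matcher.group());            // text of the whole match
    }
    return matches;
  }

  public static void main(String[] args) {
    // Capitalized words only.
    System.out.println(findAll("[A-Z][a-z]+", "Stanford is in California"));
    // prints [Stanford, California]
  }
}
```

The same compile, bind, find loop applies to token sequences, with the pattern written in the TokensRegex language rather than as a character regex.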
The classes {@link edu.stanford.nlp.ling.tokensregex.SequenceMatcher} and {@link edu.stanford.nlp.ling.tokensregex.SequencePattern} can be used to build classes for recognizing regular expressions over sequences of arbitrary types.
{@link edu.stanford.nlp.ling.tokensregex.MultiPatternMatcher} provides utility functions for finding expressions with multiple patterns. For instance, using {@link edu.stanford.nlp.ling.tokensregex.MultiPatternMatcher#findNonOverlapping} you can find all non-overlapping subsequences for a given set of patterns.
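To make the idea of non-overlapping matches concrete, here is a minimal sketch of one way to pick a non-overlapping subset from candidate matches produced by several patterns. It is not the MultiPatternMatcher implementation: the class NonOverlappingSketch, the Span type, and the tie-breaking rule (longer matches first, then earlier start) are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of TokensRegex.
class NonOverlappingSketch {

  /** A candidate match covering token positions [begin, end). */
  static final class Span {
    final int begin;
    final int end;

    Span(int begin, int end) {
      this.begin = begin;
      this.end = end;
    }

    int length() {
      return end - begin;
    }

    boolean overlaps(Span other) {
      return begin < other.end && other.begin < end;
    }

    @Override
    public String toString() {
      return "[" + begin + "," + end + ")";
    }
  }

  /** Greedily keeps candidates that do not overlap an already kept one. */
  static List<Span> selectNonOverlapping(List<Span> candidates) {
    List<Span> ordered = new ArrayList<>(candidates);
    // Assumed preference: longer spans first, ties broken by earlier start.
    ordered.sort((a, b) -> a.length() != b.length()
        ? Integer.compare(b.length(), a.length())
        : Integer.compare(a.begin, b.begin));

    List<Span> chosen = new ArrayList<>();
    for (Span candidate : ordered) {
      boolean clash = false;
      for (Span kept : chosen) {
        if (candidate.overlaps(kept)) {
          clash = true;
          break;
        }
      }
      if (!clash) {
        chosen.add(candidate);
      }
    }
    // Report the kept spans in document order.
    chosen.sort((a, b) -> Integer.compare(a.begin, b.begin));
    return chosen;
  }

  public static void main(String[] args) {
    List<Span> candidates = Arrays.asList(
        new Span(0, 2), new Span(1, 4), new Span(4, 5), new Span(3, 5));
    System.out.println(selectNonOverlapping(candidates));
  }
}
```

A greedy pass like this is simple but depends entirely on the chosen ordering; a different priority rule would keep a different subset.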
To find the character offsets of multi-word expressions in a String, you can also use {@link MultiWordStringMatcher#findTargetStringOffsets}.
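As an illustration of what locating a multi-word expression by character offset means, the sketch below uses only java.util.regex. It does not reproduce MultiWordStringMatcher: the class MultiWordOffsetsSketch and the method findOffsets are hypothetical, and the sketch assumes the expression begins and ends with word characters and that its words may be separated by any run of whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of TokensRegex.
class MultiWordOffsetsSketch {

  /** Returns {begin, end} character offsets (end exclusive) of each occurrence. */
  static List<int[]> findOffsets(String text, String expression) {
    String[] words = expression.trim().split("\\s+");
    // Quote each word literally and allow any whitespace run between words.
    // The \b anchors assume the expression starts and ends with word characters.
    StringBuilder regex = new StringBuilder("\\b");
    for (int i = 0; i < words.length; i++) {
      if (i > 0) {
        regex.append("\\s+");
      }
      regex.append(Pattern.quote(words[i]));
    }
    regex.append("\\b");

    Matcher matcher = Pattern.compile(regex.toString()).matcher(text);
    List<int[]> offsets = new ArrayList<>();
    while (matcher.find()) {
      offsets.add(new int[] {matcher.start(), matcher.end()});
    }
    return offsets;
  }

  public static void main(String[] args) {
    for (int[] span : findOffsets("New York and New  York", "New York")) {
      System.out.println(span[0] + "-" + span[1]);
    }
  }
}
```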