Calculating Word Frequencies with Regular Expressions

import java.io.FileInputStream;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCount {
  public static void main(String args[]) throws Exception {
    // Path of the file to analyze, taken from the first command-line argument
    String filename = args[0];

    // Map File from filename to byte buffer
    FileInputStream input = new FileInputStream(filename);
    FileChannel channel = input.getChannel();
    int fileLength = (int) channel.size();
    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0,
        fileLength);

    // Convert to character buffer
    Charset charset = Charset.forName("ISO-8859-1");
    CharsetDecoder decoder = charset.newDecoder();
    CharBuffer charBuffer = decoder.decode(buffer);

    // Create line pattern
    Pattern linePattern = Pattern.compile(".*$", Pattern.MULTILINE);

    // Create word pattern
    Pattern wordBreakPattern = Pattern.compile("[\\p{Punct}\\s]");

    // Match line pattern to buffer
    Matcher lineMatcher = linePattern.matcher(charBuffer);

    // TreeMap keeps the words sorted alphabetically
    Map<String, Integer> map = new TreeMap<>();
    Integer ONE = Integer.valueOf(1);

    // For each line
    while (lineMatcher.find()) {
      // Get line
      CharSequence line = lineMatcher.group();

      // Get array of words on line
      String words[] = wordBreakPattern.split(line);

      // For each word
      for (int i = 0, n = words.length; i < n; i++) {
        if (words[i].length() > 0) {
          Integer frequency = (Integer) map.get(words[i]);
          if (frequency == null) {
            frequency = ONE;
          } else {
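            int value = frequency.intValue();
            frequency = Integer.valueOf(value + 1);
          }
          map.put(words[i], frequency);
        }
      }
    }

    // Print the words and their frequencies, in the TreeMap's sorted key order
    System.out.println(map);
  }
}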
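
The counting step in the listing above can also be tried on an in-memory string, without mapping a file. The following is a minimal sketch rather than part of the original example: the class name WordFrequencySketch and the method countWords are hypothetical, the delimiter pattern adds a + quantifier so that runs of punctuation and whitespace count as one break, and Map.merge (Java 8 or later) stands in for the explicit null check. Counting is case-sensitive, as in the listing.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class WordFrequencySketch {
  // Assumed delimiter: one or more punctuation or whitespace characters
  private static final Pattern WORD_BREAK = Pattern.compile("[\\p{Punct}\\s]+");

  // Hypothetical helper: counts how often each word occurs in the text
  static Map<String, Integer> countWords(CharSequence text) {
    // TreeMap keeps the words in sorted order
    Map<String, Integer> counts = new TreeMap<>();
    for (String word : WORD_BREAK.split(text)) {
      // A leading delimiter produces an empty first element, so skip it
      if (!word.isEmpty()) {
        // Insert 1 for a new word, otherwise add 1 to the stored count
        counts.merge(word, 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // prints {cat=1, end=1, hat=1, the=3}
    System.out.println(countWords("the cat, the hat; the end"));
  }
}
```

Splitting the whole text at once works here because whitespace, including line breaks, is already part of the delimiter class; the line-by-line matching in the listing is not needed for the counts themselves.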
