Ruby - Counting Sentences and Paragraphs in a Text File

Introduction

Sentences and paragraphs have different splitting criteria.

Sentences end with full stops, question marks, and exclamation marks.

Sentences can also be separated by dashes and other punctuation, but we won't worry about these rare cases here.

Instead of asking Ruby to split the text on one type of character, you simply ask it to split on any of three types of characters, like so:

Demo

lines = File.readlines("main.rb") 
line_count = lines.size
text = lines.join 
sentence_count = text.split(/\.|\?|!/).length

Let's look at the regular expression directly:

/\.|\?|!/ 

The forward slashes at the start and the end are the usual delimiters for a regular expression, so those can be ignored.

The first section is \., and this represents a full stop.

You can't just use . without the backslash.

An unescaped . represents "any character" (other than a newline) in a regular expression, so it needs to be escaped with the backslash to match a literal full stop.
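
To see why the backslash matters, here is a small sketch that splits the same string with and without it. The constant SAMPLE is made up for this illustration.

```ruby
SAMPLE = "One. Two. Three."

# Unescaped: . matches every character, so every character becomes a
# delimiter and no text is left between the delimiters.
puts SAMPLE.split(/./).length   # prints 0

# Escaped: only literal full stops are delimiters.
puts SAMPLE.split(/\./).length  # prints 3
```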

A question mark in a regular expression usually means "zero or one instances of the previous character", so it is also escaped, as \?, to match a literal question mark.

The ! is not escaped, as it has no other meaning in terms of regular expressions.
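
If you are unsure whether a character needs a backslash, Ruby's Regexp.escape can tell you. The short sketch below runs it on the three characters used here.

```ruby
# Regexp.escape returns a copy of the string with any characters that have
# a special meaning in a regular expression escaped with a backslash.
%w[. ? !].each do |char|
  puts "#{char} -> #{Regexp.escape(char)}"
end
# prints:
# . -> \.
# ? -> \?
# ! -> !
```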

The pipes | separate the three alternatives. Each pipe means "or", so split treats a match of any one of the three characters as a delimiter.
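
The same pattern can be written in other ways. As a sketch, the code below compares the alternation with a character class and with Regexp.union, which escapes each string and joins them with pipes. The constant names are made up for this illustration.

```ruby
ALTERNATION = /\.|\?|!/
CHAR_CLASS  = /[.?!]/                      # inside [...] the . and ? are literal
UNION       = Regexp.union(".", "?", "!")  # escapes each string, joins with |

text = "Test! I. It? Yes."
[ALTERNATION, CHAR_CLASS, UNION].each do |pattern|
  puts "#{pattern.inspect} gives #{text.split(pattern).length} pieces"
end
```

All three patterns split this string into the same four pieces, so which one you use is largely a matter of taste.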

puts "Test! I. It? Yes.".split(/\.|\?|!/).length # prints 4
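
One caveat: splitting on single characters overcounts when punctuation is doubled up, as in "..." or "?!", because the empty strings between the delimiters are counted too. The sketch below shows one possible adjustment. The method names are hypothetical, and it still miscounts abbreviations such as "e.g." and decimals such as 3.14.

```ruby
# The approach used above: every single ., ? or ! is a delimiter.
def naive_sentence_count(text)
  text.split(/\.|\?|!/).length
end

# A possible refinement (an assumption, not part of the original demo).
def sentence_count(text)
  # [.?!]+ treats a run such as "..." or "?!" as a single delimiter;
  # reject drops fragments that contain only whitespace.
  text.split(/[.?!]+/).reject { |fragment| fragment.strip.empty? }.length
end

sample = "Wait... what?! Really."
puts naive_sentence_count(sample)  # prints 6
puts sentence_count(sample)        # prints 3
```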

Paragraphs can be split by a double newline. For example:

text = %q{ 
This is a test of 
paragraph one. 

This is a test of 
paragraph two. 

This is a test of 
paragraph three. 
} 

puts text.split(/\n\n/).length # prints 3
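
Splitting on exactly two newlines assumes that paragraphs are separated by exactly one empty line. As a sketch under that observation, the hypothetical paragraph_count method below also copes with several blank lines in a row and with blank lines that contain spaces or tabs.

```ruby
def paragraph_count(text)
  # \n\s*\n matches a newline, any amount of whitespace (including further
  # newlines), and a final newline, so a run of blank lines is one delimiter.
  text.split(/\n\s*\n/).reject { |para| para.strip.empty? }.length
end

sample = "One.\n\n\n\nTwo.\n\nThree.\n"
puts sample.split(/\n\n/).length  # prints 4, because of the extra blank lines
puts paragraph_count(sample)      # prints 3
```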

Demo

lines = File.readlines("main.rb") 
line_count = lines.size
text = lines.join 

paragraph_count = text.split(/\n\n/).length 
puts "#{paragraph_count} paragraphs" 

sentence_count = text.split(/\.|\?|!/).length 
puts "#{sentence_count} sentences"
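
The counting steps can also be wrapped in a method that works on any string, which makes them easier to try out without a file. This is only a sketch: the name text_stats and the sample text are made up, and the patterns are the same simple ones used above.

```ruby
# Hypothetical helper: returns line, paragraph and sentence counts for a string.
def text_stats(text)
  {
    lines:      text.lines.length,
    paragraphs: text.split(/\n\n/).length,
    sentences:  text.split(/\.|\?|!/).length
  }
end

sample = "Hello there. How are you?\n\nFine!"
text_stats(sample).each { |name, count| puts "#{count} #{name}" }
# prints:
# 3 lines
# 2 paragraphs
# 3 sentences

# To analyse a file instead, read it in one call (the file name is an assumption):
# stats = text_stats(File.read("text.txt"))
```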
