How to Tokenize Words and Sentences using NLTK and NLP

Mohit Varikuti
3 min read · Aug 4, 2021


The NLTK module is a large toolkit designed to assist you with many aspects of Natural Language Processing (NLP). NLTK can help with anything from splitting paragraphs into sentences and sentences into words to identifying the part of speech of those words, highlighting the main subjects, and even helping your computer make sense of the text. In this article, we’ll look at the first of those steps: tokenizing text into sentences and words.

You’ll need the NLTK module as well as Python to get started.

If you don’t already have Python, go to Python.org and download the most recent version for your operating system. On a Debian- or Ubuntu-based Linux system, you can also install it from the shell:

sudo apt-get install python3

After that, you’ll need NLTK 3. pip3 is the most convenient way to install the NLTK module.

Open cmd.exe, bash, or whichever shell you use and type:

pip3 install nltk

After that, we must install some of the NLTK data packages. Open Python using your preferred method and type:

import nltk
nltk.download()

Unless you’re running headless, a GUI will appear.

Choose “all” to download all packages, then click “download.” This gives you all of the tokenizers, chunkers, other algorithms, and corpora. If you’re short on space, you can instead select and download only the packages you need. The NLTK module itself takes up around 7MB, while the full nltk data directory, which includes your chunkers, parsers, and corpora, takes up about 1.8GB.

If you are running headless, the same two lines open a text-based downloader instead of the GUI. At its prompt, enter:

d (for download)

all (for download everything)

That will download everything for you without you having to do anything.
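If disk space is tight, a much smaller download is usually enough to follow this article. The sketch below fetches only the sentence tokenizer models. The package names "punkt" and "punkt_tab" are an assumption about what your NLTK release looks for (older releases load the first, newer ones the second), and the helper name and its optional download argument are hypothetical, added only so the function can be tried without a network connection.

```python
# Package ids are an assumption: older NLTK releases load "punkt",
# newer ones load "punkt_tab", so both are requested.
TOKENIZER_PACKAGES = ("punkt", "punkt_tab")


def download_tokenizer_data(download=None):
    """Fetch only the tokenizer models and report which downloads succeeded."""
    if download is None:
        import nltk  # imported here so the file loads even without NLTK

        download = nltk.download
    # nltk.download returns True on success and False on failure.
    return {name: bool(download(name, quiet=True)) for name in TOKENIZER_PACKAGES}
```

Calling download_tokenizer_data() with no arguments uses nltk.download and needs a network connection.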

One word you’ll hear constantly when you first join the Natural Language Processing (NLP) field is “token”: each piece that a text is split into, whether that piece is a sentence or a word. There are many more terms, but that one is enough for now. With that, let’s look at an example of how the NLTK module can be used to tokenize text.

from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "The dog was called Wellington. It belonged to Mrs. Shears who was our friend. She lived on the opposite side of the road, two houses to the left."

print(sent_tokenize(EXAMPLE_TEXT))

At first glance, tokenizing by words or sentences may appear to be a simple task. For many sentences, it is. A simple .split('. '), that is, dividing on a period followed by a space, would most likely be the first attempt. Then you might use regular expressions to split on a period, a space, and a capital letter. The issue is that names like Mrs. Shears, among others, would give you problems. Splitting by word can be difficult as well, especially when dealing with contractions such as “we” and “are” becoming “we’re.” The procedure looks simple but is really quite complicated, and NLTK is going to save you a ton of time with it.
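To see the problem concretely, here is a minimal sketch of the two naive approaches applied to the same example text, using only the standard library. The variable names and the exact regular expression are illustrative choices, not part of NLTK.

```python
import re

EXAMPLE_TEXT = (
    "The dog was called Wellington. "
    "It belonged to Mrs. Shears who was our friend. "
    "She lived on the opposite side of the road, two houses to the left."
)

# Attempt 1: split on a period followed by a space.
naive_pieces = EXAMPLE_TEXT.split(". ")

# Attempt 2: split on whitespace that follows a period and precedes a capital letter.
regex_pieces = re.split(r"(?<=\.)\s+(?=[A-Z])", EXAMPLE_TEXT)

for piece in regex_pieces:
    print(repr(piece))

# Both attempts produce 4 pieces instead of 3 sentences,
# because "Mrs." gets cut off from "Shears".
print(len(naive_pieces), len(regex_pieces))
```

Handling abbreviations like this one is exactly what the sentence tokenizer takes care of for you.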

The above code will generate a list of sentences, which you can iterate over using a for loop:

['The dog was called Wellington.', 'It belonged to Mrs. Shears who was our friend.', 'She lived on the opposite side of the road, two houses to the left.']
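As a small sketch of that loop, the snippet below numbers each sentence. The list is copied from the output above so the example runs on its own; in your own script you would assign the result of sent_tokenize(EXAMPLE_TEXT) instead.

```python
# Copied from the sent_tokenize output so this snippet is self-contained.
sentences = [
    "The dog was called Wellington.",
    "It belonged to Mrs. Shears who was our friend.",
    "She lived on the opposite side of the road, two houses to the left.",
]

# Pair every sentence with its position, counting from 1.
numbered = [f"{number}: {sentence}" for number, sentence in enumerate(sentences, start=1)]

for line in numbered:
    print(line)
```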

So we now have tokens, which in this case are sentences. Let’s tokenize by word this time instead:

print(word_tokenize(EXAMPLE_TEXT))

Now our output is: ['The', 'dog', 'was', 'called', 'Wellington', '.', 'It', 'belonged', 'to', 'Mrs.', 'Shears', 'who', 'was', 'our', 'friend', '.', 'She', 'lived', 'on', 'the', 'opposite', 'side', 'of', 'the', 'road', ',', 'two', 'houses', 'to', 'the', 'left', '.']

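There are a couple of things to keep in mind here. First and foremost, note that punctuation is treated as a separate token, while the period in “Mrs.” stays attached to the abbreviation. The example text has no contractions or hyphenated words, but they are worth knowing about: word_tokenize splits a contraction such as “shouldn’t” into “should” and “n’t,” and it keeps a hyphenated word such as “pinkish-blue” as the single word it was intended to be. It’s really cool!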
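The contraction and hyphen behavior is easy to check on a sentence of your own. This sketch swaps in TreebankWordTokenizer, a closely related rule-based tokenizer in NLTK, because it needs no downloaded data; the sample sentence is made up for illustration.

```python
from nltk.tokenize import TreebankWordTokenizer

# A rule-based word tokenizer that works without any NLTK data packages.
tokenizer = TreebankWordTokenizer()

# Made-up sentence containing a contraction and a hyphenated word.
tokens = tokenizer.tokenize("You shouldn't paint it pinkish-blue.")
print(tokens)
```

The contraction comes back as two tokens, while the hyphenated word stays in one piece.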

Now that we have these tokenized words in front of us, we must consider our next steps. We begin to think about how we could draw meaning from these words. Many words clearly carry value for that purpose, but a handful are essentially useless. These are known as “stop words,” and they are something we can deal with as well. Anyway, that’s pretty much it for NLP and NLTK fundamentals; it’s both entertaining and complex.
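As a rough taste of what dealing with stop words looks like, here is a minimal sketch in plain Python. The stop-word set is a tiny hand-picked assumption for this example, not an official list, and the helper name is hypothetical.

```python
# Tokens for the first two sentences, copied from the word_tokenize output.
tokens = [
    "The", "dog", "was", "called", "Wellington", ".",
    "It", "belonged", "to", "Mrs.", "Shears", "who", "was", "our", "friend", ".",
]

# Hypothetical, hand-picked stop words for illustration only.
STOP_WORDS = {"the", "was", "it", "to", "who", "our"}


def is_punctuation(token):
    """Return True when the token contains no letter or digit at all."""
    return not any(ch.isalnum() for ch in token)


# Keep tokens that are neither stop words nor pure punctuation.
content_words = [
    token
    for token in tokens
    if token.lower() not in STOP_WORDS and not is_punctuation(token)
]
print(content_words)
```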
