Create a Word Cloud Using R

Intro

Word clouds are an interesting way to visualize the frequency of words in a data set. As with anything, there are multiple ways to create a word cloud in R. The approach I find easiest uses the "tm" package, which helps clean up the data by removing special characters, punctuation, and other noise. The "wordcloud2" package is then used to display the most common words visually in the form of a word cloud.

UFO Sightings

To demonstrate this approach, let's create a word cloud from the descriptions of UFO sightings collected by the National UFO Reporting Center. The first thing we'll do is load our packages and the data.

# Load packages
library(tidyverse)
library(kableExtra)
library(tm)
library(wordcloud2)

# read data
data <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-25/ufo_sightings.csv")

To keep our example efficient, we'll grab the "description" field for just the first 1,000 rows.

# select the field in the dataset from which you want to create the word cloud
Field <- head(data$description, 1000) # for illustrative purposes, only the first 1,000 rows are used

We don't want common words (a.k.a. stop words) or punctuation to show up in the word cloud, as they add little or no value. The "tm" package makes easy work of removing them.

# the "tm" package cleans up the data in the field by separating all of the words and normalizing them
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
Field2 <- as.character(Field)
Field2 <- Corpus(VectorSource(Field2))
Field2 <- tm_map(Field2, toSpace, "/")
Field2 <- tm_map(Field2, toSpace, "@")
Field2 <- tm_map(Field2, toSpace, "\\|")
Field2 <- tm_map(Field2, content_transformer(tolower))
Field2 <- tm_map(Field2, removeNumbers)
Field2 <- tm_map(Field2, removePunctuation)
Field2 <- tm_map(Field2, stripWhitespace)
Field2 <- tm_map(Field2, removeWords, stopwords("english"))

Once we've cleaned up the data, we can count how many times each word appears. The end result is a data frame of words and their occurrence counts.

# this section of code takes all of the individual words and determines their frequency
dtm <- TermDocumentMatrix(Field2)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
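Before plotting, it can help to sanity-check the counts. We loaded kableExtra earlier, so a quick formatted table of the ten most frequent words is one option (a minimal sketch; the `kbl()` and `kable_styling()` calls are standard kableExtra functions, but this preview step is my addition, not part of the original workflow):

# preview the ten most frequent words as a formatted table
head(d, 10) %>%
  kbl(row.names = FALSE) %>%
  kable_styling(full_width = FALSE)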

Creating the Word Cloud

The last step is to pass the word frequency data frame to the "wordcloud2" package. As in the example below, it's usually a good idea to limit the plot to the top words for aesthetic purposes. Feel free to share your examples with me or reach out if you have any questions.

# this creates the word cloud image
wordcloud2(d[1:50, ])
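wordcloud2() also takes appearance arguments such as size, color, and backgroundColor if you want to tweak the look. A hedged sketch (the specific values here are just illustrative choices, not recommendations from the original post):

# a slightly smaller cloud with a random dark palette on a white background
wordcloud2(d[1:50, ], size = 0.6, color = "random-dark", backgroundColor = "white")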
Michael Molloy
Senior Manager, IT Risk & Security

vCISO | Cybersecurity | IT Compliance | Data Analytics