The first part of this series explains typical steps that a survey analyst takes to prepare free-text comments for Natural Language Processing (NLP). In this part, we describe four ways that NLP can draw insights from a cleaned and standardized corpus.
From the document-term matrix, software can total how many times each word appears in the corpus as a whole (the text of all the free-text questions combined). A bar chart of the most frequent words can depict those totals. The bar chart below, which shows the most frequently used significant words in the first 25 posts from this blog, makes the thrust of the writing clear with little effort. The word “survey” leads the way with almost 250 appearances. At the bottom end, “people” appears about 40 times. Note that the words have not been lemmatized, so different forms of the same word (a singular and its plural, for example) are counted separately.
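To make the totaling step concrete, here is a minimal Python sketch. It is not the code behind the chart in this post: the three comments are invented for illustration, the names `comments`, `vocabulary`, `dtm`, `totals`, and `top_words` are hypothetical, and the text is assumed to be already cleaned and split on whitespace. It builds a small document-term matrix, sums each column to get corpus-wide totals, and prints a crude text bar chart.

```python
from collections import Counter

# Hypothetical cleaned comments; real input would be the prepared survey text.
comments = [
    "survey question design",
    "survey response rate",
    "question wording affects survey response",
]

# Vocabulary: one column per distinct word, in alphabetical order.
vocabulary = sorted({word for text in comments for word in text.split()})

# Document-term matrix: one row per comment, one count per vocabulary word.
dtm = []
for text in comments:
    counts = Counter(text.split())
    dtm.append([counts[word] for word in vocabulary])

# Corpus-wide totals are the column sums of the matrix.
totals = {word: sum(row[i] for row in dtm) for i, word in enumerate(vocabulary)}

# Most frequent words first; ties are broken alphabetically.
top_words = sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Crude text bar chart of the five most frequent words.
for word, count in top_words[:5]:
    print(f"{word:<10} {'#' * count} {count}")
```

On a real corpus, a plotting library would normally draw the bars, but the totals feeding the chart would be computed in much the same way.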
A more visually compelling way to convey a similar finding relies on a word cloud plot. Below is a word cloud of the seventy-five most frequent words used in the same blog. The location of a word in the cloud has no meaning, but its font size and color, which you can pick, correspond to its relative frequency. Thus, the largest words, “survey” and “question”, appear most often.
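One simple way to tie font size to relative frequency is linear scaling between a minimum and a maximum size. The Python sketch below assumes that approach; actual word cloud packages may scale differently. The function `font_sizes`, its parameters, and the counts are hypothetical and chosen only for illustration.

```python
def font_sizes(frequencies, min_size=10, max_size=60, top_n=75):
    """Map each of the top_n words to a font size scaled by its count."""
    # Keep only the top_n most frequent words; ties are broken alphabetically.
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))[:top_n]
    if not ranked:
        return {}
    highest = ranked[0][1]
    lowest = ranked[-1][1]
    span = highest - lowest
    sizes = {}
    for word, count in ranked:
        # Linear interpolation; if every count is equal, use the maximum size.
        share = (count - lowest) / span if span else 1.0
        sizes[word] = round(min_size + share * (max_size - min_size))
    return sizes


# Hypothetical word counts, not the totals from this blog.
frequencies = {"survey": 30, "question": 20, "people": 10}
sizes = font_sizes(frequencies)
print(sizes)
```

The most frequent word gets `max_size` and the least frequent of those kept gets `min_size`, which is why position can be arbitrary while size still carries the meaning.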