How to use unnest_tokens function: How to split by words in R text mining

The unnest_tokens() function in R's tidytext package splits text data into tokens. It reshapes text into the 'tidy data' format, which makes it useful for text mining and natural language processing. The function creates a new row for each token and carries the other columns along unchanged; the column that contained the text is dropped by default.

Original Korean article: How to use unnest_tokens function: How to split by words in R text mining

The unnest_tokens function is a key tool in R text mining to divide sentences or documents into words. To analyze text data, you must first break sentences into tokens and connect them to the next step, such as word frequency or sentiment analysis. This article summarizes the tidytext-based tokenization flow and usage of the unnest_tokens function.

1. unnest_tokens() concept

The unnest_tokens() function is included in R's tidytext package and is used to tokenize text data. Tokenization is the process of breaking down long text strings into smaller units, such as words or sentences.

This function is useful for converting text data into a form that is easy to process and analyze. For example, you can separate words that make up a single document or sentence.

1) Installation and library loading

install.packages("tidytext")
library(tidytext)

2) Basic usage of unnest_tokens

The basic form of the function is as follows. The names data, output_column, and input_column are descriptive placeholders; in the package documentation these arguments are called tbl, output, and input.

unnest_tokens(data, output_column, input_column, token = "words", ...)
  • data: The data frame to tokenize.
  • output_column: The name of the new column in which to store the tokens.
  • input_column: The name of the column containing the text to tokenize.
  • token: The type of token (default is "words").

Example

library(dplyr)
library(tibble)
library(tidytext)

# Create example data with an id column and a text column
data <- tibble(id = c(1, 2), text = c("I love R", "Data science is awesome"))

# Split the text column into words, one word per row
tokenized_data <- data %>%
  unnest_tokens(word, text)

# Print the result
print(tokenized_data)

In this example, we used a tibble dataframe with an id column and a text column. We applied the unnest_tokens() function to tokenize the text in the text column into words, and stored the results in a new word column.
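
To make the shape of the result concrete, the following is a minimal sketch that repeats the example above and inspects the output. It assumes the default settings (word tokens, lowercasing on, input column dropped); the object names are the same illustrative ones used in the example.

```r
library(dplyr)
library(tidytext)

# Same illustrative data as in the example above
data <- tibble(id = c(1, 2), text = c("I love R", "Data science is awesome"))

tokenized_data <- data %>%
  unnest_tokens(word, text)

# The id column is kept, the text column is replaced by the word column
names(tokenized_data)      # "id" "word"

# 3 words from the first document + 4 from the second
nrow(tokenized_data)       # 7

# Tokens are lowercased by default
tokenized_data$word[1:3]   # "i" "love" "r"
```

Each row still carries the id of the document it came from, which is what makes later grouping and counting by document possible.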

Additional options

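  • drop: If set to FALSE, the input column is kept in the results.
  • to_lower: If set to FALSE, the original case of the text is preserved.
  • strip_numeric, strip_punct, etc.: Additional text cleaning options, passed on to the underlying tokenizer.
  • collapse: Controls whether text is combined across rows before tokenization.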

This function is very flexible and can be applied to a variety of text data. It can be used with several tokenization options and other tidytext functions to perform more complex text analysis tasks.

  • Token: The basic unit of text analysis, usually a word, phrase, or sentence.
  • Tidy Text: A text data format in which each token (for example, each word) occupies one row and is stored together with a document ID or other identifier.

2. unnest_tokens parameters

The unnest_tokens() function has several options that allow you to fine-tune the process of tokenizing text. I'll explain some of the main options below.

1) Basic parameters:

  1. data: The data frame to tokenize.
  2. output_column: The name of the new column in which to store the token.
  3. input_column: Name of the column containing the text to tokenize.
  4. token: Type of token (e.g. “words”, “characters”, etc.)

2) Additional options:

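  1. drop: logical type. Whether to remove the input column from the results. The default is TRUE.
  2. to_lower: logical type. Whether to convert all characters to lowercase. The default is TRUE.
  3. strip_numeric: logical type. Whether to remove numbers. Passed on to the word tokenizer; the default is FALSE.
  4. strip_punct: logical type. Whether to remove punctuation. Passed on to the word tokenizer; for word tokens the default is TRUE, so punctuation is removed unless you set it to FALSE.
  5. collapse: character vector or NULL. The names of the columns across whose rows the text is joined before tokenization, which matters for tokens such as n-grams or sentences that can span several rows. The default is NULL (no collapsing).

Example
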
library(dplyr)
library(tidytext)

# Create example data with an id column and a text column
data <- tibble(id = c(1, 2), text = c("I love R", "Data science is awesome"))

# Tokenize into words, keeping the original text column and the original case
tokenized_data <- data %>%
  unnest_tokens(word, text, drop = FALSE, to_lower = FALSE)

# Print the result
print(tokenized_data)

3) Remarks

  • If you set drop = FALSE, the original input column is retained in the result after tokenization.
  • If you set to_lower = FALSE, case is preserved.
  • If you set strip_numeric = TRUE, numbers are removed.
  • Punctuation is removed from word tokens by default; set strip_punct = FALSE to keep it.

By combining these options, you can increase the precision of tokenization or simplify the preprocessing process.
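
As a rough illustration of how these cleaning options change the result, the sketch below tokenizes one made-up sentence three ways. The sample text and object names are assumptions chosen for the example; strip_numeric and strip_punct are forwarded to the word tokenizer in the tokenizers package.

```r
library(dplyr)
library(tidytext)

# Hypothetical sample sentence containing a number and punctuation
sample_text <- tibble(id = 1, text = "R 4 is fun, really!")

# Default settings: lowercased words, punctuation removed, numbers kept
default_tokens <- sample_text %>%
  unnest_tokens(word, text)

# Drop tokens that are numbers
no_numbers <- sample_text %>%
  unnest_tokens(word, text, strip_numeric = TRUE)

# Keep punctuation marks as their own tokens
with_punct <- sample_text %>%
  unnest_tokens(word, text, strip_punct = FALSE)

print(default_tokens)
print(no_numbers)
print(with_punct)
```

Comparing the three printed results side by side is usually the quickest way to decide which combination suits a given corpus.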

  • input: The name of the column containing the text to tokenize.
  • output: The name of the new column in which to store the tokenized results.
  • token: Determines the unit of tokenization. Options include "words", "characters", "ngrams", "sentences", "lines", "paragraphs", and "regex".
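
The token option can be sketched as follows, using a made-up two-sentence text. The data and column names are illustrative assumptions; the n argument is passed on to the n-gram tokenizer.

```r
library(dplyr)
library(tidytext)

# Hypothetical sample text with two sentences
sample_text <- tibble(
  id = 1,
  text = "Text mining is fun. Tokenization comes first."
)

# Two-word sequences (bigrams)
bigrams <- sample_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Whole sentences
sentences <- sample_text %>%
  unnest_tokens(sentence, text, token = "sentences")

nrow(bigrams)       # 6 bigrams from 7 words
bigrams$bigram[1]   # "text mining"
nrow(sentences)     # 2
```

Naming the output column after the unit (bigram, sentence) rather than always calling it word keeps later steps easier to read.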

4) Omitting argument names

The output and input arguments of unnest_tokens() are required and have no default values, so they cannot be left out entirely. A call that supplies only token, such as unnest_tokens(token = "sentences"), fails instead of falling back to the first column of the data frame. What you can omit are the argument names: the first unnamed argument after the data frame is read as the output column and the second as the input column. So you can use the following form:

data %>%
  unnest_tokens(sentence, text, token = "sentences")

However, with positional arguments it can be hard to tell from the code alone which column is being tokenized and which column receives the tokens. For readability and maintainability, it is recommended to name them explicitly, as in unnest_tokens(output = sentence, input = text, token = "sentences"), so that readers of the code, and anyone modifying it later, can see what each column means.

To download R, use the download link on the official R website (https://www.r-project.org/).

Key Checklist

  • Is the text column to be analyzed clear?
  • Have you decided which unit to divide into: sentences, words, or n-grams?
  • Is there a plan to remove stop words and analyze frequencies after tokenization?
  • Have you confirmed whether morpheme analysis is necessary in processing Korean text?
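
For the stop-word item in the checklist, one common pattern is to anti-join the tokens against the stop_words data set that ships with tidytext. The following is a minimal sketch with made-up English sentences; stop_words covers English only, so it is not a substitute for morpheme analysis of Korean text.

```r
library(dplyr)
library(tidytext)

# Hypothetical sample documents
docs <- tibble(
  id = c(1, 2),
  text = c("The cat is on the mat", "Data science is awesome")
)

tokens <- docs %>%
  unnest_tokens(word, text)

# Remove rows whose word appears in tidytext's English stop word list
cleaned <- tokens %>%
  anti_join(stop_words, by = "word")

print(cleaned)
```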

FAQ

What does the unnest_tokens function do?

The unnest_tokens function breaks long text into smaller units that can be analyzed. The result, split into sentences, words, n-grams, and so on, is returned as a data frame that can be used for frequency analysis or visualization.

Why is tokenization needed in R text mining?

A computer cannot easily analyze the meaning of an entire document at once. Tokenization is a preprocessing step that divides text into units of analysis, such as words, so that frequencies, co-occurrences, and sentiment scores can be calculated.

How do unnest_tokens results relate to word frequency analysis?

The tokenization result usually has one word per row. Afterwards, you can use dplyr functions such as count, group_by, and arrange to create a word frequency table or a list of top keywords.
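
A word frequency table of this kind can be sketched as below. The documents are made up for illustration, and count(..., sort = TRUE) is used as a shorthand for grouping, counting, and arranging in descending order.

```r
library(dplyr)
library(tidytext)

# Hypothetical sample documents
docs <- tibble(
  id = c(1, 2),
  text = c("R is fun and R is free", "Text mining in R")
)

tokens <- docs %>%
  unnest_tokens(word, text)

# Overall word frequencies, most frequent first
word_counts <- tokens %>%
  count(word, sort = TRUE)

# Word frequencies within each document
word_counts_by_doc <- tokens %>%
  count(id, word, sort = TRUE)

word_counts$word[1]   # "r"
word_counts$n[1]      # 3
```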

FAQ

What is this article about?

This article is an English translation and global-reader adaptation of the Korean post “How to use unnest_tokens function: How to split by words in R text mining.” It preserves the original article’s main explanation, examples, and practical context.

Why is it translated into English?

The English version helps global readers access Thinknote articles through English search keywords while keeping the Korean source available as the original reference.

Where can I read the original Korean version?

You can read the original Korean article here: https://www.thinknote.co.kr/r-unnest-tokens-function/