How to Use unnest_tokens() in R Text Mining

The unnest_tokens() function, part of the tidytext package for R, splits text data into tokens. It returns the result in the ‘tidy data’ format, with one token per row, which makes it useful for text mining and natural language processing. Columns other than the one containing the text are left intact.

Original Korean article: How to use unnest_tokens function: How to split by words in R text mining

The unnest_tokens function is a key tool in R text mining for splitting sentences or documents into words. Before you can analyze text data, you first break it into tokens, which then feed the next step, such as word frequency counts or sentiment analysis. This article summarizes the tidytext-based tokenization workflow and how to use the unnest_tokens function.

1. unnest_tokens() concept

The unnest_tokens() function is included in R's tidytext package and is used to tokenize text data. Tokenization is the process of breaking down long text strings into smaller units, such as words or sentences.

This function is useful for converting text data into a form that is easy to process and analyze. For example, you can separate words that make up a single document or sentence.

1) Installation and library loading

install.packages("tidytext")
library(tidytext)

2) Basic usage of unnest_tokens

The basic form of the function is:

unnest_tokens(tbl, output, input, token = "words", ...)

tbl: The data frame to tokenize.
output: The name of the new column in which to store the tokens.
input: The name of the column containing the text to tokenize.
token: The unit of tokenization (default is “words”).

Example

library(dplyr)
library(tibble)
library(tidytext)

# Create example data with an id column and a text column
data <- tibble(id = c(1, 2), text = c("I love R", "Data science is awesome"))

# Tokenize the text column into words, stored in a new word column
tokenized_data <- data %>%
  unnest_tokens(word, text)

# Print the result
print(tokenized_data)

In this example, we used a tibble dataframe with an id column and a text column. We applied the unnest_tokens() function to tokenize the text in the text column into words, and stored the results in a new word column.
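
As a quick check of the behavior described above, the following sketch inspects the result of the same example. It assumes the dplyr and tidytext packages are installed; the object name tokens_per_doc is only illustrative.

```r
library(dplyr)
library(tidytext)

# Same example data as above
data <- tibble(id = c(1, 2), text = c("I love R", "Data science is awesome"))

tokenized_data <- data %>%
  unnest_tokens(word, text)

# The id column is kept; the text column is replaced by the word column
names(tokenized_data)

# One row per token, so counting rows per id gives the tokens per document
tokens_per_doc <- tokenized_data %>%
  count(id)

print(tokens_per_doc)
```

Because drop and to_lower are left at their defaults here, the original text column is dropped and every token is lowercased.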

Additional options

drop: If set to FALSE, the input column is kept in the results.
to_lower: If set to FALSE, tokens keep their original case instead of being converted to lowercase.
strip_numeric, strip_punct, collapse, etc.: Additional options that control text cleaning and how text is combined before tokenizing.

This function is very flexible and can be applied to a variety of text data. It can be used with several tokenization options and other tidytext functions to perform more complex text analysis tasks.

Token: The basic unit of text analysis, usually a word, phrase, or sentence.
Tidy Text: A text data format in which each token (for example, each word) occupies one row and is stored together with a document or other identifier.

2. unnest_tokens() parameters

The unnest_tokens() function has several options that allow you to fine-tune the process of tokenizing text. I'll explain some of the main options below.

1) Basic parameters:

tbl: The data frame to tokenize (the first argument, usually supplied through the pipe).
output: The name of the new column in which to store the tokens.
input: The name of the column containing the text to tokenize.
token: The unit of tokenization (e.g. “words”, “characters”, etc.).

2) Additional options:

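drop: logical. Whether to remove the input column from the results. The default is TRUE.
to_lower: logical. Whether to convert tokens to lowercase. The default is TRUE.
strip_numeric: logical. Whether to remove numbers. This option is passed on to the word tokenizer of the tokenizers package; the default is FALSE.
strip_punct: logical. Whether to remove punctuation. This option is also passed on to the word tokenizer; the default is TRUE, so punctuation is removed unless you set it to FALSE.
collapse: In recent tidytext versions, a character vector of column names across which text is combined before tokenizing, which helps when sentences or paragraphs span several rows. The default is NULL.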

library(dplyr)
library(tidytext)

# Create example data with an id column and a text column
data <- tibble(id = c(1, 2), text = c("I love R", "Data science is awesome"))

# Tokenize into words, keep the original text column, and preserve case
tokenized_data <- data %>%
  unnest_tokens(word, text, drop = FALSE, to_lower = FALSE)

# Print the result
print(tokenized_data)

3) Remarks

If you set drop = FALSE, the original input column is retained in the result even after tokenization.
If you set to_lower = FALSE, case is preserved.
If you set strip_numeric = TRUE, numbers are removed.
Punctuation is removed by default for word tokens; set strip_punct = FALSE to keep it.

By combining these options, you can increase the precision of tokenization or simplify the preprocessing process.
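
The effect of the cleaning options can be sketched as follows. This assumes word tokenization (the default), where strip_numeric and strip_punct are handed on to the tokenizers package; the reviews data frame and its sentence are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical example data containing a number and punctuation
reviews <- tibble(id = 1, text = "I bought 2 books, and both were great!")

# Defaults for word tokens: punctuation removed, numbers kept
default_tokens <- reviews %>%
  unnest_tokens(word, text)

# strip_numeric = TRUE drops tokens that are numbers
no_numbers <- reviews %>%
  unnest_tokens(word, text, strip_numeric = TRUE)

# strip_punct = FALSE keeps punctuation marks as separate tokens
with_punct <- reviews %>%
  unnest_tokens(word, text, strip_punct = FALSE)

print(default_tokens$word)
print(no_numbers$word)
print(with_punct$word)
```

Comparing the three results side by side is a simple way to confirm which defaults your installed package versions actually apply.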

The token argument determines the unit of tokenization. Built-in options include ‘words’, ‘characters’, ‘ngrams’, ‘sentences’, ‘lines’, ‘paragraphs’, and ‘regex’.
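
A minimal sketch of two of these units, bigrams and sentences, is shown below. The docs data frame is invented for illustration, and n = 2 is an extra argument that unnest_tokens() passes on to the n-gram tokenizer.

```r
library(dplyr)
library(tidytext)

# Hypothetical example data; the first document contains two sentences
docs <- tibble(
  id = c(1, 2),
  text = c("Text mining is fun. R makes it easy.", "Data science is awesome")
)

# Bigrams: every pair of consecutive words becomes one token
bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Sentences: each sentence becomes one token
sentences <- docs %>%
  unnest_tokens(sentence, text, token = "sentences")

print(bigrams)
print(sentences)
```

Naming the output column after the unit (bigram, sentence) keeps later steps readable, since the column name tells you what each row represents.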

4) Omitting argument names

The output and input arguments have no default values, so they cannot be left out of the call. A call such as unnest_tokens(token = "sentences") fails, because the function is not told which column to read or where to store the tokens. What you can omit is the argument names, passing the two columns by position in the order output, then input:

data %>%
  unnest_tokens(sentence, text, token = "sentences")

However, with positional arguments it can be difficult to tell from the code alone which column is being tokenized and which column stores the tokens, especially because the output column comes before the input column. For readability and maintainability, it is recommended to name the arguments explicitly, for example unnest_tokens(output = sentence, input = text, token = "sentences"). This makes it easier for readers of your code, and for you when you modify it later, to see what each column means.

R itself can be downloaded from the official R project website (https://www.r-project.org/).

Key Checklist

Is the text column to be analyzed clear?
Have you decided which unit to divide into: sentences, words, or n-grams?
Is there a plan to remove stop words and analyze frequencies after tokenization?
Have you confirmed whether morpheme analysis is necessary in processing Korean text?
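
The stop-word and frequency step from the checklist can be sketched as follows. This is one common approach, not the only one: it uses the stop_words data set that ships with tidytext, which covers English only, so Korean text would need its own stop-word list. The docs data frame is made up for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical example data
docs <- tibble(
  id = c(1, 2),
  text = c("I love R and I love data", "Data science is awesome")
)

word_counts <- docs %>%
  unnest_tokens(word, text) %>%            # one row per lowercased word
  anti_join(stop_words, by = "word") %>%   # drop rows whose word is an English stop word
  count(word, sort = TRUE)                 # frequency table, most frequent first

print(word_counts)
```

Because unnest_tokens() lowercases tokens by default, the tokens match the lowercase entries in stop_words without any extra cleaning.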

FAQ

What is this article about?

This article is part of Thinknote’s English R statistics and data-analysis archive. It explains research, measurement, text processing, or tidyverse-style workflow concepts in practical language.

How should I use this guide?

Use it as a learning note and starter reference. When applying code, adjust package versions, object names, and dataset structure to your own R environment.

Where can I read the original Korean article?

The original Korean article is available here: Original Korean article.
