How to use C++ for natural language processing and text analysis


WBOY

Published: 2024-06-03 18:06:01


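To use C++ for natural language processing, this article installs the Boost.Regex, ICU, and pugixml libraries. It then describes how to build a stemmer that reduces words to their stems and a bag-of-words model that represents a text as a vector of word frequencies. Finally, it shows how to combine tokenization, stemming, and the bag-of-words model to analyze a text and print its tokens, stems, and word counts.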


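Natural language processing and text analysis with C++

Natural language processing (NLP) is the field concerned with using computers to process, analyze, and generate human language. This article explains how to use the C++ programming language for NLP and text analysis.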

Installing the required libraries

Install the following libraries:

Boost.Regex
ICU for C++
pugixml

On Ubuntu, the following command installs them:

sudo apt install libboost-regex-dev libicu-dev libpugixml-dev


Creating a stemmer

A stemmer reduces a word to its stem (root form), here by stripping common English suffixes.

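#include <boost/algorithm/string/predicate.hpp>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Suffix rules as (suffix, replacement) pairs, tried in order. Longer suffixes
// come first so that "es" is matched before "s".
const std::vector<std::pair<std::string, std::string>> stemmer_rules = {
    {"ing", ""},
    {"ed", ""},
    {"es", ""},
    {"s", ""}
};

// Shortest stem a rule may leave behind; keeps short words such as "is" intact.
constexpr std::size_t min_stem_length = 3;

// Apply the first rule whose suffix matches the end of the word.
// Replacing the suffix anywhere in the word would corrupt it
// ("processing" contains "es"), so only the word ending is examined.
// This is still a naive stemmer: "languages" becomes "languag", for example.
std::string stem(const std::string& word) {
    for (const auto& rule : stemmer_rules) {
        if (word.size() >= rule.first.size() + min_stem_length &&
            boost::algorithm::ends_with(word, rule.first)) {
            return word.substr(0, word.size() - rule.first.size()) + rule.second;
        }
    }
    return word;
}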
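
Rule order matters whenever suffixes overlap, as "es" and "s" do. One way to make the outcome independent of rule order is to pick the longest suffix that matches the end of the word. The following is a minimal sketch under that assumption, using only the standard library; the helper name `strip_longest_suffix` and the `min_stem` parameter are hypothetical and not part of the original article.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: strip the longest suffix from `suffixes` that matches
// the end of `word`, provided at least `min_stem` characters remain.
// If no suffix qualifies, the word is returned unchanged.
std::string strip_longest_suffix(const std::string& word,
                                 const std::vector<std::string>& suffixes,
                                 std::size_t min_stem = 3) {
    std::size_t best = 0;  // length of the longest matching suffix found so far
    for (const auto& suffix : suffixes) {
        const bool fits = word.size() >= suffix.size() + min_stem;
        if (fits && suffix.size() > best &&
            word.compare(word.size() - suffix.size(), suffix.size(), suffix) == 0) {
            best = suffix.size();
        }
    }
    return word.substr(0, word.size() - best);
}
```

With the suffixes "s", "ed", "ing", and "es" given in any order, `boxes` becomes `box`, `processing` becomes `process`, and `is` is left unchanged because stripping would leave fewer than three characters.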


Creating a bag-of-words model

A bag-of-words model represents a text as a vector of word frequencies, ignoring word order.

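#include <map>
#include <string>
#include <vector>

// Defined in the stemmer section above.
std::string stem(const std::string& word);

// Count how often each stem occurs in the token list.
std::map<std::string, int> create_bag_of_words(const std::vector<std::string>& tokens) {
    std::map<std::string, int> bag_of_words;
    for (const auto& token : tokens) {
        // operator[] value-initializes a missing count to 0 before the increment.
        ++bag_of_words[stem(token)];
    }
    return bag_of_words;
}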
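
The map above stores counts keyed by word. To obtain an actual fixed-length vector, for example to compare two documents, the counts can be projected onto a vocabulary with a fixed order. The following is a minimal sketch of that step; the helper name `to_frequency_vector` and the idea of passing the vocabulary explicitly are assumptions, not part of the original article.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: project a word-count map onto a fixed vocabulary.
// Position i of the result holds the count of vocabulary[i]. Words missing
// from the map get 0, and words outside the vocabulary are ignored.
std::vector<int> to_frequency_vector(const std::map<std::string, int>& bag_of_words,
                                     const std::vector<std::string>& vocabulary) {
    std::vector<int> frequencies;
    frequencies.reserve(vocabulary.size());
    for (const auto& word : vocabulary) {
        const auto it = bag_of_words.find(word);
        frequencies.push_back(it != bag_of_words.end() ? it->second : 0);
    }
    return frequencies;
}
```

For the vocabulary "computer", "language", "python" and the counts computer: 2, language: 1, the: 1, the resulting vector is 2, 1, 0. Two documents projected onto the same vocabulary yield vectors of equal length that can be compared element by element.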


Practical example

The following demo uses the code above to analyze a text.

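#include <cctype>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Split the text into lowercase words. Every non-letter character
// (spaces, periods, commas, parentheses) acts as a delimiter.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string token;
    for (unsigned char ch : text) {
        if (std::isalpha(ch)) {
            token += static_cast<char>(std::tolower(ch));
        } else if (!token.empty()) {
            tokens.push_back(token);
            token.clear();
        }
    }
    if (!token.empty()) {
        tokens.push_back(token);  // flush the last word
    }
    return tokens;
}

int main() {
    std::string text = "Natural language processing is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages.";

    // Tokenize, then print the stem of each token
    std::vector<std::string> tokens = tokenize(text);
    for (const auto& token : tokens) {
        std::cout << stem(token) << " ";
    }
    std::cout << std::endl;

    // Build the bag-of-words model and print each stem with its count
    // (structured bindings require C++17)
    std::map<std::string, int> bag_of_words = create_bag_of_words(tokens);
    for (const auto& [word, count] : bag_of_words) {
        std::cout << word << ": " << count << std::endl;
    }
}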


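Output, assuming stem() strips only a matching word-final suffix and leaves at least three characters, and tokenize() lowercases the text and splits on every non-letter character:

natural language process is a subfield of linguistic computer science information engineer and artificial intelligence concern with the interaction between computer and human natural languag
a: 1
and: 2
artificial: 1
between: 1
computer: 2
concern: 1
engineer: 1
human: 1
information: 1
intelligence: 1
interaction: 1
is: 1
languag: 1
language: 1
linguistic: 1
natural: 2
of: 1
process: 1
science: 1
subfield: 1
the: 1
with: 1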
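
The bag of words prints in alphabetical order because `std::map` keeps its keys sorted. For analysis it is often more useful to see the most frequent stems first. The following is a minimal sketch of that ranking step, using only the standard library; the helper name `top_words` is hypothetical and not part of the original article.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: return up to `n` (word, count) pairs, highest count first.
// Ties are broken alphabetically so the result is deterministic.
std::vector<std::pair<std::string, int>> top_words(const std::map<std::string, int>& bag_of_words,
                                                   std::size_t n) {
    std::vector<std::pair<std::string, int>> entries(bag_of_words.begin(), bag_of_words.end());
    std::sort(entries.begin(), entries.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;  // higher count first
        return a.first < b.first;                               // then alphabetical
    });
    if (entries.size() > n) {
        entries.resize(n);
    }
    return entries;
}
```

Calling `top_words(bag_of_words, 3)` on the map built in `main` would return the three most frequent stems. Copying the map into a vector is needed because a `std::map` can only be ordered by key, not by value.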
