[Natural Language Processing (NLP)] 2. NLP Workflow_2 (Practice)

ITselfhiam

2024. 1. 17. 16:48

Data Preprocessing Practice

1. Install the required module

!pip install newspaper3k

※ newspaper3k is a library for crawling news articles.

2. Import the required module

import newspaper

3. Check the supported languages

newspaper.languages()

4. Import Article

from newspaper import Article

5. Set the news article link and create the object

url = 'https://v.daum.net/v/ySnyNE6FqA'

# Create the Article object (language='ko' selects Korean)
article = Article(url, language='ko')

6. Download and parse the article at the link

article.download()
article.parse()

print('title', article.title)
print('content', article.text)

7. Add arbitrary data and check it

additional_info = [
    "✼ 기자 김사과(apple@apple.com) 취재 반하나(banana@banana.com)",
    "<h2> '기생충' 봉준호가 제작하고 '미나리' 정이삭이 감독하는 신작 </h2>",
    "이 기사는 임시 데이터임을 알립니다 ... ",
    "Copyright@koreait.com",
    "<br> 👉 이 기사는 문화 섹션으로 분류했습니다 ... </br>",
    "#기사 #문화 #기생충 #미나리"
]

context = article.text.split('\n')
context += additional_info

for i, text in enumerate(context):
    print(i, text)
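
One thing to notice in this step: splitting on '\n' keeps blank lines as empty strings, which is why the later cleaning functions skip empty results. Below is a minimal sketch of that behavior. `sample_text` and `extra` are hypothetical stand-ins for `article.text` and `additional_info`, so nothing is downloaded.

```python
# sample_text is a hypothetical stand-in for article.text (no download needed)
sample_text = "First paragraph.\n\nSecond paragraph."
extra = ["Copyright@koreait.com"]

context = sample_text.split('\n')  # the blank line becomes an empty string ''
context += extra                   # in-place extension, same as context.extend(extra)

for i, text in enumerate(context):
    print(i, repr(text))           # repr() makes the empty string visible
```

Index 1 prints as '' here. The functions in the following steps drop entries like this with an `if text` check.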

8. Remove stopwords and print the result

delete_stopwords() : takes the data, removes the stopwords, and returns the result.

stopwords = ['👉', '✼', '...']

def delete_stopwords(context):
    preprocessed_text = []
    for text in context:
        # Keep only the space-delimited tokens that are not in the stopword list
        text = [w for w in text.split(' ') if w not in stopwords]
        preprocessed_text.append(' '.join(text))
    return preprocessed_text
    
preprocessed_context = delete_stopwords(context)

for i, text in enumerate(preprocessed_context):
    print(i, text)
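
The function above compares whole space-delimited tokens, so a stopword attached to another word (for example '👉This') is left in place. If that matters for your data, one possible variant is to remove the stopwords wherever they appear. The sketch below is only an illustration: `delete_stopwords_anywhere` is a hypothetical name, and it assumes it is acceptable to delete '...' even in the middle of a word.

```python
import re

stopwords = ['👉', '✼', '...']

def delete_stopwords_anywhere(context):
    """Remove stopwords wherever they occur, not only as whole tokens."""
    # Escape each stopword so characters like '.' are matched literally;
    # longer stopwords go first so they win over shorter overlaps.
    pattern = re.compile('|'.join(
        re.escape(w) for w in sorted(stopwords, key=len, reverse=True)))
    cleaned = []
    for text in context:
        text = pattern.sub('', text)
        # Collapse runs of spaces left behind by the removal
        cleaned.append(re.sub(r' {2,}', ' ', text).strip())
    return cleaned

samples = ["👉This article is temporary data...", "✼ Reporter Kim", "a ... b"]
print(delete_stopwords_anywhere(samples))
# ['This article is temporary data', 'Reporter Kim', 'a b']
```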

9. Remove HTML tags and print the result

delete_html_tag() : takes the data, removes the HTML tags, and returns the result.

import re

def delete_html_tag(context):
    preprocessed_text = []
    html_tag = re.compile(r'<.*?>')  # non-greedy: matches one tag at a time

    for text in context:
        text = html_tag.sub('', text).strip()  # remove tags, then trim whitespace
        if text:  # skip lines that are empty after cleaning
            preprocessed_text.append(text)
    return preprocessed_text
    
preprocessed_context = delete_html_tag(preprocessed_context)

for i, text in enumerate(preprocessed_context):
    print(i, text)
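
The '?' in '<.*?>' is what makes this pattern work. The short comparison below shows the greedy and non-greedy forms side by side on a made-up line. It also uses the standard library's `html.unescape`, which the original code does not call, as one possible way to handle entities such as '&amp;' that remain after the tags are removed.

```python
import re
import html

text = "<h2> Title &amp; subtitle </h2>"  # made-up sample line

greedy = re.sub(r'<.*>', '', text)       # matches from the first '<' to the last '>'
non_greedy = re.sub(r'<.*?>', '', text)  # matches each tag separately

print(repr(greedy))                       # ''
print(repr(non_greedy))                   # ' Title &amp; subtitle '
print(html.unescape(non_greedy).strip())  # Title & subtitle
```

With the greedy form the whole line disappears, because the match runs from the opening tag to the end of the closing tag.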

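10. Split sentences

When building training data, it is hard to decide what a single input unit should be, so the text is split into sentences to have the model learn sentence by sentence.
Among the available Korean sentence splitters, we use the kss library (https://github.com/hyunwoongko/kss).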
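
To get a feel for what a sentence splitter has to do, here is a deliberately naive stand-in written with the standard library only. It is not kss and does not reflect how kss works internally. `naive_split_sentences` is a hypothetical helper that simply splits after '.', '!' or '?' when whitespace follows.

```python
import re

def naive_split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(naive_split_sentences("The film opens in March. Will it succeed? Critics are hopeful!"))
# ['The film opens in March.', 'Will it succeed?', 'Critics are hopeful!']

# The rule breaks on abbreviations, which is one reason to use a dedicated splitter
print(naive_split_sentences("Dr. Kim arrived."))
# ['Dr.', 'Kim arrived.']
```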

10-1. Install the required packages

# Install kss
!pip install kss

# Install python-mecab-kor to avoid errors
!pip install python-mecab-kor

11. Import kss

import kss

12. Split sentences and print the result

def sentence_separator(context):
    split_context = []

    for text in context:
        text = text.strip()
        if text:
            # kss.split_sentences returns a list of sentences for one string
            split_context.extend(kss.split_sentences(text))
    return split_context

preprocessed_context = sentence_separator(preprocessed_context)

for i, text in enumerate(preprocessed_context):
    print(i, text)

13. Remove email addresses and print the result

def delete_email(context):
    preprocessed_text = []
    for text in context:
        # '-' goes last in each character class so it is literal; '+-_' would be a range
        text = re.sub(r'[a-zA-Z0-9+_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+', '', text).strip()
        if text:
            preprocessed_text.append(text)
    return preprocessed_text
    
preprocessed_context = delete_email(preprocessed_context)
for i, text in enumerate(preprocessed_context):
    print(i, text)
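
A detail worth checking in any email pattern is where '-' sits inside a character class. Between two characters it denotes a range, so '+-_' covers every character from '+' to '_' in code-point order, including ':', '@' and the uppercase letters. The sketch below compares two patterns defined only for this illustration, `loose_email` and `strict_email`, on a made-up string.

```python
import re

# '-' between '+' and '_' creates a range (0x2B to 0x5F)
loose_email = re.compile(r'[a-zA-Z0-9+-_.]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
# '-' placed last in the class is a literal hyphen
strict_email = re.compile(r'[a-zA-Z0-9+_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+')

text = "see:a@b.com"  # made-up sample

print(repr(loose_email.sub('', text)))   # '' because ':' falls inside the range
print(repr(strict_email.sub('', text)))  # 'see:'

# Parentheses around an address are left behind either way
print(strict_email.sub('', "Reporter Kim (apple@apple.com)"))  # Reporter Kim ()
```

As the last line shows, removing the address still leaves an empty '()' in the text, so an extra cleanup step may be needed depending on the data.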

14. Remove hashtags and print the result

def delete_hashtag(context) : 
    preprocessed_text = []
    for text in context : 
        text = re.sub(r'#\S+', '', text).strip()  # raw string keeps \S as a regex escape
        if text : 
            preprocessed_text.append(text)
    return preprocessed_text

preprocessed_context = delete_hashtag(preprocessed_context)
for i, text in enumerate(preprocessed_context):
    print(i, text)
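
The HTML, email, and hashtag steps all follow the same shape: delete a pattern from each line, trim it, and drop the line if nothing is left. As a rough sketch of how they could be chained, the code below builds each step from a pattern and applies them in order. `remove_pattern`, `drop_empty`, and `run_pipeline` are hypothetical helpers, not part of the original code, and the email pattern places '-' last in each character class so it is read literally.

```python
import re
from functools import reduce

def drop_empty(lines):
    """Trim each line and discard the ones that end up empty."""
    return [t.strip() for t in lines if t.strip()]

def remove_pattern(pattern):
    """Build a step that deletes every match of `pattern` from each line."""
    compiled = re.compile(pattern)
    def step(lines):
        return drop_empty(compiled.sub('', t) for t in lines)
    return step

# Hypothetical pipeline: HTML tags, then email addresses, then hashtags
steps = [
    remove_pattern(r'<.*?>'),
    remove_pattern(r'[a-zA-Z0-9+_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+'),
    remove_pattern(r'#\S+'),
]

def run_pipeline(lines, steps):
    """Feed the output of each step into the next one."""
    return reduce(lambda acc, step: step(acc), steps, lines)

sample = ["<h2> Headline </h2>", "Contact: apple@apple.com", "#news #culture"]
print(run_pipeline(sample, steps))
# ['Headline', 'Contact:']
```

Keeping each step as a separate function makes it easy to reorder the steps or print the intermediate result after any one of them, as done throughout this post.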
