(3) [Python/NLP] Text Preprocessing

1. Converting XML to a DataFrame
2. Preprocessing the Data
1) Handling nulls
2) Text preprocessing
References

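In the previous post, I fetched data through the OpenAPI provided by KCI.

In this post, I will preprocess that data so that a model can learn from it.

1. Converting XML to a DataFrame

First, let's look at the tags in the data we received.

From these, I stored only the tag names I plan to use in name_list, and collected the contents of those tags in data_list.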

# After inspecting the data, list only the columns we need
name_list = [
    'pub-year',
    'pub-mon',
    'article-title-original',
    'article-title-english',
    'author',
    'abstract-original',
    'abstract-english',
    'url',
]

data_list = []  # collects one row_list per article

# `rows` is assumed to hold the per-article elements parsed from the API response
for row in rows:
    cols = row.find_all()
    row_list = [0] * len(name_list)  # fresh row; 0 marks a slot that has not been filled

    for col in cols:
        name = col.name
        lang = col['lang'] if col.has_attr('lang') else ''
        text = col.text
        # Store the text in the row_list slot that matches its name_list column
        if name == 'pub-year': row_list[0] = text
        elif name == 'pub-mon': row_list[1] = text
        elif name == 'url': row_list[7] = text
        elif name == 'author' and row_list[4] == 0: row_list[4] = text  # first author only
        elif name == 'article-title':
            if lang == 'original': row_list[2] = text
            elif lang == 'english': row_list[3] = text
        elif name == 'abstract':
            if lang == 'original': row_list[5] = text
            elif lang == 'english': row_list[6] = text

    data_list.append(row_list)  # add the finished row to data_list

Titles and abstracts come in both Korean and English, so I stored the two versions separately.

When an article has several authors, only the first one is stored.
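
The column mapping above can also be written as a lookup on the column names instead of an if/elif chain. The following is a minimal sketch, not the code used in this post: it swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs on its own, and the <records>/<record> wrapper and all sample values are made up for illustration. Only the tag names and the lang attribute come from the loop above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real KCI response is larger and may be nested differently.
SAMPLE_XML = """
<records>
  <record>
    <pub-year>2020</pub-year>
    <pub-mon>12</pub-mon>
    <article-title lang="original">원제목</article-title>
    <article-title lang="english">English title</article-title>
    <author>First Author</author>
    <author>Second Author</author>
    <abstract lang="english">English abstract</abstract>
    <url>https://example.com/article</url>
  </record>
</records>
"""

COLUMNS = [
    'pub-year', 'pub-mon',
    'article-title-original', 'article-title-english',
    'author',
    'abstract-original', 'abstract-english',
    'url',
]

def record_to_row(record):
    """Map one <record> element to a dict keyed by column name."""
    row = dict.fromkeys(COLUMNS)  # None marks a missing field
    for child in record.iter():
        if child is record:
            continue
        lang = child.get('lang')
        # Tags with a lang attribute map to "<tag>-<lang>" columns.
        key = f'{child.tag}-{lang}' if lang else child.tag
        # Keep only the first value seen, which also keeps only the first author.
        if key in row and row[key] is None:
            row[key] = (child.text or '').strip()
    return row

rows_as_dicts = [record_to_row(r) for r in ET.fromstring(SAMPLE_XML).iter('record')]
print(rows_as_dicts[0]['author'])             # first author only
print(rows_as_dicts[0]['abstract-original'])  # None, because the sample has no Korean abstract
```

Using None rather than 0 for missing fields means pandas will later treat them as nulls, which matters for the null check in section 2.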

Let's print the list to check that everything was stored correctly.

The resulting list is then turned into a pandas.DataFrame.

import pandas as pd

# Build a DataFrame from the parsed XML values
df = pd.DataFrame(data_list, columns=name_list)
df.head(5)

# Besides df.head(n), df.describe() and df.info() are also useful for inspecting the DataFrame

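2. Preprocessing the Data

1) Handling nulls

Before data goes through a model, it has to be preprocessed into a form the model can work with.

Among these steps, null handling is needed not only for text but for every other kind of data as well.

Let's check whether the DataFrame we created has any rows with null values.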

# Inspect the DataFrame
df.info()

Fortunately, the DataFrame I created had no null rows, so I could skip null handling.
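
One caveat worth checking: the parsing loop in section 1 fills every slot with 0 before reading the tags, so a field that is absent from the XML stays 0 instead of becoming null, and df.info() will still count it as non-null. Below is a minimal sketch of a stricter check. It assumes a small made-up sample rather than the real KCI data, and the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical two-row sample shaped like the parsed data; the second row is
# missing its English abstract, so the placeholder 0 is still in that slot.
sample = pd.DataFrame(
    [
        ["2021", "03", "Some abstract text."],
        ["2020", "11", 0],
    ],
    columns=["pub-year", "pub-mon", "abstract-english"],
)

# The placeholder is a real value, so pandas reports no nulls here.
nulls_before = sample.isna().sum()

# Turn placeholder 0 and empty strings into proper missing values.
cleaned = sample.mask(sample.isin([0, ""]))
nulls_after = cleaned.isna().sum()

# Drop rows whose English abstract is missing.
usable = cleaned.dropna(subset=["abstract-english"])

print(nulls_before)
print(nulls_after)
print(len(usable))
```

If the counts after masking are all zero on the real data, skipping null handling is safe; otherwise rows without an English abstract would break the tokenization step later on.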

2) Text preprocessing

Let's look at the paper abstract that will be used for the analysis.

'This paper collects and examines Korean neologisms related to COVID-19 and analyzes their usage patterns. It also considers the methodology for the study of topic-specific neologisms. In order to achieve this, the paper follows two research frameworks: 1) “from COVID-19 to language”; 2) “from language to COVID-19”. The former examines the impact of COVID-19 on language and explores methods of collecting of COVID-19-related neologisms, for which it was first necessary to define the category of COVID-19 neologisms and discuss their distinguishing criteria. The second research framework explores how language informs us on the COVID-19 situation. The analysis of the time when a given COVID-19 neologism first appeared, the changes in occurrence frequency between January and July 2020, and the characteristics of the semantic categories shed light on the changes in South Korea’s politics, society, economy, culture and the overall lifestyle of Koreans owing to COVID-19. The value of this paper not only consists of examining how COVID-19 neologisms embody the introduction of and changes in political measures, the public perceptions, and the general issues regarding culture, society and economy, but also resides in our discussion of the methodology for extracting topic-specific neologisms and related issues.'

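For now, I saw three main things to deal with:

Punctuation
Upper and lower case
Words that carry little meaning on their own (the, a, an, of, and so on)

Other posts I found also describe stemming and lemmatization, which reduce a word to its base form regardless of how it is inflected, but I skipped that step.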
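
For reference, here is roughly what the skipped step could look like. This is only a sketch using NLTK's PorterStemmer, picked here as one example of a stemmer; it was not part of the pipeline in this post, and the token list is a hand-picked sample.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()  # rule-based, needs no extra NLTK downloads

# Hand-picked sample tokens, already lowercased and stripped of punctuation.
tokens = ['neologisms', 'neologism', 'examines', 'examining', 'changes']

stems = [stemmer.stem(token) for token in tokens]

# Print each token next to its stem to see which forms get merged.
for token, stem in zip(tokens, stems):
    print(token, '->', stem)
```

Stems are not always dictionary words. Lemmatization (for example NLTK's WordNetLemmatizer) returns dictionary forms instead, but it needs the WordNet corpus to be downloaded first.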

import nltk
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

nltk.download('punkt')
nltk.download('stopwords')

tokenizer = RegexpTokenizer(r'[a-zA-Z]+')  # match runs of letters only
stop_words = set(nltk.corpus.stopwords.words('english'))

def nltk_tokenizer(_wd):
    lower = _wd.lower()  # convert to lowercase
    no_punctuation = tokenizer.tokenize(lower)  # drop punctuation and tokenize
    filtered_tokens = [token for token in no_punctuation if token not in stop_words]  # remove stopwords
    return filtered_tokens

df['abstract_token'] = df['abstract-english'].apply(nltk_tokenizer)

print("Abstract Tokens:")
print(df['abstract_token'][0])

Output

Abstract Tokens:


['paper', 'collects', 'examines', 'korean', 'neologisms', 'related', 'covid', 'analyzes', 'usage', 'patterns', 'also', 'considers', 'methodology', 'study', 'topic', 'specific', 'neologisms', 'order', 'achieve', 'paper', 'follows', 'two', 'research', 'frameworks', 'covid', 'language', 'language', 'covid', 'former', 'examines', 'impact', 'covid', 'language', 'explores', 'methods', 'collecting', 'covid', 'related', 'neologisms', 'first', 'necessary', 'define', 'category', 'covid', 'neologisms', 'discuss', 'distinguishing', 'criteria', 'second', 'research', 'framework', 'explores', 'language', 'informs', 'us', 'covid', 'situation', 'analysis', 'time', 'given', 'covid', 'neologism', 'first', 'appeared', 'changes', 'occurrence', 'frequency', 'january', 'july', 'characteristics', 'semantic', 'categories', 'shed', 'light', 'changes', 'south', 'korea', 'politics', 'society', 'economy', 'culture', 'overall', 'lifestyle', 'koreans', 'owing', 'covid', 'value', 'paper', 'consists', 'examining', 'covid', 'neologisms', 'embody', 'introduction', 'changes', 'political', 'measures', 'public', 'perceptions', 'general', 'issues', 'regarding', 'culture', 'society', 'economy', 'also', 'resides', 'discussion', 'methodology', 'extracting', 'topic', 'specific', 'neologisms', 'related', 'issues']
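
Notice that "COVID-19" shows up as just covid, and "topic-specific" as two separate tokens. That follows from using a letters-only pattern, which drops digits and splits on hyphens. The sketch below reproduces the effect with the standard re module (RegexpTokenizer in its default mode returns the pattern's matches, much like re.findall) and compares it with an alternative pattern. The alternative is an assumption for illustration, not something used in this post.

```python
import re

text = 'methodology for extracting topic-specific neologisms related to COVID-19'.lower()

# Letters only: digits are dropped and hyphenated words are split.
letters_only = re.findall(r'[a-zA-Z]+', text)

# Hypothetical alternative: letters or digits, optionally joined by hyphens.
keep_hyphens = re.findall(r'[a-z0-9]+(?:-[a-z0-9]+)*', text)

print(letters_only)
print(keep_hyphens)
```

Whether covid-19 should stay a single token depends on the downstream model. The simpler letters-only pattern is what produced the output above.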

References

Text Preprocessing With NLTK (towardsdatascience.com)

02) 정제(Cleaning) and 정규화(Normalization) (wikidocs.net)

11) 문서 벡터를 이용한 추천 시스템(Recommendation System using Document Embedding) (wikidocs.net)

