Running BERT on Google Colab (with a TPU)

Before we begin,

If you want to learn how to use Google Colaboratory first, please see the link below.

https://jisoo-coding.tistory.com/2


#The problem to solve

: This post covers classifying text into 34 target classes with BERT in a Google Colaboratory TPU environment.

- News title(text) classification

- 34 labels

1. The first thing to prepare is the dataset!

You can download it from the link below!

drive.google.com/drive/folders/1JFrsWLtHDuG7MyL856RX8d2OXurpByZQ?usp=sharing

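guid: the instance id (of little practical importance)

label: the label value

text_a: not needed (only a single text is considered, so this column is simply filled with "a"; for train.tsv and dev.tsv the CoLA processor in run_classifier.py reads the text from the fourth column and ignores this one)

text_b: the text input to train on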

- Create the dataset with the columns in the same order as the header above (the columns are told apart by index when the data is read)
- Remove the header row
- Split the data into train.tsv (for training) and dev.tsv (for evaluation)
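The layout described above can be sketched in Python as follows. This is only a minimal sketch, assuming the raw data is available as a list of (label, text) pairs; the helper names `to_tsv_line` and `write_split`, the 10% dev ratio, and the fixed seed are illustrative assumptions and not part of the original dataset or the BERT code.

```python
import random


def to_tsv_line(guid, label, text):
    """Build one headerless row: guid, label, filler "a", text."""
    # Tabs and newlines inside the text would break the column layout,
    # so all whitespace runs are collapsed to single spaces.
    clean = " ".join(str(text).split())
    return "\t".join([str(guid), str(label), "a", clean])


def write_split(pairs, train_path, dev_path, dev_ratio=0.1, seed=0):
    """Shuffle (label, text) pairs and write train/dev TSV files without a header."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_dev = max(1, int(len(pairs) * dev_ratio))
    splits = {dev_path: pairs[:n_dev], train_path: pairs[n_dev:]}
    for path, rows in splits.items():
        with open(path, "w", encoding="utf-8") as f:
            for i, (label, text) in enumerate(rows):
                f.write(to_tsv_line(i, label, text) + "\n")
    return len(pairs) - n_dev, n_dev
```

Calling `write_split(pairs, "train.tsv", "dev.tsv")` would produce the two files in the current directory; the guid here is just the row index within each file.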

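2. Prepare the code!

- git clone the official BERT code ( https://github.com/google-research/bert )
- Modify run_classifier.py
1. Change the return value of get_labels in the Processor you want to use (CoLA here)
2. There are 34 labels here, so make it return a list of 34 labels

(↓↓↓Code provided for those who can't be bothered↓↓↓)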

["0","1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33"]
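The same list can also be generated instead of typed out. A minimal sketch, assuming the labels are the integers 0 through 33 written as strings (which is what the literal above contains); `NUM_LABELS` and this standalone `get_labels` function are illustrative stand-ins for the processor method in run_classifier.py.

```python
NUM_LABELS = 34  # number of target classes in this task


def get_labels():
    """Return the label strings "0" .. "33", matching the label column of the TSV files."""
    return [str(i) for i in range(NUM_LABELS)]
```

Generating the list this way keeps the label count in one place if the number of classes changes later.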

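3. Create a Google Cloud environment

When training is accelerated with a TPU on Colab, storing the model parameters on the regular Drive causes errors

So you need to create a bucket in Google Cloud beforehand

(Honestly, it takes about a minute)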
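Once the bucket exists, the output directory handed to the training code should point at it with a `gs://` path rather than at a Drive folder. A minimal sketch, assuming a placeholder bucket; `BUCKET`, `TASK`, the `bert/models` sub-path, and the helper `gcs_output_dir` are all hypothetical names to be replaced with your own.

```python
BUCKET = "your-bucket-name"  # hypothetical; use the bucket you created
TASK = "news_title"          # hypothetical task name


def gcs_output_dir(bucket, task):
    """Build the Cloud Storage path where checkpoints will be written."""
    if not bucket or "/" in bucket:
        raise ValueError("bucket must be a bare bucket name, not a path")
    return "gs://{}/bert/models/{}".format(bucket, task)


OUTPUT_DIR = gcs_output_dir(BUCKET, TASK)
print(OUTPUT_DIR)
```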

4. Set up the environment on Google Drive

- Create a folder
- Upload the entire bert code into the folder -> rename the bert folder to bert_repo
- Upload the dataset into the folder
- Create a Colab file

5. Finally, putting Colab to work

1. Change the runtime to TPU

2. Grant access to Google Drive

from google.colab import drive

drive.mount('/content/drive/')

3. Change the current directory (this is the code you find by searching for 'change directory' at https://jisoo-coding.tistory.com/2 )
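For reference, changing the directory could look like the sketch below. It is not the code from the linked post; the folder name in `PROJECT_DIR`, the exact mount path under /content/drive, and the helper `enter_project_dir` are assumptions for illustration.

```python
import os
import sys

# Hypothetical path; replace with the folder created in step 4.
PROJECT_DIR = "/content/drive/My Drive/bert_project"


def enter_project_dir(project_dir, repo_name="bert_repo"):
    """Change into the project folder and make the BERT code importable."""
    os.chdir(project_dir)
    repo_path = os.path.join(project_dir, repo_name)
    if repo_path not in sys.path:
        sys.path.append(repo_path)
    return os.getcwd()


# Only runs where the Drive folder actually exists (inside Colab, after mounting).
if os.path.isdir(PROJECT_DIR):
    print(enter_project_dir(PROJECT_DIR))
```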

4. Copy and paste all of the code from the notebook below (everything before the "Fine-tune and run predictions on a pre-trained Bert~~~" section)

https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb#scrollTo=xtgLSuh8IdGP


5. Add the code below to the model_predict function

6. Use the list of target probabilities (probs) to extract the index with the highest probability, then create the output file in the submission format
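Step 6 can be sketched roughly as follows. This is a minimal sketch, assuming `probs` is a list with one list of 34 probabilities per example; the `id` and `label` column names and the CSV layout are hypothetical, since the actual submission format depends on the task.

```python
import csv


def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def write_submission(probs, path, ids=None):
    """Write one predicted label index per example and return the predictions."""
    preds = [argmax(p) for p in probs]
    ids = list(ids) if ids is not None else list(range(len(preds)))
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])  # hypothetical column names
        writer.writerows(zip(ids, preds))
    return preds
```

Because the labels in this post are the strings "0" through "33", the argmax index is also the predicted label value.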

지수의 네버엔딩 컴공부즈