Introduction

Pre-trained word representations are a key component in many neural language understanding models.
Our representations differ from traditional word type embeddings in that each token is assigned a representation that is a function of the entire input sentence.
We use vectors derived from a bidirectional LSTM that is trained with a coupled language model (LM) objective on a large text corpus.
Across many experiments, adding ELMo representations substantially improves task performance, with relative error reductions of up to 20%.

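Using pretrained word vectors has become standard practice, but assigning a single vector to each word makes the representation context-independent.
To enrich word embeddings, methods have appeared that use subword information or, for polysemous words, learn a separate vector for each sense.
Other work learns context-dependent representations directly: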
- context2vec
- CoVe
Prior work has shown that different layers of a biRNN encode different types of information, and a similar effect appears in this work.

Model

ELMo
- ELMo word representations are functions of the entire input sentence.
- They are computed on top of two-layer biLMs with character convolutions, as a linear function of the internal network states.
- This setup allows us to do semi-supervised learning, where the biLM is pre-trained at a large scale and easily incorporated into a wide range of existing neural NLP architectures.
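The "linear function of the internal network states" can be sketched for a single token as a weighted sum over layers. This is a minimal pure-Python illustration, assuming the per-layer weights are softmax-normalized and a single scalar rescales the result; the names `elmo_vector`, `raw_weights`, and `gamma` are hypothetical, and the toy states stand in for real biLM outputs.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_vector(layer_states, raw_weights, gamma=1.0):
    """Collapse the per-layer states of one token into a single vector.

    layer_states: one vector per layer (token layer plus each biLSTM
                  layer), all with the same dimension.
    raw_weights:  one unnormalized scalar per layer (assumed name).
    gamma:        scalar that rescales the whole vector (assumed name).
    """
    if len(layer_states) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    s = softmax(raw_weights)
    dim = len(layer_states[0])
    return [
        gamma * sum(s[j] * layer_states[j][d] for j in range(len(s)))
        for d in range(dim)
    ]


# Toy input: 3 layers (character-CNN token layer + 2 biLSTM layers), dimension 4.
states = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 2.0],
]
# Equal raw weights give each layer the same share of the mix.
mixed = elmo_vector(states, raw_weights=[0.0, 0.0, 0.0], gamma=2.0)
```

Because the mix is linear in the layer states, a downstream model can put more weight on whichever layer carries the information its task needs.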

Using biLMs for supervised NLP tasks
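- ELMo representations are used together with the existing embedding vectors.
- The weights of the pretrained language model used to produce the ELMo representations are frozen, while the per-layer weights and the scalar parameter are learned during training.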
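The two points above can be sketched as follows: the biLM states are treated as constants, only the layer-mixing scalars are updated, and the mixed vector is concatenated with an existing embedding. This is a toy illustration under stated assumptions, not the authors' training code: the squared-error loss stands in for a real task loss, the gradient is derived by hand for that loss, and all function and variable names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix(frozen_states, raw_weights, gamma):
    """Weighted sum of the frozen layer states, rescaled by gamma."""
    s = softmax(raw_weights)
    dim = len(frozen_states[0])
    return [
        gamma * sum(s[j] * h[d] for j, h in enumerate(frozen_states))
        for d in range(dim)
    ]


def task_input(token_embedding, frozen_states, raw_weights, gamma):
    """Concatenate the existing embedding with the mixed biLM vector."""
    return token_embedding + mix(frozen_states, raw_weights, gamma)


def loss(frozen_states, raw_weights, gamma, target):
    """Toy stand-in for a task loss: 0.5 * squared distance to target."""
    y = mix(frozen_states, raw_weights, gamma)
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y, target))


def sgd_step(frozen_states, raw_weights, gamma, target, lr=0.1):
    """One gradient step on the toy loss.

    Only raw_weights and gamma are updated; frozen_states is read
    but never modified, mirroring a frozen pretrained biLM.
    """
    s = softmax(raw_weights)
    dim = len(frozen_states[0])
    mean = [
        sum(s[j] * h[d] for j, h in enumerate(frozen_states))
        for d in range(dim)
    ]
    err = [gamma * mean[d] - target[d] for d in range(dim)]
    grad_gamma = sum(err[d] * mean[d] for d in range(dim))
    # d(loss)/d(w_i) = gamma * s_i * sum_d err_d * (h_i[d] - mean[d])
    grad_w = [
        gamma * s[i] * sum(err[d] * (h[d] - mean[d]) for d in range(dim))
        for i, h in enumerate(frozen_states)
    ]
    new_weights = [w - lr * g for w, g in zip(raw_weights, grad_w)]
    return new_weights, gamma - lr * grad_gamma


# Toy setup: 3 frozen layer states of dimension 2; the target equals layer 1.
frozen = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [0.0, 1.0]
weights, gamma = [0.0, 0.0, 0.0], 1.0

loss_before = loss(frozen, weights, gamma, target)
for _ in range(50):
    weights, gamma = sgd_step(frozen, weights, gamma, target)
loss_after = loss(frozen, weights, gamma, target)

# Existing 3-dimensional embedding followed by the 2-dimensional mixed vector.
features = task_input([0.5, -0.5, 0.25], frozen, weights, gamma)
```

Keeping the biLM fixed means only a handful of scalars per task are added on top of the downstream model's own parameters.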

Conclusion

We have introduced a general approach for learning high-quality deep context-dependent representations from biLMs, and shown large improvements when applying ELMo to a broad range of NLP tasks.
Through ablations and other controlled experiments, we have also confirmed that the biLM layers efficiently encode different types of syntactic and semantic information about words in-context, and that using all layers improves overall task performance.