2017 HCCLatSemEval2017Task2Combining
- (He et al., 2017) ⇒ Junqing He, Long Wu, Xuemin Zhao, and Yonghong Yan. (2017). “HCCL at SemEval-2017 Task 2: Combining Multilingual Word Embeddings and Transliteration Model for Semantic Similarity.” In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval@ACL 2017).
Subject Headings: HCCL; SemEval-2017; SemEval-2017 Task 2; Semantic Word Similarity System; Semantic Word Similarity Benchmark Task; Multilingual And Cross-Lingual Semantic Word Similarity System; Machine Translation.
Notes
Cited By
- Google Scholar: ~ 2 Citations
Quotes
Abstract
In this paper, we introduce an approach that combines word embeddings and machine translation for multilingual semantic word similarity, Task 2 of SemEval-2017. Thanks to the unsupervised transliteration model, our cross-lingual word embeddings encounter far fewer out-of-vocabulary (OOV) words. Our results are produced using only monolingual Wikipedia corpora and a limited amount of sentence-aligned data. Although relatively few resources are utilized, our system ranked 3rd in the monolingual subtask and could rank 6th in the cross-lingual subtask.
1. Introduction
...
In this task, we adopt different strategies for the two subtasks. We use word2vec for Subtask 1, monolingual word similarity. For Subtask 2, cross-lingual word similarity, we use jointly optimized cross-lingual word representations in addition to a transliteration model. We build a cross-lingual word embedding system and a special machine translation system. Our approach has the following characteristics:
- Fast and efficient. Both word2vec and the cross-lingual word embedding tool have impressive speed (Coulmance et al., 2015) and do not need expensive annotated word-aligned data.
- Decreasing OOVs. Our translation system features a transliteration model that deals with OOVs outside the parallel corpus.
We constructed a naive system and, given the limited time, did not tune the parameters of the embedding and translation models.
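The transliteration idea above can be illustrated with a minimal sketch: when a word falls outside the parallel corpus, it is mapped character by character into the target script so a lookup still has a chance of succeeding. The character table and function below are illustrative assumptions, not the paper's learned (unsupervised) transliteration model.

```python
# Toy character-level transliteration table (illustrative only; the
# paper learns its transliteration model unsupervised from data).
TRANSLIT = {"к": "k", "о": "o", "т": "t"}

def transliterate(word):
    """Map each character through the table; pass unknown characters through."""
    return "".join(TRANSLIT.get(ch, ch) for ch in word)

print(transliterate("кот"))  # → "kot"
```

An OOV like "кот" thus becomes "kot", which can then be matched against the English vocabulary even though the word pair never appeared in the parallel data.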
2. Our Approach
We use skip-gram word embeddings directly for the monolingual subtask. For the cross-lingual subtask, we use English as the pivot language and train multilingual word embeddings using monolingual corpora and sentence-aligned parallel data. A translation model is also trained with our statistical machine translation system. Subsequently, we translate the words in the test set into English and look up their word embeddings. For words missing from the English word embeddings, we fall back to the word embeddings of the original language.
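The translate-then-look-up pipeline with a fallback can be sketched as follows. All embedding vectors, the translation table, and the function names here are illustrative assumptions, not the paper's actual data or code; the fallback only makes sense because the paper's source- and English-language embeddings are jointly optimized in one shared space.

```python
import math

# Toy shared-space embeddings and translation table (illustrative values).
en_emb = {"dog": [1.0, 0.0], "cat": [0.9, 0.1]}
es_emb = {"perro": [1.0, 0.1], "gato": [0.8, 0.2]}
translate = {"perro": "dog"}  # "gato" is deliberately left untranslated (an OOV)

def lookup(word, src_emb):
    """Translate into English and use the English embedding;
    fall back to the source-language embedding for OOVs."""
    en_word = translate.get(word)
    if en_word in en_emb:
        return en_emb[en_word]
    return src_emb.get(word)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# "perro" is translated and scored via the English embedding;
# "gato" falls back to its source-language vector in the shared space.
sim = cosine(lookup("perro", es_emb), lookup("gato", es_emb))
print(sim)
```

Because both lookups return vectors in the same shared space, the cosine similarity is meaningful whether or not the translation step succeeded.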
...
3. Experiments
4. Results
5. Conclusion
Acknowledgments
Footnotes
References
BibTeX
@inproceedings{2017_HCCLatSemEval2017Task2Combining,
  author    = {Junqing He and Long Wu and Xuemin Zhao and Yonghong Yan},
  editor    = {Steven Bethard and Marine Carpuat and Marianna Apidianaki and Saif M. Mohammad and Daniel M. Cer and David Jurgens},
  title     = {HCCL at SemEval-2017 Task 2: Combining Multilingual Word Embeddings and Transliteration Model for Semantic Similarity},
  booktitle = {Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval@ACL 2017)},
  pages     = {220--225},
  publisher = {Association for Computational Linguistics},
  year      = {2017},
  url       = {https://doi.org/10.18653/v1/S17-2033},
  doi       = {10.18653/v1/S17-2033},
}
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017 HCCLatSemEval2017Task2Combining | Junqing He; Long Wu; Xuemin Zhao; Yonghong Yan | | | HCCL at SemEval-2017 Task 2: Combining Multilingual Word Embeddings and Transliteration Model for Semantic Similarity | | | | | | 2017 |