懂中文 Dong Chinese - Learn Mandarin Chinese

Developed by Peter Olson.

Dong Chinese uses a database of 705,493 sentences. The sentences come from several different sources:

Tatoeba (17,355 sentences)
Available under the Creative Commons Attribution 2.0 license (CC-BY 2.0)
UM-Corpus (29,446 sentences)
Liang Tian, Derek F. Wong, Lidia S. Chao, Paulo Quaresma, Francisco Oliveira, Shuo Li, Yiming Wang, Yi Lu, "UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation". Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, 2014.
- Education (13,080 sentences)
- Microblog (156 sentences)
- News (6,055 sentences)
- Science (1,341 sentences)
- Spoken (8,054 sentences)
- Subtitles (760 sentences)
AI Challenger caption dataset (210,000 images with 565,231 captions)
Wu, Jiahong, et al. "AI Challenger: A large-scale dataset for going deeper in image understanding." arXiv preprint arXiv:1711.06475 (2017).
AI Challenger translation dataset (91,220 sentences)
Programmatically generated small-vocabulary sentences (2,241 sentences)

Dong Chinese uses the following data:

Character frequency in movie subtitles
Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles. PLoS ONE, 5(6), e10729.
Character frequency in written texts
Da, Jun. 2004. Chinese text computing.

Dong Chinese was built with the help of the following libraries, frameworks, and services:

The following open-source libraries were created while developing Dong Chinese:

Miscellaneous attributions