About 懂中文 Dong Chinese
Developed by Peter Olson.
Blog: 东东's notes
Contact: feedback@dong-chinese.com
Privacy policy | Terms of service
Where do the sentences come from?
Dong Chinese uses a database of 705,493 sentences. The sentences come from several different sources:
Tatoeba (17,355 sentences)
Available under Creative Commons Attribution 2.0 license (CC-BY 2.0)
UM-Corpus (29,446 sentences)
Liang Tian, Derek F. Wong, Lidia S. Chao, Paulo Quaresma, Francisco Oliveira, Shuo Li, Yiming Wang, Yi Lu, "UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation". Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, 2014.
Education (13,080 sentences)
Microblog (156 sentences)
News (6,055 sentences)
Science (1,341 sentences)
Spoken (8,054 sentences)
Subtitles (760 sentences)
AI Challenger caption dataset (210,000 images with 565,231 captions)
Wu, Jiahong, et al. "AI Challenger: A large-scale dataset for going deeper in image understanding." arXiv preprint arXiv:1711.06475 (2017).
AI Challenger translation dataset (91,220 sentences)
Programmatically generated small-vocabulary sentences (2,241 sentences)
How is the percentage of movies and books I understand estimated?
Dong Chinese uses the following data:
Character frequency in movie subtitles
Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles. PLoS ONE, 5(6), e10729.
Character frequency in written texts
Da, Jun. 2004. Chinese text computing.
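This page does not spell out the exact formula, so the following is only a minimal sketch of one common way to turn character frequency data into an "understood" percentage: add up the corpus counts of the characters a learner knows and divide by the total count of all characters. The function name `estimated_coverage` and the toy counts are assumptions made for illustration; they are not taken from Dong Chinese, SUBTLEX-CH, or Jun Da's frequency list.

```python
from typing import Dict, Iterable


def estimated_coverage(char_counts: Dict[str, int], known_chars: Iterable[str]) -> float:
    """Return the fraction of running characters in a corpus that the learner knows.

    char_counts maps each character to how often it occurs in the corpus
    (for example, a subtitle corpus or a written-text corpus).
    """
    total = sum(char_counts.values())
    if total == 0:
        return 0.0
    known = set(known_chars)
    covered = sum(count for char, count in char_counts.items() if char in known)
    return covered / total


# Toy counts invented for illustration (hypothetical, not real corpus data).
toy_counts = {"的": 50, "我": 30, "是": 15, "龍": 5}
coverage = estimated_coverage(toy_counts, ["的", "我"])
print(f"{coverage:.0%}")  # prints "80%"
```

Under this assumption, running the same calculation once against subtitle counts and once against written-text counts would give separate estimates for movies and for books.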
What technologies are used?
Dong Chinese was built with the help of the following libraries, frameworks, and services:
The following open-source libraries were created while developing Dong Chinese:
Miscellaneous attributions
AllSet Learning Pronunciation Wiki for one-syllable audio recordings (CC BY-NC-SA 3.0)