The Foreign Service Institute’s School of Language Studies (SLS) classifies Chinese as one of its five Category IV (super-hard) languages, stating that a typical native English speaker will need at least 88 weeks and 2,200 class hours to become fluent.

Mandarin’s tonal nature underlines its difficulty: pronouncing the same syllable with a different tone can completely change its meaning. Coupled with a complex writing system and a galaxy of unique characters that form words only in specific combinations, the language is now extending its challenge to AI.

Present Technical Challenges

Word segmentation is a prerequisite for analyzing electronic medical records (EMRs) and the first hurdle that the Chinese language poses. Common tools such as NLPIR, developed by the Chinese Academy of Sciences, and THULAC, developed by Tsinghua University, do not cover many clinical terms. A drug name like Pseudoephedrine (i.e., qu-jia-wei-ma-huang-jian) will be read as three separate characters and a word: “Qu”, “Jia”, “Wei”, “MaHuangJian”, instead of as a single drug name.
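To see why a missing dictionary entry splits a drug name, here is a minimal sketch of forward maximum matching, one of the simplest dictionary-based segmentation methods. It is not how NLPIR or THULAC work internally. The two vocabularies are hypothetical, and the characters 去甲伪麻黄碱 are assumed to be the written form of the pinyin qu-jia-wei-ma-huang-jian quoted above.

```python
def forward_max_match(text, vocab, max_len=6):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: a general one that only knows "麻黄碱"
# (ma-huang-jian), and a clinical one that also lists the full drug name.
general_vocab = {"麻黄碱"}
drug = "去甲伪麻黄碱"  # assumed characters for qu-jia-wei-ma-huang-jian
clinical_vocab = general_vocab | {drug}

print(forward_max_match(drug, general_vocab))   # ['去', '甲', '伪', '麻黄碱']
print(forward_max_match(drug, clinical_vocab))  # ['去甲伪麻黄碱']
```

With the general vocabulary the output reproduces the “Qu”, “Jia”, “Wei”, “MaHuangJian” split; adding the clinical term to the dictionary keeps the drug name whole.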

Like English-speaking physicians, Chinese-speaking physicians are prone to using short forms. For example, some physicians may write upper respiratory tract infection (i.e., shang-hu-xi-dao-gan-ran) as upper infection (i.e., shanggan). This tends to confuse Named Entity Recognition (NER) platforms, which are then unable to single out the particular medical event.
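One common workaround, sketched here under simplifying assumptions, is to expand known short forms before the text reaches an NER model. The lookup table and the function name are hypothetical; a real system would need a curated clinical abbreviation list and context checks, since plain string replacement can misfire when a short form happens to appear inside an unrelated phrase.

```python
# Hypothetical abbreviation table: short form -> full term.
# "上感" (shanggan) -> "上呼吸道感染" (shang-hu-xi-dao-gan-ran).
ABBREVIATIONS = {"上感": "上呼吸道感染"}


def expand_abbreviations(note, table=ABBREVIATIONS):
    """Replace each known short form with its full clinical term.
    Longer short forms are handled first so they are not partially replaced."""
    for short in sorted(table, key=len, reverse=True):
        note = note.replace(short, table[short])
    return note


# "Patient, upper infection, three days" becomes the full diagnosis.
print(expand_abbreviations("患者上感三天"))  # 患者上呼吸道感染三天
```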

This is further complicated by physicians’ sparse descriptions. Green, yellow, brownish, sticky, transparent, watery, and foam-like can all describe phlegm, often without the word itself being written down, so semantic feature analysis alone becomes inadequate or irrelevant. Normalized Google Distance (NGD) may ease the problem for now, as it estimates the relationship between a medical event and its related descriptions. Yet according to a group of researchers from Shanghai Jiao Tong University and AstraZeneca, information extraction from Chinese EMRs is still a relatively unexplored domain.
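As a rough illustration of how NGD scores the link between a medical event and a description, the sketch below implements the standard formula, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f counts the documents containing a term and N is the total number of documents. All counts in the example are invented, and the terms they stand for are only assumed for illustration.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from document counts.
    fx, fy: documents containing each term; fxy: documents containing both;
    n: total documents indexed. Smaller values mean a closer relationship."""
    if fxy == 0:
        return math.inf  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Invented counts in a corpus of one million records.
n = 10**6
often_together = ngd(1000, 800, 400, n)   # e.g. "phlegm" and "sticky"
rarely_together = ngd(1000, 800, 10, n)   # e.g. "phlegm" and an unrelated word

print(often_together < rarely_together)  # True
```

A description that co-occurs often with a medical event gets a smaller distance than one that rarely does, which is what lets NGD tie a bare adjective back to the event it most likely describes.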

Standardization of EMRs

China’s Ministry of Health published general guidelines for medical recording in 2010 to standardize the way digital health data is recorded. The guidelines were revised the following year, taking in recommendations from hospitals and IT companies such as Neusoft and DHC Software.

At the same time, a grading system modeled on the Electronic Medical Record Adoption Model, proposed by the US-based Healthcare Information and Management Systems Society (HIMSS), was introduced to evaluate eHealth standards across China.

According to HIMSS Analytics, a total of 31 hospitals in China had attained the highest levels, Grade 6 and Grade 7, in 2017. This means these institutions not only have mature in-house EMR platforms but are also sharing useful data with other establishments to improve the overall healthcare system.

In spite of this early success, moving to more structured data recording may cost physicians more time documenting medical records, time they could otherwise spend with patients. Besides, not all medical records are text-based (e.g., X-rays), and incorporating them into the system may pose a new set of technical demands.


Author Bio

Hazel Tang: A science writer with a data background and an interest in current affairs, culture, and the arts; a non-med from an (almost) all-med family.