Background: The centralized National Health Insurance (NHI) agency covers over 99% of the people in Taiwan, which makes its National Health Insurance Research Database (NHIRD) a rich data source with which to predict hepatocellular carcinoma (HCC) risk, as well as discover additional risk factors for HCC.

Methods: [Data Sources] We obtained from NHIRD a randomly sampled dataset of two million people, consisting of clinic visits from 1999 to 2013. The mean number of clinic visits per person in 2010 was 12.2 in Taiwan, compared with 4.0 in the USA. We selected two subsets to build our predictive models: (1) 10,000 patients with HCC and 40,000 age- and gender-matched patients without HCC; (2) 10,000 patients with HCC and 40,000 randomly selected patients without HCC. [Machine Learning] We used an artificial neural network (ANN) with one hidden layer to predict HCC risk and used 5-fold cross-validation to guard against overfitting. To discover the important features, we used both Random Forest and stepwise selection by ANN. [Data processing] ICD-9 codes were used to represent clinical diagnoses. Each time a patient was diagnosed with a given ICD-9 code on a given day, “1” was assigned to the corresponding entry of the ICD9-Day matrix. For each person, we summed the assigned values of each ICD-9 code over M years, then normalized each sum to the range 0 to 1 according to the sum of that ICD-9 code over all patients; the normalized values were the input to the ANN. To train on M years of data and predict N years ahead, we required M years of data ending N years before the index day.
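The data-processing step described above can be sketched as follows. This is a minimal illustration rather than the study's actual code: the function names, the (patient, date, code) record layout, the approximate year length, and the toy visits are all assumptions. In particular, the normalization shown here divides each per-patient count by the total count of that code over all patients, which is only one reading of the description.

```python
from collections import defaultdict
from datetime import date, timedelta

DAYS_PER_YEAR = 365.25  # assumption: approximate year length


def observation_window(index_day, m_years, n_years):
    """Return [start, end): M years of history ending N years before index_day."""
    end = index_day - timedelta(days=round(n_years * DAYS_PER_YEAR))
    start = end - timedelta(days=round(m_years * DAYS_PER_YEAR))
    return start, end


def build_features(visits, index_days, m_years, n_years):
    """Build normalized per-patient ICD-9 frequency features.

    visits: iterable of (patient_id, visit_date, icd9_code) records.
    index_days: dict mapping patient_id to that patient's index date.
    Returns a dict: patient_id -> {icd9_code: value in [0, 1]}.
    """
    counts = {pid: defaultdict(int) for pid in index_days}
    seen = set()
    for pid, day, code in visits:
        if pid not in index_days:
            continue
        start, end = observation_window(index_days[pid], m_years, n_years)
        if not (start <= day < end):
            continue  # outside the M-year window (e.g. inside the lead time)
        key = (pid, day, code)
        if key in seen:
            continue  # a code on a given day is a single "1" in the matrix
        seen.add(key)
        counts[pid][code] += 1

    # Total count of each code over all patients, used as the normalizer
    # (assumed interpretation of the normalization step).
    totals = defaultdict(int)
    for per_patient in counts.values():
        for code, c in per_patient.items():
            totals[code] += c

    return {
        pid: {code: c / totals[code] for code, c in per_patient.items()}
        for pid, per_patient in counts.items()
    }


# Toy records, invented for illustration only.
# ICD-9 571: chronic liver disease and cirrhosis; 070: viral hepatitis.
visits = [
    ("p1", date(2008, 3, 1), "571"),
    ("p1", date(2008, 9, 1), "571"),
    ("p1", date(2008, 9, 1), "571"),  # same code, same day: counted once
    ("p1", date(2010, 6, 1), "571"),  # within the lead time: excluded
    ("p2", date(2009, 1, 15), "571"),
    ("p2", date(2009, 2, 1), "070"),
]
index_days = {"p1": date(2011, 1, 1), "p2": date(2011, 1, 1)}

features = build_features(visits, index_days, m_years=3, n_years=1)
for pid, vec in features.items():
    print(pid, dict(vec))
```

With M = 3 and N = 1, only visits from the three years that end one year before each index day contribute, so nothing recorded during the lead time reaches the model.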

Results: [Prediction] We used three years of training data with a lead time of 0.5 to 3 years before the HCC index date. In the non-matched dataset, the area under the ROC curve (AUROC) was 0.87 at a lead time of 0.5 years and 0.85 at 3 years. In the age- and gender-matched dataset, the AUROC was 0.75 at 0.5 years and 0.72 at 3 years. [Important Factors] The important factors for HCC selected by Random Forest and by stepwise ANN were similar. The top five factors were: 1. Chronic liver disease, 2. Age, 3. Screening for malignant neoplasms, 4. Gender, 5. Viral hepatitis. Among these, “Screening for malignant neoplasms” was negatively correlated with HCC because visits by HCC patients were no longer counted after the diagnosis, while non-HCC patients might have been screened many times and continued to be counted.
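For readers unfamiliar with the AUROC values reported above, the metric can be computed as the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case, with ties counted as one half. The sketch below illustrates that definition only; the function name and the toy labels and scores are invented and are not the study's data or evaluation code.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: sequence of 0/1 outcomes (1 = case, 0 = control).
    scores: predicted risks, one per label; higher means higher risk.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Toy example, invented for illustration only.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.1]
toy_auc = auroc(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

In the toy example, five of the six case-control pairs are ranked correctly, so the value is 5/6. The pairwise loop is quadratic in the number of patients, so a rank-based implementation would be preferable for datasets of the size described here.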

Conclusion: Using only insurance claims data, this prediction model can alert high-risk patients to seek further examination or change their lifestyles. Compared with previous studies, e.g. QCancer 10-Year Risk, this study used not only dichotomous data but also the frequency of each diagnosis. Diagnosis history can also be used to extract HCC risk factors for further investigation. Finally, this methodology can be applied to predict other cancers without much recoding.



Author: Chia-wei Liang

Status: Completed Work