Cox model and decision trees: an application to breast cancer data

ABSTRACT Objective. To evaluate, using semiparametric methodologies of survival analysis, the relationship between covariates and time to death of patients with breast cancer, as well as the determination discriminatory power in the conditional inference tree of patients who had cancer. Methods. A retrospective cohort study was conducted using data collected from medical records of women who had breast cancer and underwent treatment between 2005 and 2015 at the Hospital da Fundação de Assistencial da Paraíba in Campina Grande, State of Paraiba, Brazil. Survival curves were estimated using the Kaplan–Meier method, Cox regression, and conditional decision tree. Results. Women with triple-negative molecular subtypes had a shorter survival time compared to women with positive hormone receptors. The addition of hormone therapy reduced the risk of a patient dying by 5.5%, and the risk of a HER2-positive patient dying was 34.5% lower compared to those who were negative for this gene. Patients undergoing hormone therapy had a median survival time of 4 753 days. Conclusions. This paper shows a favorable scenario for the use of immunotherapy for patients with HER2 overexpression. Further studies could assess the effectiveness of immunotherapy in patients with other conditions, to favor the prognosis and better quality of life for the patient.

Breast cancer is considered one of the main factors influencing mortality in the female population worldwide (1). Studies show an increased incidence of breast cancer in developing countries, due, among other factors, to adoption of an unhealthy lifestyle and increased life expectancy (2).
From 2020 to 2022, the most prominent cancer institute in Brazil, the National Cancer Institute (Instituto Nacional de Câncer-INCA) estimated an occurrence of 625 000 new cases of breast cancer nationwide (3). In this regard, the Northeast region presented a considerable increase in the incidence rate, from approximately 27 new cases per 100 000 women in 2005 to approximately 64 per 100 000 in 2018 (4,5).
Historically, it is known that the mortality rate due to breast cancer is higher in less-developed regions (6). According to the Atlas of Mortality from Breast Cancer, prepared by INCA, while in 2005 the State of Paraiba had 156 deaths from malignant breast cancer, in 2015 this number had risen by almost 60%, reaching 248 (7).
Regarding data analysis on breast cancer, several models have been frequently proposed as alternatives to explain This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 IGO License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited. No modifications or commercial use of this article are permitted. In any reproduction of this article there should not be any suggestion that PAHO or this article endorse any specific organization or products. The use of the PAHO logo is not permitted. This notice should be preserved along with the article's original URL. Open access logo and text by PLoS, under the Creative Commons Attribution-Share Alike 3.0 Unported license. relationships among the variables. The use of survival analysis models aims to describe the probability of survival of individuals under specific conditions (type of treatment, age) after the breast cancer diagnosis. Survival analysis is an area of statistics that aims to analyze the time until the occurrence of a particular event of interest, which is defined as failure or outcome (8). A peculiarity is the possibility of censoring presence, which is the partial observation of the response of interest when the individual does not suffer the event during the study period. It is precisely in the censored observations that survival analysis differs from other analyses, such as logistic regression (9,10).
Concerning the adjustment of survival models, it is known that the use of covariables affects the lifespan of individuals, giving rise to the need to use regression analysis as a way to include this additional information (11). In survival analysis, it is possible to collect variables that represent the existing variability in the population, such as age and sex, among others, for use in regressive models. In these cases, two approaches can be initially adopted: parametric models and semiparametric models (12)(13)(14).
Recursive partitioning for a continuous response, censored, ordered, nominal, and multivariate variables in a conditional inference structure are available in the R party package and partykit (15), which is free of charge and available from https:// cran.r-project.org. The methodology in the party package uses conditional inference as a binary and recursive partitioning method in subsets, and partykit consists of its improved implementation, providing the same approach based on new infrastructure (16). Predictions can be calculated using the partykit package, which returns predicted means, predicted classes, or predicted mean survival times, and more information about the conditional distribution of the response variable, that is, predicted class probabilities or Kaplan-Meier curves, being a viable alternative to Cox modeling (15), described by Breiman et al. (17). According to Xiaogang and Chih-Ling (18), the tree-augmented Cox model assesses and remodels the inadequacies of the classical Cox model; it also adds new understandings of cancer death that were not exposed by the previous Cox regression analysis.
This study focuses on the analysis, through analytical methods, of which variables interfere in the increase in mortality from breast cancer in the region of Campina Grande, located in the State of Paraiba, Brazil. Additionally, the factors that favor the occurrence of censoring favored the process of remission. This fact is justified to aid decision-making for public policies, aiming to reduce the negative impact of the disease.
Thus, the objective of this study was to evaluate the relationship between covariates and time to death of patients with breast cancer, as well as the determination discriminatory power in the conditional inference tree of patients who had cancer.

MATERIALS AND METHODS
This was a retrospective cohort study, with data collected from the medical records of women who had breast cancer.

Sampling
In the simple sample random method, the premise is that each component of the population studied has the same chance of being chosen to compose the sample. The technique that guarantees this equal probability is the random selection of individuals; for example, by drawing lots (19). The equation below shows the calculation for the sample.
. . . In the equation, n is the number of individuals in the sample, N is the estimated population size, p is the population proportion of individuals who belong to the category of interest to be studied, q is the population proportion of individuals who do not belong to the category that the study is interested in, Z a/2 is the value of the significance level α, and E is the margin of error or the maximum estimation error. According to Yu et al. (20), when the population parameters p and q are not known, the sample parameters, p and q are not known either, and the author recommends replacing them by 0.5. So, the formula for calculating the sample size is: (1)

Data collection
Data were obtained from the medical records of patients with breast cancer who underwent treatment at an oncology reference hospital in Campina Grande, State of Paraiba, Brazil, between 2005 and 2015. On average, approximately 200 women underwent breast cancer treatment per year in this hospital. Thus, taking into account the α level of significance of 5% and the margin of error of 7.5%, the minimum sample size should be 158 observations. In this study, 161 observations were used.
The retrospective study was carried out with authorization from the ethics committee (Certificado de Apresentação para Apreciação Ética-CAAE) of the Federal University of Campina, number 97198518.9.0000.5182. In the data collection, the medical records of patients who had breast cancer and underwent treatment at the hospital were anonymized in accordance with Brazil's New Data Law (nova lei de proteção de dados-LGPD). Information was collected directly from medical records, chosen randomly to obtain a probabilistic sample, and so it was not necessary to obtain free and informed consent of the participants. The hospital's approval was requested, as the research was carried out on secondary data (patient records).
The variables collected were: date of the first appointment; date of last appointment; date of patient's death; age; number of doses of hormone therapy, chemotherapy, and radiotherapy; type of tumor; and molecular markers: estrogen receptor, progesterone receptor, Ki67 protein, P53, and HER2.
Patients were categorized by time until death, which was calculated from the date the woman visited the hospital for the first time until the date of her death. The censoring time was the days between the date of admission to the hospital and the date of the last consultation; those patients who did not obtain the event of interest by the end of the study were considered as censored. Hence, the database can be divided into two groups: group 1, comprising those patients who were not censored, due over time to be constant over time. The survival function given the covariate vectors is given by: with the basic survival function defined as: Due to the nonparametric component, Cox (27) formalized the partial maximum likelihood method, eliminating the nonparametric component from the model.

Model selection
Akaike information criterion. The Akaike information criterion (AIC) seeks to adjust the most parsimonious model possible; that is, the model that involves the least probable parameters to be estimated and to explain the behavior of the variable as well as or even better than the response variable of the saturated model (28).
According to Moore (29), one of the best ways to evaluate statistical models is via AIC, which calculates the likelihood of the model being penalized by the number of parameters. The objective is to find the model such that the AIC value is as small as possible, with this value calculated as: where l(b) is the likelihood of the model and k is the number of parameters. For Kimura and Waki (28), the inclusion of variables in the model causes a decrease in the AIC value; however, at some point the criterion starts to increase, indicating that the inclusion of particular variables is unnecessary and will not contribute to parameter estimates. Variable selection. One of the most-used variable selection methods in survival analysis is stepwise (step-by-step selection), which is nothing more than a forward method adjustment (22,30). The procedure has advantages if there are numerous potential explanatory variables, but it is also criticized because it can exclude clinical experience and knowledge in the model-building process (31).
Decision tree. The conditional inference trees estimate a regression relationship by binary recursive partitioning in a conditional inference structure (32,33). The algorithm works in three steps. 1) It tests the value of the global hypothesis of independence between the input variables and the answer (which can also be multivariate), stopping the algorithm if it cannot reject the hypothesis. Otherwise, select the input variable with the strongest association with the answer. The p-value measures this association corresponding to a test for the partial null hypothesis of a single input variable and the answer. 2) It implements a binary division on the selected input variable. 3) It repeats steps 1 and 2 several times. The implementation uses a unified framework for conditional inference or permutation tests. The stopping criterion in step 1 is based on the p-values adjusted by the multiplicity of the univariate p-values (testtype = "Univariate") of the partykit package (16) of the R Core Team software (34). This statistical approach ensures that the right-sized tree is grown without additional (post)pruning or to remission or discontinuation of treatment; and group 2, comprising patients who died due to breast cancer.

Survival analysis
Survival analysis is a branch of statistics that analyzes times until the occurrence of a given event: time that an individual survives a given treatment; time until the development of a disease; or simply time until death (12,14). According to Yu et al. (21), the response variable corresponds to the time until the occurrence of an event of interest. Survival data sets are characterized by failure times, whose important characteristic is the presence of censoring, which represents the partial observation of the response (22,23).
The Kaplan-Meier estimator stands out for estimating survival curves (23). The Kaplan-Meier estimator allows for testing hypotheses that do not require assumptions about the distribution of data (24). It analyzes data measured only on an ordinal scale, which can occur for categorized data measured on a nominal scale and allow the estimation of the survival function incorporating censoring (25,26). The survival function estimated by Kaplan-Meier considers the occurrence of distinct failures in time intervals, where the survival times are ordered; that is, t 1 ≤ t 2 ≤ t 3 ≤...≤ t k , and more than one failure may occur at the same time: i. t 1 ≤ t 2 ≤ t 3 ≤...≤ t k , distinct and ordered times of failures; ii. d j , number of failures up to time t j , j=1,2,...,k; iii. n j , the number of items at risk; that is, individuals did not fail and were not censored until t j .

Cox model
The model presented by Cox (27) is the most used in clinical studies due to its versatility. The survival time does not need to follow a probability distribution and the structure of this model has a nonparametric and a parametric component, justifying its denomination as a semiparametric model (21,22), and it is given by: where g is a non-negative function specified such that g(0)=1. The term l 0 (t) is a non-negative function of time, representing the nonparametric component of the model, which is not specified. This component is usually called the base or basal function. The parametric component is often expressed by: where b is the vector of parameters associated with p covariates. The Cox model has the proportional hazards assumption, which characterizes the failure rate of two different individuals cross-validation. The statistical analyses and graphics were performed using R/RStudio software version 4.1.1 (33).

RESULTS
The results show that the mean age of the patients who were censored was approximately 61 years. The average service time until censoring was around 1 120 days. Half of the patients undergoing treatment received more than 52 doses of hormone therapy. The average number of chemotherapy sessions was approximately six sessions per patient, and about half of the patients had more than 27 radiotherapy sessions (Table 1). Table 2 shows that the mean age of patients who died was approximately 58 years old. Regarding the length of care, the average number of days between the first consultation and death was 1 250 days, with half of the patients dying within 888 days after being admitted to the hospital. Half of the patients received more than five doses of hormone therapy and underwent more than 25 radiotherapy sessions. The average number of chemotherapies was approximately 19 sessions per patient. Figure 1 shows the histogram of time to censoring and time to death. Thus, it is possible to notice that patients were censored more frequently in the first 500 days, while the frequency of patients who died was higher in 2 000 days.
To build the Cox model, the variables used were: location; age; number of doses of hormone therapy, radiotherapy, and chemotherapy; and estrogen and progesterone receptors, HER2, Ki-67, and p53. For the model containing all covariates, the AIC was 110.03. Only the variables number of hormone treatments and HER2 gene positive were significant in this model. Therefore, the stepwise regression variable selection criterion was used to obtain a more consistent model. This new model had an AIC equal to 98.63, which was relatively lower than the initial model, and all covariates were significant. We verified the proportional hazards assumption for a Cox regression model fit (coxph) using the cox.zph function in R. We found that   the variable number of radiotherapy treatments violated the assumption of proportionality. Thus, the stratified Cox model was used, a contiguous fact that the proportionality test of this new model was not significant (p = 0.09) and presented an AIC (35.75) lower than the models previously evaluated. Figure 2 shows the survival curves for variables on tumor location and molecular markers. The results presented in Table 3 highlight that with each addition of a hormone therapy unit, the patient's risk of death decreased by 5.5%, and the risk of death for a patient with positive HER2 was 34.5% lower than those patients who were negative for this gene.
According to Figure 3, the variables number of hormone therapy units and number of radiotherapy presented a high discriminatory power in the conditional inference tree; they also are responsible for the division of nodes, generating branches that are nodes (branches 3, 4, 5). Thus, those patients who had more than 46 hormone therapies performed during treatment have a better cure prognosis (node 5), with a median time of 4 753 days, and probably the high incidence of censoring (about 70%) is due to loss to follow-up because of prolonged treatment.  On the other hand, those patients who underwent fewer than 46 hormone therapies and had fewer than five radiotherapies performed had the worst prognosis, with a median lifespan of 490 days. Those who underwent fewer than 46 hormone therapies and had more than five radiotherapies performed had an intermediate prognosis, with a median lifespan of 1 446 days.

DISCUSSION
The number of chemotherapy sessions performed in women in the censored group was lower than in women who died. Although chemotherapy is one of the efficient ways to treat breast cancer, some studies correlate the number of chemotherapy sessions with the severity of the disease. These results are similar to those found in a study carried out in Canada (35) with 993 patients and evaluating the symptom scores of patients undergoing breast cancer treatments, suggesting that women with more than three-year survival need more aggressive treatment, developing a greater burden of symptoms than those who died in under three years. The log-rank test showed that the estrogen receptor (p < 0.001), progesterone receptor (p = 0.003), and HER2 (p = 0.003) variables showed a significant difference concerning the categories; and women classified as triple-negative-for estrogen receptor, progesterone receptor, and HER2-had a shorter survival time compared to women who were positive for these characteristics. This is because women with triple-negative subtype do not benefit from hormonal therapies or immunotherapies, but instead depend exclusively on surgery, more aggressive chemotherapy, and radiotherapy, and so they have a worse prognosis and shorter survival (36).
Those patients with HER2 overexpression presented the worst prognosis, as it is an accelerated tumor growth factor. Nonetheless, there is a specific immunotherapy for these women, which is recommended for HER2 factor/neu inhibition and is also used in patients with metastases (37). In this regard, the incorporation report prepared by the Ministry of Health of Brazil (38) suggests that immunotherapy has an advantage in the treatment of patients with melanoma at any stage when compared to target therapies. Notwithstanding, its cost remains high, preventing its implementation for the treatment of patients with advanced, non-surgical, and metastatic melanoma (38).
The results indicated that the longer the woman's survival, there is possibly a worsening of the quality of life, related to the therapies offered. As the disease progresses, treatment goals can be modified to focus on the comfort of the patient, such as providing palliative care to ensure a better quality of life (39).
The survival models of decision tree analysis offer many advantages over Cox regression, such as explicit maximization of predictive accuracy, parsimony, statistical robustness, and transparency (40). Therefore, researchers interested in rigorous predictions and clear decision rules should consider developing models using the survival framework of classification trees (40). Weathers (41) applied three techniques to five publicly available datasets and compared their fits using prediction error curves and the concordance index. The author identified "types of data" in which random survival forests and conditional inference trees (ctrees) may be expected to surpass the Cox model.
The limitations of this study include the lack of a date of diagnosis, which results in a gap of days between diagnosis and admission to hospital. In addition, the histological characteristics of the tumor were not taken into account, and these may have accelerated death in some women, as they indicate tumors that are more aggressive or do not respond to the chemotherapy applied. This status would be valuable in building on the results of this research.

Conclusion
We concluded that women with triple-negative molecular subtypes for breast cancer have a shorter survival time, correlated with others with positive hormone receptors.
The study presents a favorable scenario for the use of immunotherapy as a therapy for patients with HER2 overexpression. Due to its high cost, in Brazil's Unified Health System (Sistema Único de Saúde-SUS) this therapy is only used in patients with HER2 or metastasis. Thus, further studies could assess the effectiveness of immunotherapy in patients with other conditions, as well as the cost-effectiveness for the SUS of implementing this treatment on a larger scale, to favor the prognosis and better quality of life for the patient.
Author contributions. LCP was responsible for the investigation, methodology, and writing the original draft; SJS was responsible for the data collection, methodology, and writing the original draft; CRF was responsible for the methodology, validation, and writing the original draft; ALB was responsible for the methodology and statistics analysis for the original draft; SFAXJ was responsible for the formal analysis, software, investigation, data management, and writing the original draft; LSSA was responsible for the health conceptualization, methodology, and writing the original draft; MECO was responsible for the health conceptualization, methodology, investigation, and writing the original draft; TAO was responsible for the writing, review, editing, and supervision. All authors reviewed and approved the final version.
Conclusiones. Este estudio muestra un escenario favorable para el uso de la inmunoterapia en las pacientes con sobreexpresión del HER2. En futuros estudios se podría evaluar la eficacia de la inmunoterapia en pacientes con otras enfermedades, con el fin de favorecer el pronóstico y mejorar la calidad de vida de la paciente.