Application of Regression Tree Method for Different Data from Animal Science
Yusuf Koc1, Ecevit Eyduran1* and Omer Akbulut2
1Department of Animal Science, Agricultural Faculty, Igdir University, Igdir, Turkey
2Department of Actuarial Science, Faculty of Science, Ataturk University, Erzurum, Turkey
The aim of this study was to evaluate the predictive performance of the CHAID, Exhaustive CHAID, and CART regression tree methods for different combinations of parent node: child node sizes in a data set from animal science. To this end, records of 1884 Mengali lambs were used to predict weaning weight from sex (male and female), birth type (single and twin), birth year (2005, 2006, 2007, 2008 and 2009), farm (Research Station, Mastung, Quetta, and Noshki), birth weight, dam age, and dam weight. To choose the best regression tree method, goodness of fit criteria, namely the coefficient of determination (R2%), adjusted coefficient of determination (Adj-R2%), coefficient of variation (%), SD ratio, relative approximation error (RAE), root mean square error (RMSE), and the Pearson correlation between actual and predicted weaning weights, were estimated for each combination. The CHAID algorithm constructed more biologically suitable tree structures than the Exhaustive CHAID and CART data mining algorithms. Consequently, it is recommended that the biological suitability of the constructed tree structure be taken into account together with the estimated model quality criteria.
Received 27 July 2016
Revised 10 August 2016
Accepted 09 September 2016
Available online 25 March 2017
The article is summarized from the first author’s (YK) MSc Thesis. EE and OA interpreted the data, drafted and carefully revised the manuscript. All authors read and approved the final manuscript.
CART, CHAID, Exhaustive CHAID, Regression tree, Data Mining, Weaning weight
* Corresponding author: firstname.lastname@example.org
0030-9923/2017/0002-0599 $ 9.00/0
Copyright 2017 Zoological Society of Pakistan
In animal breeding, it is important to examine the interrelationship between body morphological characteristics and yield characteristics such as meat, milk and egg. It is also essential to ascertain the effect of non-genetic factors on the yield characteristics examined and, in the scope of indirect selection, to demonstrate the causal relationship between economic yield characteristics and their related quantitative characteristics. Examples of such causal relationships are the prediction of body weight from body and testicular characteristics, the prediction of milk yield from udder traits, and the prediction of spermatological traits from testicular traits. The main objective in studying these relationships is to obtain offspring that are superior to the parent generation in yield traits.
In animal science, these causal relationships can be revealed by several statistical approaches: simple linear regression analysis, multiple linear regression analysis, use of factor analysis scores in multiple regression analysis, use of principal component analysis scores in multiple regression analysis, path analysis and regression tree analysis (). However, general linear models have been widely used to identify the factors that significantly affect yield traits ().
Regression tree analysis, one of the methods for evaluating animal data, is regarded as an alternative to the above-mentioned methods (). It is a non-parametric method that partitions the population into homogeneous subsets according to the independent variables that play a major role, and that identifies curvilinear relationships and interactions in explaining the variability in the yield trait, the dependent variable (). The decision tree method is preferred because it copes better with multicollinearity, outliers and missing data and requires no assumption about the distribution of the independent variables ().
In the construction of a decision or regression tree diagram, the CART, CHAID and Exhaustive CHAID algorithms are non-parametric techniques applied to the statistical analysis of nominal, ordinal and scale (continuous) variables (). When the dependent variable is scale, the constructed tree is called a regression tree; otherwise it is called a classification tree (). Regression tree analysis based on these algorithms can be used instead of multiple linear regression, ridge regression, or multiple linear regression on factor analysis or principal component analysis scores. Classification tree analysis based on these algorithms is a good alternative to logistic regression analysis and discriminant analysis.
Regression tree analysis based on data mining algorithms such as C4.5, CART, CHAID, and Exhaustive CHAID is a non-parametric method used mostly in medical, engineering and industrial fields. Although its applications in animal science are steadily increasing (; ; ; ; , ; ; ; ; ; , , ; ; ), goodness of fit criteria have rarely been emphasized in measuring the predictive performance of the algorithms. Besides, there are very few studies that comparatively test the data mining algorithms, which could play a key role in future selection studies (). Moreover, the effect of various parent and child node sizes on the predictive performance of the data mining algorithms has not so far been investigated on the basis of goodness of fit criteria. For these reasons, the aim of this study was to evaluate the predictive performance of the CHAID, Exhaustive CHAID, and CART regression tree methods for different combinations of parent node: child node sizes in a data set from animal science.
MATERIALS AND METHODS
To compare the CHAID, Exhaustive CHAID and CART data mining algorithms, data on 1884 indigenous Mengali sheep (936 males and 948 females) reared at four different farms in Pakistan were used. The input (independent) variables were sex (male and female), birth type (single and twin), lambing year (2005, 2006, 2007, 2008 and 2009), farm (Research Station, Mastung, Quetta and Noshki), birth weight, dam age (20 to 78 months) and dam weight (25 to 48 kg); weaning weight was the dependent (output) variable. To determine the effect of the minimum number of animals in parent and child nodes on prediction performance, twenty-six parent: child combinations from 500:250 to 10:5 were evaluated for each of the CART, CHAID and Exhaustive CHAID algorithms.
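Only the end points of the twenty-six parent: child combinations are stated. The following is a minimal sketch of one grid that fits the description, assuming parent sizes fall in steps of 20 from 500 to 20 and end with 10, and that the child size is always half the parent size. This spacing is an assumption, not something the authors confirm; it is chosen because it gives 26 combinations and contains every proportion named in this article. The function name `node_size_grid` is hypothetical.

```python
def node_size_grid():
    """Return (parent, child) minimum node sizes, largest first.

    Assumption: parent sizes 500, 480, ..., 20, then 10; child = parent / 2.
    """
    parents = list(range(500, 0, -20)) + [10]  # 25 multiples of 20, plus 10
    return [(parent, parent // 2) for parent in parents]


grid = node_size_grid()
print(len(grid), grid[0], grid[-1])
```

Each pair would then be supplied as the minimum parent and child node size when growing a tree with a given algorithm.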
The independent variables in the regression tree method can be nominal, ordinal or scale (). As in the present study, the regression tree method with the CHAID, Exhaustive CHAID and CART data mining algorithms is a convenient way to describe the relationship between a quantitative trait (such as body weight, milk yield or fleece weight) and more than one nominal, ordinal or scale variable. Trees constructed for a dependent variable that takes a limited number of values are called classification trees, whereas trees constructed for an outcome variable that takes unlimited (continuous) values are called regression trees.
CART (Classification and Regression Tree) recursively creates a binary regression tree, dividing each subset into two smaller subsets until homogeneous subsets are attained, whereas the CHAID algorithms recursively create multi-way splits until the variance among subsets in the tree structure is maximized (; ). The risk estimate is the variance within subsets in the regression tree construction.
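The binary splitting idea can be illustrated with a minimal sketch. It is not the SPSS implementation: it searches a single scale predictor for the cut point that minimizes the within-subset sum of squares, and it respects minimum parent and child node sizes. The function names, the toy data and the midpoint rule for the threshold are assumptions made for illustration.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean (within-node variation)."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def best_binary_split(x, y, min_parent=10, min_child=5):
    """Return (threshold, within_sse) for the best split x <= threshold.

    Returns None when the node has fewer than min_parent cases or when no
    split leaves at least min_child cases on both sides.
    """
    n = len(y)
    if n < min_parent:
        return None
    pairs = sorted(zip(x, y))
    best = None
    for i in range(min_child, n - min_child + 1):
        # A cut cannot fall between two equal predictor values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, total)
    return best


# Toy data (not from the study): birth weight (kg) and weaning weight (kg).
birth = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
weaning = [10, 10, 10, 16, 16, 16]
print(best_binary_split(birth, weaning, min_parent=4, min_child=2))
```

Applying the same search recursively to each resulting subset would grow a binary tree. Larger minimum node sizes stop the recursion earlier, which gives fewer and larger terminal nodes.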
Coefficient of Determination
Adjusted Coefficient of Determination
Standard Deviation Ratio
Relative Approximation Error (RAE)
Root Mean Square Error
Coefficient of Variation (%)
Here, Yi is the actual or observed weaning weight (kg) of the ith lamb; Ŷi, the predicted weaning weight of the ith lamb; Ȳ, the average of the actual weaning weights of the Mengali lambs; Ɛi, the residual value of the ith lamb; Ɛ̄, the average of the residual values; k, the number of significant independent variables in the model; and n, the total number of lambs. The residual value of each lamb is Ɛi = Yi - Ŷi.
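With the symbols defined above, the six criteria are commonly written as follows. These are the usual forms found in the regression tree literature and are given here as an assumption, since the exact expressions applied by the authors are not confirmed by the text; ε denotes the residual Ɛ.

```latex
\begin{align*}
R^2\,(\%) &= \left[1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}\right] \times 100 \\[4pt]
\text{Adj-}R^2\,(\%) &= \left[1 - \frac{\frac{1}{n-k-1}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}\right] \times 100 \\[4pt]
SD_{ratio} &= \sqrt{\frac{\frac{1}{n-1}\sum_{i=1}^{n} (\varepsilon_i - \bar{\varepsilon})^2}{\frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \\[4pt]
RAE &= \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} Y_i^2}} \\[4pt]
RMSE &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2} \\[4pt]
CV\,(\%) &= \frac{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (\varepsilon_i - \bar{\varepsilon})^2}}{\bar{Y}} \times 100
\end{align*}
```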
The Pearson correlation coefficients between the observed and predicted weaning weight values were estimated for each of the algorithms. The most predictive algorithm gives the highest values of r, R2 and R2Adj and the lowest values of CV (%), SD ratio, RAE and RMSE (). The regression trees were constructed with IBM SPSS 23 software (SPSS Inc., 2015) by following the command order Analyze > Classify > Tree… in the SPSS package program. Since the dependent variable (weaning weight) is scale, the CART, CHAID and Exhaustive CHAID data mining algorithms were selected as growing methods in SPSS to obtain regression tree diagrams. In the construction of the regression tree for each algorithm, 10-fold cross-validation was employed.
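Outside SPSS, the goodness of fit criteria named in this section can be computed directly from the observed and predicted values. The sketch below assumes the commonly used definitions of each criterion (stated in the comments). The function name, the dictionary keys and the toy numbers are illustrative and do not come from the study.

```python
import math


def goodness_of_fit(observed, predicted, k):
    """Compute r, R2, Adj-R2, SD ratio, RAE, RMSE and CV for one model.

    k is the number of significant independent variables in the model.
    """
    n = len(observed)
    y_bar = sum(observed) / n
    p_bar = sum(predicted) / n
    residuals = [y - p for y, p in zip(observed, predicted)]
    e_bar = sum(residuals) / n

    ss_res = sum(e ** 2 for e in residuals)            # sum (Yi - Yhat_i)^2
    ss_tot = sum((y - y_bar) ** 2 for y in observed)   # sum (Yi - Ybar)^2
    ss_pred = sum((p - p_bar) ** 2 for p in predicted)
    cross = sum((y - y_bar) * (p - p_bar) for y, p in zip(observed, predicted))

    s_e = math.sqrt(sum((e - e_bar) ** 2 for e in residuals) / (n - 1))
    s_y = math.sqrt(ss_tot / (n - 1))

    return {
        "r": cross / math.sqrt(ss_tot * ss_pred),       # Pearson correlation
        "R2": (1 - ss_res / ss_tot) * 100,
        "AdjR2": (1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))) * 100,
        "SD_ratio": s_e / s_y,                          # residual SD / observed SD
        "RAE": math.sqrt(ss_res / sum(y ** 2 for y in observed)),
        "RMSE": math.sqrt(ss_res / n),
        "CV": s_e / y_bar * 100,
    }


# Toy example: two terminal nodes, each predicting its own mean.
fit = goodness_of_fit([10, 12, 14, 16], [11, 11, 15, 15], k=1)
print(fit)
```

Evaluating this function for each algorithm and each parent: child combination would give a table of criteria of the kind compared in the Results.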
RESULTS AND DISCUSSION
The performance of CART and both CHAID algorithms at the different parent: child node proportions specified for the study was compared in order to identify environmental factors that affect weaning weight, and summary results of the goodness of fit criteria estimated for the algorithms are given in , and , respectively. This information is novel in the literature. As the proportion decreased from 500:250 through 10:5, the goodness of fit criteria improved. The regression tree structure generated by the CHAID algorithm was more interpretable than those constructed by the CART and Exhaustive CHAID algorithms, which caused over-branching.
Following are results of Goodness of fit criteria for weaning weight
When the minimum parent: child node proportion was reduced from 500:250 to 10:5, RE (0.992 to 0.510), SD ratio (0.706 to 0.506), RAE (0.249 to 0.178), RMSE (0.996 to 0.506) and CV (%) (6.222 to 4.460) decreased for the CHAID algorithm, whereas the remaining goodness of fit criteria increased (). The Pearson correlation coefficient between observed and predicted weaning weight values increased from 0.708 to 0.863, which indicates a reduction of the variance within the nodes of the tree diagram. From , it is evident that the goodness of fit criteria of the CHAID algorithm did not change between parent: child node proportions 380:190 and 500:250.
The CHAID algorithm constructed the same regression tree diagram for parent: child node proportions 500:250, 480:240, 460:230, 440:220, 420:210, 400:200 and 380:190. The tree diagram is depicted in . All lambs in the Mengali population were split into five subsets (Nodes 1-5) by birth weight, which entered the tree as a significant predictor. The highest weaning weight, 17.614 kg, was found for the subset of lambs in Node 5, whose birth weight was greater than 3.800 kg.
The decision tree generated only for the parent: child node proportion 140:70 is illustrated in . Node 0 was divided by birth weight (the most effective variable) into eight subsets, Nodes 1-8. As birth weight increased from Node 1 to Node 8, weaning weight also increased (). Nodes 2, 3, 4 and 5 were affected by the year factor (Adjusted P=0.000). Nodes 11 and 13 were each divided by the farm factor into two subsets (Nodes 19-20 and Nodes 21-22, respectively) (Adjusted P=0.000). Node 9 was split by dam age into two subsets (Adjusted P=0.043). Node 15 was split by the sex factor into two child subsets (Nodes 23 and 24) (Adjusted P=0.000).
Exhaustive CHAID algorithm
From the parent-child node proportion 500:250 to 10:5, the estimates of the Exhaustive CHAID tree-based algorithm were RE (0.992-0.511), RAE (0.249-0.179), RMSE (0.996-0.715), SD ratio (0.706-0.507), CV (%) (6.222-4.464), R2 (50.176-74.335%) and adjusted R2 (50.128-74.273%), which means that its predictive performance for weaning weight improved as a result of reducing the variance within nodes. Correspondingly, higher coefficients of determination and Pearson correlation coefficients (0.708-0.862) between actual and predicted weaning weight were obtained (). For a good fit, an algorithm must have R2 greater than 70%. The goodness of fit criteria of Exhaustive CHAID were exactly the same as those of the CHAID algorithm between the parent-child node proportions 500:250 and 380:190 (and ). However, at the proportions 300:150, 240:120 and 80:40, birth weight generated successive splits in some branches of the regression tree diagram of the Exhaustive CHAID algorithm. It could be suggested that the CHAID and CART algorithms are preferable at these proportions. In agreement with our results, it has been determined that the Exhaustive CHAID algorithm has a longer operation time than the CHAID algorithm (). However, all the algorithms succeeded in reducing the variance within nodes.
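The reported Exhaustive CHAID estimates can be cross-checked against one another. The check assumes that the SD ratio is the residual standard deviation divided by the standard deviation of the observed values, that R2 is one minus the ratio of the residual to the total sum of squares, and that the residuals have zero mean (which holds on the training data when each terminal node predicts its own mean). It also assumes that RE is the mean squared residual, so that RE equals the square of RMSE.

```latex
\begin{align*}
SD_{ratio}^2 &= \frac{\sum_i (\varepsilon_i - \bar{\varepsilon})^2}{\sum_i (Y_i - \bar{Y})^2}
  = 1 - \frac{R^2(\%)}{100} \quad \text{when } \bar{\varepsilon} = 0 \\[4pt]
500{:}250:\quad (1 - 0.706^2) \times 100 &= 50.16 \approx 50.176 \\
10{:}5:\quad (1 - 0.507^2) \times 100 &= 74.30 \approx 74.335 \\[4pt]
r^2:\quad 0.708^2 = 0.501, &\qquad 0.862^2 = 0.743 \\
RMSE^2:\quad 0.996^2 = 0.992, &\qquad 0.715^2 = 0.511
\end{align*}
```

Within rounding of the three-decimal inputs, the reported SD ratio, R2, Pearson correlation, RMSE and RE values for Exhaustive CHAID agree with one another.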
When the different parent-child node proportions were examined (), the estimates of RE (1.022-0.565), RAE (0.253-0.189), RMSE (1.011-0.756), SD ratio (0.719-0.536), CV (6.463-4.724%), R2 (48.667-71.621%) and adjusted R2 (48.593-71.169%) showed that the CART algorithm provided a much better fit as the proportion went from 500:250 through 10:5, as was also found for the other algorithms. This result may be ascribed to the reduction of the variance in weaning weight within nodes (subsets) of the regression tree diagrams. Likewise, the coefficient of determination, the adjusted coefficient of determination and the Pearson correlation between predicted and actual weaning weight values increased under the same conditions. In line with the other algorithms, CART reduced the variability within nodes, or equivalently increased the variability among nodes, in weaning weight, the response variable. In addition, some authors have mentioned that the SD ratio of a data mining algorithm should be less than 0.40 for a good fit (; ); the lowest SD ratios obtained for the algorithms in the study (0.506 to 0.536 at 10:5) approached but did not fall below this threshold.
Weaning weight in farm animals plays a considerable role in animal husbandry studies. In this respect, the study set out to compare the effect of different parent and child node proportions on the predictive performance of the CART, CHAID and Exhaustive CHAID data mining algorithms, and to assess the suitability of their tree constructions. All the algorithms fitted progressively better as the parent and child node proportion decreased from 500:250 to 10:5. For the Mengali sheep data, the CHAID algorithm generated more appropriate and interpretable regression tree constructions. According to the literature, for example, data mining algorithms can be more effective in predicting live body weight from morphological measurements that are genetically correlated with body weight (; ).
As a result, it is expected that using quantitative traits that are highly genetically correlated with a target trait such as body weight, together with individual inbreeding coefficients and the data mining algorithms, will be useful in obtaining superior animals in animal breeding studies. In other words, cut-off values of individual inbreeding coefficients in the regression tree diagrams formed by the tree-based algorithms might provide information on the degree of inbreeding depression in a flock.
The authors would like to thank Prof. Mohammad Masood Tariq and Dr. Abdul Waheed for allowing us to use their data in the study.
Conflict of interest statement
We declare that we have no conflict of interest.
Akin, M., Eyduran, E. and Reed, B.M., 2016. Using the CHAID data mining algorithm for tissue culture medium optimization. In: In vitro cellular and developmental biology-animal. Vol. 52, Spring ST, New York, NY 10013, USA, pp. 233.
Ali, M., Eyduran, E., Tariq, M.M., Tirink, C., Abbas, F., Bajwa, M.A., Baloch, M.H., Nizamani, A.H., Waheed, A., Awan, M.A., Shah, S.H., Ahmad, Z. and Jan, S., 2015. Comparison of artificial neural network and decision tree algorithms used for predicting live weight at post weaning period from some biometrical characteristics in harnai sheep. Pakistan J. Zool., 47: 1579-1585.
Çamdeviren, H., Mendeş, M., Ozkan, M.M., Toros, F., Şaşmaz, T. and Oner, S., 2005. Determination of depression risk factors in children and adolescents by regression tree methodology. Acta Med. Okayama, 59: 19-26.
Eyduran, E., Yilmaz, I., Kaygisiz, A. and Aktaş, Z.M., 2013b. An investigation on relationship between lactation milk yield, somatic cell count and udder traits in first lactation Turkish Saanen goat using different statistical techniques. J. Anim. Pl. Sci., 23: 956-963.
Grzesiak, W., Lacroix, R., Wójcik, J. and Blaszczyk, P., 2003. A comparison of neural network and multiple regression predictions for 305-day lactation yield using partial lactation records. Canadian J. Anim. Sci., 83: 307-310.
Grzesiak, W., Błaszczyk, P. and Lacroix, R., 2006. Methods of predicting milk yield in dairy cows: predictive capabilities of Wood’s lactation curve and artificial neural networks (ANNs). Comput. Electr. Agric., 54: 69-83.
Kayri, M. and Boysan, M., 2008. Assesment of relation between cognitive vulnerability and depression’s level by using classification and regression tree analysis. Hacettepe Univ. Egit. Fakul. Derg., 34: 168-177.
Khan, M.A., Tariq, M.M., Eyduran, E., Tatliyer, A., Rafeeq, M., Abbas, F., Rashid, N., Awan, M.A. and Javed, K., 2014. Estimating body weight from several body measurements in Harnai sheep without multicolinearity problem. J. Anim. Pl. Sci., 24: 120-126.
Orhan, H., Eyduran, E., Tatliyer, A. and Saygici, H., 2016. Prediction of egg weight from egg quality characteristics via ridge regression and regression tree methods. Rev. Brasil. Zootec., 45: 380-385.
Tariq, M.M., Rafeeq, M., Bajwa, M.A., Abbas, F., Waheed, A., Bukhari, F. A. and Akhtar, P., 2012. Prediction of body weight from body measurements using regression tree (RT) method for indigenous sheep breeds in Balochistan, Pakistan. J. Anim. Pl. Sci., 22: 20-24.
Topal, M., Aksakal, V., Bayram, B. and Yağanoğlu, A.M., 2010. An analysis of factors affecting birth weight and actual milk yield in Swedish red cattle using regression tree analysis. J. Anim. Pl. Sci., 20: 63-69.