cubm package in R to fit CUB models cubm package

The class of CUB models is commonly used by practitioners to model ordinal data, in this paper we propose the cubm package which provides the class of CUB models in the R system for statistical computing. The cubm package allows to specify a formula for each parameter of the model, the Maximum Likelihood (ML) estimation is performed by optimization via the functions nlminb, optim and DEoptim and the variance-covariance matrix can be obtained by numerical approximation of the Hessian matrix or by bootstrap method. The utility of the package is illustrated by an application and a simulation study.


Introduction
The usual statistical approach for modeling ordinal data has been Generalized Linear Models (GLM). When data are collected as ordered responses to a sequence of items (concerning preferences, evaluations, proficiency, etc), current literature has developed a vast amount of results known as Item Response Theory, Iannario (2010). In the last thirteen years, a different approach has been introduced for explaining the behavior of respondents when faced to a single item characterized by ordinal choices, this class of statistical models known as CUB (Combined Uniform and Binomial) has been derived by Piccolo (2003b), Piccolo (2003a), and D'Elia & Piccolo (2005). The model is based on a mixture model that is able to express the stated evaluation via the subject's covariates. Specifically, it examines and compares the uncertainty of the answer and the feeling towards the items.
The recent interest in well-being measurements, initially developed in behavioral contexts, has inspired the application of CUB models to the selection of response categories in a number of research areas. In the food industry, for example, studies have been developed to evaluate preferences and satisfaction from the use of CUB models.  performed a sensory analysis in the food industry in order to obtain useful information for marketing management; Piccolo et al. (2013) studied the importance that respondents assign to a list of intrinsic and extrinsic attributes and the level of agreement that consumers express with a number of statements concerning extra virgin olive oil; Boatto et al. (2016) carried out a study to detect segments of markets based on consumption opinions, purchase characteristics and price of Parmesan cheese, and  conducted an investigation with the objective of evaluating the preferences of fresh food packaging. Other studies have been developed in areas such as education and work, Cafarelli & Crocetta (2016) evaluated the student satisfaction in a Faculty of Economics in Foggia, Gambacorta & Iannario (2013) modeled job satisfaction based on data collected in a Survey in Italy and Capecchi (2015) measured the experience of conflict between personal and organizational ethnics in a large sample of respondents.
Due to the use of the model in many contexts, several authors have developed generalizations, and have improved the initial model. Iannario (2008) specified the statistical implications of dummy covariates by emphasizing the interpretation of the estimated parameters, Iannario (2010) studied the identifiability of the CUB model, Corduas (2011) proposed a test procedure in order to compare CUB models, Innario (2012) and Iannario (2014) studied ordered categorical data with overdispersion in the framework of CUB models, Capecchi & Piccolo (2014) and Grilli et al. (2014) studied and applied latent class CUB models, Oberski & Vermunt (2015) showed the equivalence of loglinear latent class models and CUB models and Piccolo (2015) studied inferential issues on CUB models with covariate. Some generalizations of the CUB models have been introduced, Iannario (2012) proposed a Hierarchical CUB models which is a generalization in which parameters are allowed to be random, and Manisera & Zuccolotto (2013, 2014 generalized CUB models by introducing a new class of models, called Nonlinear CUB, which are able to describe precision processes. The CUB models was implemented in the CUB package (Iannario et al. 2016) written in R system for statistical computing (R Core Team (2017)). The implementation of CUB models relies on one Formula interface (Zeileis & Croissant, 2010), the Maximum Likelihood (ML) estimation is performed by classical EM procedures (McLachlan & Krishnan, 1997) and the optimization procedure is run via the optim function.
Although there is already a package in R for the analysis of CUB models, the cubm package proposed in this paper gives other options to the user to estimate the parameters and variance-covariance matrix, and also allows the user to define a formula for each parameter of the model. In this paper, we describe the cubm package which can be used to estimate parameters. The package is implemented in R system for statistical computing (R Core Team (2017)) and it is available from the GitHub repository. Implementation of CUB models relies on the formula interface of GAMLSS package (Rigby & Stasinopoulos, 2005), allowing to specify a formula for each parameter, the first one for the feeling and the second one for the uncertainty. cubm package fits the CUB models using ML estimation trough different optimization methods: a bounds constrained quasi-Newton method (nlminb), the Differential Evolution algorithm for global optimization of a real-valued function of a real-valued parameter vector (DEoptim, ), a relatively robust method that does not require derivatives (optim: Nelder-Mead), a low-memory optimizer for unconstrained problems with large numbers of parameters (optim: CG), a simple unconstrained variable metric/quasi-Newton method (optim: BFGS), a modest-memory optimizer for bounds constrained problems (optim: L-BFGS-B) and a stochastic method that does not require derivatives (optim: SANN). The variance-covariance matrix can be obtained by numerical approximation to the Hessian matrix (Hessian: numDeriv, (Gilbert & Varadhan, 2016)). If the variancecovariance matrix is not positive definite the procedure uses bootstrap method to estimate the standard deviation of the parameters.
In the remainder of this manuscript we elaborate on cubm's implementation and use. In Section 2 the notation for CUB models is introduced. Section 3 describes the cubm package. Then, Section 4 shows an application. A simulation study is presented in Section 5. Some concluding remarks end the paper.

CUB models
The class of CUB models is built on the basic assumption that, when a subject is asked to express a rating about a given issue on an ordered response scale with m categories, his/her response derives from the combination of a feeling attitude towards the evaluated issue and an intrinsic uncertainty component surrounding the discrete choice. CUB models fit rating data by means of a mixture of two random variables, namely a Shifted Binomial V (m, ξ) with trial parameter m and success probability 1 − ξ, modelling the feeling component, and a discrete Uniform U (m) defined over the support {1, . . . , m}, aimed to model the uncertainty component (Manisera & Zuccolotto 2015). The random variable R represents the observed ratings and has probability mass function given by where r = 1, . . . , m, with θ = (π, ξ) , π ∈ (0, 1] and ξ ∈ [0, 1]. On left panel of Figure 1 we can find the probability mass function for four π values and ξ = 0.70 with m = 5. The 1 − π quantity measures uncertainty that accompanies the choice, for small values of π, the value of 1 − π increases and the form of the distribution tends to be uniform. For the case of π = 0.01 we can see that black line is constant around 0.2. As π increases, 1 − π decreases and the shape of the probability mass function changes, when π = 0.99 the cub distribution in expression (1) corresponds to a shifted binomial distribution with success probability 1 − ξ. On right panel of Figure 1 we can observe the probability mass function for four ξ values and π = 0.99. For small values of ξ, the value of 1 − ξ increases and the distribution tends to give more probability for high values of R, at the contrary, for high values of ξ, the value of 1 − ξ decreases and the distribution tends to give more probability for lower values of R.

Parameter estimation without covariates
Suppose a group of n individuals are asked to rate a product service in a scale from 1 to m. Let R a random variable and r 1 , r 2 , . . . , r n the ratings given by the individuals. If R ∼ CUB(π, ξ, m), the likelihood function of the CUB model is: where θ = (π, ξ, ) and the associated log-likelihood function is: There are not closed forms for maximum likelihood estimates of π and ξ, therefore we need to use numerical methods ).

Parameter estimation with covariates
Suppose a group of n individuals are asked to rate a product service in a scale from 1 to m. Suppose also that for each individual there is additional information such as age, marital status, salary, among others. This additional information corresponds to t variables denoted by X 1 , X 2 , X 3 , . . . , X t . The Table 1 shows a summary of the information available in a CUB model with covariates.
Individual Score X 1 X 2 X 3 · · · X t 1 r 1 Assuming that the responses of the n individuals are distributed CUB(π i , ξ i , m), it is possible to model the parameters π and ξ using subsets of the t covariates shown in the Table 1.
The parameter π for the i-th individual can be modeled using p of the t covariates as follows: where β = (β 0 , β 1 , ..., β p ) and Z i = (1, x 1i , x 2i , . . . , x pi ) . The link function g(·) ensures that the values of β Z i lies within the (0, 1] interval. The most commonly choices for the link functions are logit and probit.
In a similar way, the parameter ξ for the i-th individual can be modeled using q of the t covariates as follows: where γ = (γ 0 , γ 1 , . . . , γ q ) and Substituting the expressions 4 and 5 in the expression 1, we have the following probability mass function: The likelihood function of the CUB model with covariates is given by: with Pr(R = r i |Z i , W i , β, γ) defined in 6 and θ = (β, γ) .
Again for this case we do not have closed forms to estimate β and γ, so we need to use numerical methods.

cubm package
In this section we present the cubm package and some useful functions made in R to fit cub models through maximum likelihood estimation.

Installation
The current version of the cubm package is hosted in github which is a web-based Git repository hosting service. For installation the user need to use the next code that automatically install the devtools package necessary to download the cubm package.

dcub function
The dcub function is used to obtain the probabilities for a cub model given the parameters π, ξ and m as in expression (1). The structure of the function is as follows dcub(x, pi, xi, m, log = FALSE) The arguments for the function are: x: vector of quantiles. pi: uncertainty parameter belongs to (0, 1]. xi: feeling parameter belongs to [0, 1]. m: the maximum value for the response variable. The following code calculates the probability for the cub(π = 0.4, ξ = 0.7, m = 5) distribution shown in Figure 1.

rcub function
The rcub function is used to generate random values from a cub distribution given the parameters π, ξ and m. The structure of the function is as follows rcub(n, pi, xi, m = 5) The arguments for the cub function are: The following code generates 20 observations for cub(π = 0.4, ξ = 0.7, m = 5) distribution. The set.seed function is used to create random numbers that can be reproduced by the user.

cub function
The cub function is used to estimate the parameters via maximum likelihood for a cub model with or without explanatory variables. The structure of the cub function is as follows: cub(pi.fo, xi.fo, m, data=NULL, optimizer="nlminb", pi.link="probit", xi.link="probit") The arguments for the cub function are: pi.fo: a "formula" object for π parameter with two parts using a tilde operator to separate the dependent variable y from the independent variables. For example, y ∼ x1 + x2 is interpreted as modeling y as a linear function of x1 and x2, it can be included interactions and polynomials. xi.fo: a "formula" object for ξ parameter without left part. For example, ∼ x1 + x2 means that we want to model ξ parameter as a linear function of x1 and x2. m: maximum value for the response variable, it must be an integer.
shift: minimum value for the response variable, by default is 1. data: an optional data frame with the response and independent variables. optimizer: optimizer to find the parameter vector, by default is nlminb but are available optim and DEoptim from DEoptim package created by Mullen et al. (2011) that implements the global optimization by differential evolution . pi.link: link function for model π parameter, could be "probit" or "logit", by default is "probit". xi.link: link function for model ξ parameter, could be "probit" or "logit", by default is "probit".

Example 1
In the next example we simulate 1000 observations from a cub(π = 0.4, ξ = 0.7, m = 5) using the seed 1234. The cub function is used with the formula pi.fo = y ∼ 1 to indicate the response variable is y, the ∼ 1 in the formula of pi.fo and xi.fo indicates that we do not have explanatory variables to model π neither ξ. The parameter vector for this example is Θ = (π = 0.4, ξ = 0.7) .
Note that in last examples we used two different optimizers, nlminb and optim, by default cub function uses nlminb.

Simulation study
In this section we present the results from a simulation study to explore the parameter estimation procedures for a model without and with covariates. The scenarios considered here were taken from the structure given in the application from last section.

Simulation study without covariates
To study the estimation procedure without covariates we considered the next model.
The results from this simulation study are shown in the Figure 2. From this figure we note that the sequence ofπ andξ obtain by the cub function goes to the real value very quickly. We observe that even for small samples, the mean estimated parameter is very close the target value given by the dotted line. It can be observed that for n ≤ 30, the blue and orange lines for nlminb and DEoptim respectively, are quite similar.

Simulation study with covariates
To study the estimation procedure with covariates we considered the next model The fixed parameters assumed values of β 0 = 0.93, β 1 = 0.47, γ 0 = −0.95 and γ 1 = −0.33. The values for the covariates gender i and lage i where taken from the original dataset univer. We considered sample size values of n = 10, 20, . . . , 490, 500. For each n we simulated 1000 samples to estimate the parameters, then we calculatde the mean for each estimated parameter. In the same way as in the previous case, we considered the optimizers nlminb, optim and DEoptim available in cub function. The results from this simulation study are shown in the Figure 3. for the intercepts in π and ξ have less variability than mean estimations for the slopes. From this figure is clear that maximum likelihood estimations for each parameter go to the real value as n increases.

Application
In this section we re-analyzed the univer data from Iannario et al. (2016) related to a sample survey conducted in 13 faculties of University of Naples in Italy. The participants were asked to express their opinion about orientation services on a 7 point scale (1 = very unsatisfied, 7 = extremely satisfied). The univer data has 12 variables (faculty, freqserv, age, gender, diploma, residenc, changefa, informat, willingn, officeho, compete, and global) and 2179 observations, the variable called global corresponds to the response variable related to global satisfaction. Figure  4 shows the observed proportion for global satisfaction, from this figure we note that almost 35 % of the participants rated the orientation services with 6 and 31 % of the participants rated the services with a 7, the participants tend to measure the service with high values.
The model considered here can be summarised as: To estimate the π and ξ parameters for the global satisfaction variable with the proposed cubm package we created the model mod0 using the next code.
require(cubm) # Loading the cubm package mod0 <-cub(pi.fo=global~1, xi.fo=~1, m=7, data=univer) summary(mod0) The results for mod0 are shown below. From this output we found that Φ −1 (π) = 1.118 which implies thatπ = 0.868, and in a similar way we found thatξ = 0.171. The model mod0 has a log-likelihood value of −3245.474 with two parameters. In Figure 5 we observed the proportion for global satisfaction in black line and the estimated proportion in red line, we can note that the estimated curve follows the observed curve.  2016) considered a cub model to explain the parameters π and ξ as a function of gender and lage respectively, the model can be summarised as: where global is the response variable with values from 1 to 7, gender = 0 corresponds to man and gender = 1 corresponds to woman, lage is the transformation of age variable given by lage i = log(age i ) − log(age).
To fit the model (12) with the proposed cubm package we used the next code. mod <-cub(pi.fo=global~gender, xi.fo=~lage, m=7, data=univer, optimizer='optim') summary(mod) The results obtained from the summary table for model mod are shown next. From this output we note that the two variables considered gender and lage were significant, this model has a log-likelihood value of −3233.465 with four parameters.
From expression (13) we obtain that the uncertain parameter 1 −π is 0.1771118 for male and 0.0817097 for female, this means that female tends to respond the survey with less uncertain than male.
From expression (14) we obtain that the feeling parameter is 1 −ξ = Φ(0.949215 + 0.325423 × lage), this means that the age is related to feeling in a positive way, elder people tend to response the survey with more feeling than young people. Figure 6 shows the relation between feeling (1 − ξ) with age (left panel) and the transformed age lage (right panel), from both panels we can note that the feeling increases as age increases.