GAMLSS models applied in the treatment of agro-industrial waste

In this paper, we present an application of GAMLSS (Generalized Additive Models for Location, Scale and Shape) to study bacterial cellulose production from agro-industrial waste. An experiment was conducted to investigate the effects of pH and cultivation time on the yield of bacterial cellulose obtained from discarded bananas. Several models were fitted to the collected data to obtain estimated expressions for the mean and variance of bacterial cellulose yield. We found that the mean and variance of cellulose yield decrease as pH increases, while the opposite occurs as cultivation time increases.

Keywords: GAMLSS models, gamma distribution, linear regression, parameter estimation.

1 Hernández, F., Torres, M., Arteaga, L., Castro, C. (2015) GAMLSS models applied in the treatment of agro-industrial waste. Comunicaciones en Estadística, 8(2), 245-254.
a Assistant professor, Universidad Nacional de Colombia, Sede Medellín, Colombia.
b Full professor, Universidad Pontificia Bolivariana, Medellín, Colombia.
c Master in biotechnology, Universidad Pontificia Bolivariana, Medellín, Colombia.
d Associate professor, Universidad Pontificia Bolivariana, Medellín, Colombia.


Introduction
The problems of the massive exploitation of natural resources and environmental pollution have motivated the building of an economy based on renewable materials. For this reason, polymers obtained from renewable resources such as polysaccharides, proteins, and lignin, among others, are attracting considerable attention (Jaramillo et al. 2013). It has been found that valuable products such as bacterial cellulose can be obtained from agro-industrial waste through suitable processing.
Obtaining bacterial cellulose depends on, among other factors, pH and fermentation time, and therefore, it is important to determine the combination of these factors that maximizes the bacterial cellulose yield.
Unlike cellulose from plants, bacterial cellulose (BC) is produced with higher purity and exhibits unique mechanical properties (Shoda & Sugano 2005), making it a suitable raw material for high-fidelity acoustic speakers, high-quality paper, foods, and as a biomaterial in cosmetics, pharmaceuticals and medicine (Raghunathan 2013, Çoban & Biyik 2011, Rani & Appaiah 2013, Chawla et al. 2009).
The relatively high cost of BC production may limit its application to high value-added products (Legge 1990). Significant cost reductions are possible with improvements in fermentation efficiency and economies of scale (Raghunathan 2013).

GAMLSS
Rigby & Stasinopoulos (2005) proposed the GAMLSS models (Generalized Additive Models for Location, Scale and Shape), which assume that the response variables $y_i$ (with $i = 1, \ldots, n$) are independent with a probability density function $f(y_i \mid \theta_i)$, where $\theta_i = (\mu_i, \sigma_i, \nu_i, \tau_i)^T$ is the parameter vector. The first two elements, $\mu_i$ and $\sigma_i$, are the location and scale parameters, and the others are shape parameters. GAMLSS models allow each parameter to be a function of a set of explanatory variables, and the distribution of the random variable $y_i$ is not limited to the exponential family (Rigby & Stasinopoulos 2005, Stasinopoulos & Rigby 2007). GAMLSS models can be summarized as follows:

$$g_1(\mu) = \eta_1 = X_1 \beta_1 + \sum_{j=1}^{J_1} Z_{j1} \gamma_{j1} \qquad (1)$$

$$g_2(\sigma) = \eta_2 = X_2 \beta_2 + \sum_{j=1}^{J_2} Z_{j2} \gamma_{j2} \qquad (2)$$

$$g_3(\nu) = \eta_3 = X_3 \beta_3 + \sum_{j=1}^{J_3} Z_{j3} \gamma_{j3} \qquad (3)$$

$$g_4(\tau) = \eta_4 = X_4 \beta_4 + \sum_{j=1}^{J_4} Z_{j4} \gamma_{j4} \qquad (4)$$

where $g_k(\cdot)$ is a known monotonic link function for $k = 1, \ldots, 4$; $\mu$, $\sigma$, $\nu$, $\tau$ and $\eta_k$ are $n$-dimensional vectors; $X_k$ are known design matrices of order $n \times J'_k$ associated with fixed effects $\beta_k$ of $J'_k \times 1$; and $Z_{jk}$ are known design matrices of order $n \times q_{jk}$ associated with random effects $\gamma_{jk}$ of $q_{jk} \times 1$ with multivariate normal distribution. The quantity $J'_k$ represents the number of covariates used in the fixed effects of $\eta_k$, while $J_k$ represents the number of random effects in $\eta_k$. The model given in (1) to (4) can be summarized in a compact form as follows:

$$g_k(\theta_k) = \eta_k = X_k \beta_k + \sum_{j=1}^{J_k} Z_{jk} \gamma_{jk}, \qquad k = 1, \ldots, 4,$$

with $\theta_1 = \mu$, $\theta_2 = \sigma$, $\theta_3 = \nu$ and $\theta_4 = \tau$.

The GAMLSS model considers both continuous and discrete distributions, with different parameterizations available for the same distribution. The details of the distributions and parameterizations used in GAMLSS models can be found in Rigby & Stasinopoulos (2010, page 199). Another advantage of GAMLSS models is that they allow the use of fixed effects, random effects and non-parametric smoothing functions to model all parameters of the distribution assumed for the response variable.
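As a rough illustration of how each distribution parameter can depend on covariates, the sketch below evaluates only the fixed-effects part of two linear predictors and maps each one back to its parameter through a log link. The design matrix, the coefficient values and the helper names are hypothetical and serve only to show the mechanics; random effects and smoothing terms are left out.

```python
import math

def linear_predictor(X, beta):
    """Return eta = X beta for a design matrix given as a list of rows."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

def inverse_log_link(eta):
    """Invert the link g(theta) = log(theta), i.e. theta = exp(eta)."""
    return [math.exp(e) for e in eta]

# Hypothetical design matrix: intercept plus two covariates, three observations.
X = [
    [1.0, 3.5, 7.0],
    [1.0, 4.0, 10.0],
    [1.0, 4.5, 13.0],
]

# Hypothetical coefficient vectors, one per distribution parameter.
beta_mu = [0.5, -0.2, 0.1]
beta_sigma = [0.0, 0.0, 0.0]

# Each parameter gets its own predictor and is recovered through its own link.
mu = inverse_log_link(linear_predictor(X, beta_mu))
sigma = inverse_log_link(linear_predictor(X, beta_sigma))
```

With a log link, both parameters are positive for any coefficient values, which is one reason this link is a common choice for location and scale parameters of positive responses.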

Experiment description
An experiment was conducted to study the effect of pH and cultivation time (days) on the production of bacterial cellulose using the microorganism Gluconacetobacter medellinensis. Each sample unit corresponded to 100 grams of overmature banana, which was cut into smaller pieces and homogenized with 400 mL of water using a blender. This mixture was filtered using a cloth membrane. The juice obtained from each sample was analyzed to determine the pH. After completion of the fermentation time, the bacterial cellulose membrane obtained was removed and placed in a 5% (w/w) KOH solution for 14 hours at a temperature between 28 and 30 degrees Celsius. The cellulose membranes were then washed successively with water until the pH was neutral, and the washed membranes were dried in a convection oven at 60 degrees Celsius for 24 hours and then at 105 degrees Celsius for 2 hours or until constant weight was reached. At the end of this process, the amount of bacterial cellulose was measured; see Figure 1.

The response variable in the experiment was the bacterial cellulose yield, measured in grams of dry BC obtained from each experimental unit. Figure 2 shows the density plot and boxplot for bacterial cellulose yield, revealing that the response variable is right-skewed, with a minimum value of 0.0181, a median of 0.0787, a maximum of 0.5707, and 5 of the 32 observations appearing to be outliers. For these reasons, it seems reasonable to use a skewed distribution to model the cellulose yield.
Figure 2: Density and boxplot for bacterial cellulose yield (g). Source: Own elaboration.
Figure 3 shows the scatterplot for bacterial cellulose yield, pH and cultivation time. The maximum bacterial cellulose yield was obtained at pH 3.5 with 13 days of cultivation; the yield decreases with increasing pH and tends to increase with cultivation time.

Results
In this section, we present the results of using GAMLSS models to explain bacterial cellulose yield ($y$) with the explanatory variables pH and cultivation time. In Table 1, we present the models considered: models 1 to 3 assume a response variable with normal distribution (only as a reference point), and models 4 to 10 consider asymmetric distributions for the response variable. The third column of the table shows the structure, in GAMLSS syntax, used to model the $\mu$ and $\sigma$ parameters of each distribution. The last column of Table 1 shows the Akaike information criterion (AIC) proposed by Akaike (1973), which is a measure of the relative quality of a statistical model for a given data set. The AIC is given by

$$\mathrm{AIC} = -2\hat{\ell} + 2\,df,$$

where $\hat{\ell}$ is the estimated log-likelihood function, defined by $\hat{\ell} = \ell(\hat{\theta}) = \sum_{i=1}^{n} \log f(y_i \mid \hat{\mu}_i, \hat{\sigma}_i, \hat{\nu}_i, \hat{\tau}_i)$, and $df$ is the number of estimated parameters. Different models can be compared using their global deviances, $\mathrm{GD} = -2\hat{\ell}$ (if they are nested), or using the generalized Akaike information criterion, $\mathrm{GAIC} = -2\hat{\ell} + \#\,df$, with $\#$ as a required penalty; when $\# = 2$, the GAIC corresponds to the usual AIC. The preferred model is the one with the minimum AIC value. Table 1 shows that model 6 has the lowest AIC. This model considers a gamma distribution for cellulose yield with $\log(\cdot)$ as the link function to model $\mu$ and $\sigma$.
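The model comparison described above can be sketched in a few lines. The log-likelihood values and parameter counts below are hypothetical placeholders, not the values reported in Table 1, and the function names are illustrative only.

```python
def gaic(log_likelihood, df, penalty=2.0):
    """Generalized AIC: -2 * log-likelihood + penalty * df (penalty=2 gives the AIC)."""
    return -2.0 * log_likelihood + penalty * df

def global_deviance(log_likelihood):
    """Global deviance: -2 * log-likelihood."""
    return -2.0 * log_likelihood

# Hypothetical fitted models: (estimated log-likelihood, number of estimated parameters).
candidates = {
    "model A": (40.0, 4),
    "model B": (45.0, 6),
    "model C": (45.5, 9),
}

# AIC for every candidate; the preferred model has the smallest value.
aic = {name: gaic(ll, df) for name, (ll, df) in candidates.items()}
best = min(aic, key=aic.get)
```

In this made-up example, the third model has the highest log-likelihood, but its three extra parameters are not worth the small gain, so the second model is preferred.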
The probability density function for the gamma distribution with $\mu$ and $\sigma$ parameters ($\mu > 0$ and $\sigma > 0$) is given by

$$f(y \mid \mu, \sigma) = \frac{1}{(\sigma^2 \mu)^{1/\sigma^2}} \, \frac{y^{1/\sigma^2 - 1} \, e^{-y/(\sigma^2 \mu)}}{\Gamma(1/\sigma^2)}, \qquad y > 0,$$

where $E(Y) = \mu$ and $Var(Y) = \sigma^2 \mu^2$. Figure 4 shows the density for two combinations of the parameters $\mu$ and $\sigma$. The gamma distribution is suitable for modeling skewed variables such as bacterial cellulose yield. Table 2 presents the estimated parameters for model 6, which considers the gamma distribution for the response variable. From this table, we can see that each variable is significant at the 5% level in explaining the $\mu$ and $\sigma$ parameters.
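The moment relations E(Y) = µ and Var(Y) = σ²µ² can be checked numerically. The sketch below assumes the gamma density is parameterized with shape 1/σ² and scale σ²µ, which is the parameterization consistent with those two moments; the parameter values and the integration settings are arbitrary choices for illustration.

```python
import math

def gamma_pdf(y, mu, sigma):
    """Gamma density with mean mu and coefficient of variation sigma.

    Assumed parameterization: shape = 1 / sigma**2, scale = sigma**2 * mu.
    """
    if y <= 0:
        return 0.0
    shape = 1.0 / sigma**2
    scale = sigma**2 * mu
    log_pdf = ((shape - 1.0) * math.log(y) - y / scale
               - shape * math.log(scale) - math.lgamma(shape))
    return math.exp(log_pdf)

def moments(mu, sigma, upper, steps=200_000):
    """Midpoint-rule approximations of total mass, mean and variance on (0, upper)."""
    h = upper / steps
    m0 = m1 = m2 = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        w = gamma_pdf(y, mu, sigma) * h
        m0 += w
        m1 += y * w
        m2 += y * y * w
    return m0, m1, m2 - m1**2

# Arbitrary illustrative values: mean 0.1, coefficient of variation 0.5.
total, mean, variance = moments(mu=0.1, sigma=0.5, upper=2.0)
```

Under this parameterization, σ is the coefficient of variation of Y, so modeling σ with covariates amounts to letting the relative dispersion of the yield change with pH and cultivation time.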
From Table 2, estimated expressions can be obtained for the $\mu$ and $\sigma$ parameters. In terms of these estimates, the estimated mean and variance of cellulose yield are

$$\hat{E}(Y) = \hat{\mu} = e^{-1.45 - 0.65\,\mathrm{pH} + 0.18\,\mathrm{Time}} \qquad (9)$$

$$\widehat{Var}(Y) = \hat{\mu}^2 \hat{\sigma}^2 = e^{0.26 - 2.42\,\mathrm{pH} + 0.36\,\mathrm{Time}} \qquad (10)$$

From the above expressions, we note that for each additional day of cultivation time, at a fixed value of pH, the mean cellulose yield increases by 19.72% (obtained from $e^{0.18} = 1.1972$); similarly, for a fixed cultivation time, the variance decreases by 91.11% for each additional unit of pH (obtained from $e^{-2.42} = 0.0889$). Figure 5 plots the estimated mean and variance for several cultivation time values. From this figure, we observe that the mean and variance of cellulose yield decrease as pH increases. The opposite occurs for the mean and variance as cultivation time increases.

Figure 6 shows the heat plot for the estimated mean of bacterial cellulose yield given by equation (9), with the colors representing the value of the estimated yield. From this plot we can see that the maximum expected bacterial cellulose yield is obtained with a low value of pH and the maximum value of cultivation time.

Figure 7 presents the residual analysis for model 6. The distribution of the residuals is not far from the normal distribution, which indicates that this model is appropriate for the data. Additionally, a Shapiro test for normality was carried out, giving a p-value of 0.4027. Although the sample size in this experiment was only 32, we found that model 6 adequately explains the cellulose yield, because the residuals do not violate the normality assumption.
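The percentage effects quoted above follow directly from equations 9 and 10, as the following sketch shows. The coefficients are the rounded values from those equations; the grid of pH and cultivation time values is hypothetical and is used only to illustrate where the estimated mean is largest.

```python
import math

def estimated_mean(ph, time):
    """Equation 9: estimated mean yield as a function of pH and time (days)."""
    return math.exp(-1.45 - 0.65 * ph + 0.18 * time)

def estimated_variance(ph, time):
    """Equation 10: estimated variance of the yield."""
    return math.exp(0.26 - 2.42 * ph + 0.36 * time)

# Multiplicative effect of one extra day on the mean, at fixed pH.
day_ratio = estimated_mean(4.0, 8.0) / estimated_mean(4.0, 7.0)

# Multiplicative effect of one extra pH unit on the variance, at fixed time.
ph_ratio = estimated_variance(5.0, 7.0) / estimated_variance(4.0, 7.0)

# Hypothetical evaluation grid (not the experimental design).
ph_grid = [3.5, 4.0, 4.5, 5.0]
time_grid = [7, 10, 13]
best_ph, best_time = max(
    ((p, t) for p in ph_grid for t in time_grid),
    key=lambda pt: estimated_mean(*pt),
)
```

Because both equations are log-linear, the two ratios do not depend on the particular pH or time at which they are evaluated, and the largest estimated mean on any rectangular grid lies at the lowest pH and the longest cultivation time.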

Conclusions
The GAMLSS model is a useful statistical technique for modeling all parameters of a probability density (or mass) function of a response variable using a set of covariates.
In this paper we showed an application of GAMLSS to model the bacterial cellulose yield using pH and cultivation time as covariates. The results shown in Figures 5 and 6 indicate that the maximum bacterial cellulose yield is obtained for low values of pH and a cultivation time close to 14 days. These results agree with the experiment of Castro et al. (2012), which concluded that the optimal bacterial cellulose yield for this type of experiment is found near pH 3.5. The two explanatory variables used in the model were significant in explaining the mean and variance of bacterial cellulose yield; equations 9 and 10 could be used by researchers to model (or predict) the system behavior under those conditions and to describe the variability of the bacterial cellulose yield.

Figure 4: Density for gamma distribution for two parameter combinations. Source: Own elaboration.

Figure 5: Estimated mean and variance for three cultivation time values. Source: Own elaboration.

Figure 6: Heat plot for estimated mean of the bacterial cellulose yield $\hat{E}(Y)$. Source: Own elaboration.

Figure 7: QQ plot and worm plot for residuals of model 6. Source: Own elaboration.

Table 1: AIC values for each fitted model.

Table 2: Estimated parameters for model 6.