Regression Analysis Tutorial and Examples
4 stars based on
Logistic regression is one of the most frequently used statistical methods as a binary logistic regression analysis ppt method of data analysis in many fields over the last decade. However, analysis of binary logistic regression analysis ppt and identification of influential outliers are not studied so frequently to check the adequacy of the fitted logistic regression model.
Detection of outliers and influential cases and corresponding treatment is very crucial task of any modeling exercise. A failure to detect influential cases can have severe distortion on the validity of the inferences drawn from such modeling. The aim of this study is to evaluate different measures of standardized residuals and diagnostic statistics by graphical methods to identify potential outliers.
Evaluation of diagnostic statistics and their graphical display detected 25 cases as outliers but they did not play notable effect on parameter estimates and summary measures of fits. It is recommended to use residual analysis and note outlying cases that can frequently lead to valuable insights for strengthening the model.
Often the outcome variable in the social data is in general not a continuous value instead a binary one. In such a case, binary logistic regression is a useful way of describing the relationship between one or more independent variables and a binary outcome variable, expressed as a probability scale that has only two possible values.
Indeed, a generalized linear model is used for binary logistic regression. The most attractive feature of a logistic regression model is neither assumes the linearity in the relationship between the covariates and the outcome variable, nor does it require normally distributed variables. It also does not assume homoscedasticity and in general has less stringent requirements than linear regression models.
Thus logistic regression is used in a wide range of applications leading to binary dependent data analysis Hilbe, ; Agresti, The vast majority of the work related to the logistic regression appears in the experimental epidemiological research but during the last decade it is evident that the technique is frequently used in observational studies. But binary logistic regression analysis ppt of residuals and the identification of outliers and influential cases are not studied so frequently to check the adequacy of the fitted model.
Data obtained from observational studies sometimes can be considered as bad from the point of view of outlying responses. The traditional method of fitting logistic regression models with maximum likelihood, has good optimality properties in ideals settings, but is extremely sensitive to bad data obtained from observational studies Pregibon, Frequently in logistic regression analysis applications, the real data set contains some cases that are binary logistic regression analysis ppt that is the observations for these cases are well separated from the remainder of the data.
These outlying cases may involve large residuals and often have dramatic effects on the fitted maximum likelihood linear predictor. For logistic regression with one or two predictor variables, it is relatively simple to identify outlying cases with respect to their X or Y values by means of scatter plots of residuals and to binary logistic regression analysis ppt whether they are influential in affecting the fitted linear predictor.
When more than two predictor variables are included in the logistic regression model, however, the identification of outlying cases by simple graphical methods becomes difficult. In such a case, traditional standardized residual plots can highlight little regarding outliers and some derived statistics and their plots from basic building blocks binary logistic regression analysis ppt lowess smooth and bubble plots are potential to detect outliers and influential cases Kutner et al.
There are three ways that an observation can be considered as unusual, namely outlier, influence and leverage. In logistic regression, a set of observations whose values deviate from the expected range and produce extremely large residuals and may indicate a sample peculiarity is called outliers. These outliers can unduly influence the results of the analysis and lead to incorrect inferences. An observation is said to be influential if removing the observation substantially changes the estimate of coefficients.
Influence can be thought of as the product of leverage and outliers. An observation with an extreme value on binary logistic regression analysis ppt predictor variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. In fact, the leverage indicates the geometric extremeness of an observation in the multi-dimensional covariate space.
These leverage points can have an unusually large effect on the estimate of logistic regression coefficients Cook, Standardized residuals outside of this range are potential outliers.
In that situation, the lack of fit can be attributed to outliers and the large residuals will be easy to find in the plot. But analysts may attempt to find group of points that are not well fit by the model rather than concentrating on individual points. Techniques for judging the influence of a point on a particular aspect of the fit such as binary logistic regression analysis ppt developed by Pregibon seem more justified than outlier detection Jennings, A failure to detect outliers and hence influential cases can have severe distortion on the validity of the inferences drawn from such modeling exercise.
It would be reasonable to use diagnostics to check if the model can be improved in case of Correct Classification Rate CCR is smaller than The main focus in binary logistic regression analysis ppt study is binary logistic regression analysis ppt detect outliers and influential cases that have a substantial impact on the fitted logistic regression model through appropriate graphical method including smoothing technique.
The Bangladesh Demographic and Health Survey is part of the worldwide Demographic and Health Surveys program, which is designed to collect data on fertility, family planning, maternal and child health. The BDHS is a source of population and health data for policymakers and the research community. BDHS is the fourth survey conducted in Bangladesh and binary logistic regression analysis ppt for the survey started in mid and field work was carried out between January and May A total of 11, eligible women were furnished their responses.
But in this analysis there are only 2, eligible women those are able to bear and desire more children are considered. The women under sterilization, declared in fecund, divorced, widowed, having more than and less than two living children are not involved in the analysis.
Those women who have two living children and able to bear and desire more children are only considered here during the period of global two children campaign. The variable age of the respondent, fertility preference, place of residence, highest year of education, working status and expected number of children are considered in the analysis. The responses are coded 0 for no more and 1 for have another is considered as desire for children which is the binary response variable Y in the analysis.
The age of the respondent X 1place of residence X 2 is coded 0 for urban and 1 for rural, highest year of education X 3working status of respondent X 4 is coded 0 for not working and 1 for working and expected number of children X 5 is coded 0 for two and 1 for more than two are considered as covariates in the logistic regression model.
Several standardized residual plots, lowess smooth and diagnostic plots are used to detect influential outliers. The binary logistic regression model computes the probability of the selected response as a function of the values of the explanatory variables. A major binary logistic regression analysis ppt with the linear probability binary logistic regression analysis ppt is that probabilities are bounded by 0 and 1, but linear functions are inherently unbounded.
The solution is to transform the probability so that it is no longer bounded. Transforming the probability to odds removes the upper bound and natural logarithm of odds also removes the lower bound. Thus, setting the result binary logistic regression analysis ppt to a linear function of the explanatory variables yields logit or binary response model Allison, It is evident that Sigmoidal-shape curve configuration has been found to be appropriate in many applications for which the outcome variable is binary and the corresponding model having more than one explanatory variable can be written as:.
Mainly for this reason the ML method based on Newton-Raphson iteratively reweighted least square algorithm becomes more popular with the researchers Ryan, The sample likelihood function is, in general defined as the joint probability function of the random variables whose realizations constitute the sample. Since the Y i is a Bernoulli random variable, the probability mass function of Y i is.
For convenience in multiple logistic regression models, the likelihood equations can be written in matrix notation as:. No closed form solution exists for the values of that maximize the log-likelihood function. Computer-intensive numerical search procedures are therefore required to find the maximum likelihood estimates and hencebecause the multiple logistic regression model computes the probability of the selected response as a function of the values of the predictor variables.
There are several widely used numerical search procedures, one of these employs iteratively reweighted least squares algorithm. In this study, we shall rely on standard statistical software programs specifically designed for logistic regression to obtain the maximum likelihood estimates of parameters. In order to check the goodness-of-fit of an binary logistic regression analysis ppt multiple logistic regression model one should assume that the model contains those variables that should be in the model and have been entered in the correct functional form.
The goodness-of-fit measures how effectively the model describes the response variable. The distribution of the goodness-of-fit statistics is obtained by letting the sample size n become large. If the number of covariate patterns increases binary logistic regression analysis ppt n then size of each covariate pattern tends to be small.
Generally, the term covariate pattern is used to describe a single set of values for the covariates in the model. Distributional results obtained under the condition that only n become large are said to be based on n-asymptotic. The case most frequently encountered binary logistic regression analysis ppt practice that the model contains one or more continuous covariates.
In such a situation the number of covariate patterns is approximately equal to the sample size and the current study contains two continuous covariates and the number of covariate patterns may not be an issue when the fit of the model is assessed. To assess the goodness-of-fit of the model, researcher should have some specific idea about what it means to say that a model fits.
Thus, a complete assessment of the fitted model involves both the calculation of summary measures of the distance between Y and and a thorough examination of the individual components of these measures.
When model building stage has been completed, binary logistic regression analysis ppt series of logical steps should be used to assess the fit of the model. The components of proposed approach are: The summary measures of goodness-of-fit, as they are routinely provided as binary logistic regression analysis ppt output with any fitted model and give an overall indication of the fit of the model.
The different summary measures like likelihood ratio test, Hosmer and Lemeshow, goodness-of-fit test, Osius and Rojek, normal approximation test, Stukel, test and other supplementary statistics indicate that the model seems to fit quite well. It is also evident that the individual predictors in the fitted model have significant contribution to predict the response variable through likelihood ratio test as well as Wald test Sarkar and Midi, The elaboration of these measures is beyond the scope of the study.
Before concluding that the model fits, it is crucial that other measures be examined to see if fit is supported over the entire set of covariate binary logistic regression analysis ppt. This is accomplished through a series of specialized measures falling under the general heading of residual analysis and regression diagnostics Cook and Weisberg, Residual analysis for logistic regression is more difficult than the linear regression models because the responses take on only the values 0 and 1.
Thus the ith ordinary residual will assume one of the two values as:. The ordinary residuals will not be normally distributed and, indeed their distribution under the assumption that the fitted model is correct is unknown.
Plots of ordinary residuals against fitted values will generally be uninformative. In linear regression a key assumption is that the error variance does not depend on the conditional mean E Y X. Hence, the ordinary residual can be made more comparable by dividing them by the estimated standard error of Y i which is known as Pearson residual denoted by pr i and defined as:.
The Pearson residuals are directly related to the Pearson chi-square goodness-of-fit statistic. The square of Pearson residual measures the contribution of each binary response to the Pearson chi-square test statistic but the test statistic does not follow an approximate chi-square distribution for binary data without replicates. The Pearson residuals do not have unit variance since no allowance has been made for the inherent variation in the fitted value.
A better procedure is to further standardize the binary logistic regression analysis ppt residuals by their estimated standard deviation that is called studentized Pearson residuals. The standard deviation is approximated by:. More clearly leverage is a measure of the importance of an observation to the fit of the model.
The hat matrix for logistic regression satisfies approximately the expression where, is the nx1 vector of linear binary logistic regression analysis ppt. Then studentized Pearson residuals spr i are defined as:. Studentized Pearson residuals are primarily helpful in identifying influential observations and those build in information about the influence of a case, whereas Pearson residuals do not.
More influential cases with high leverages result in high studentized Pearson residuals. Deviance residual is another type of residual. It measures the disagreement between any component of the log likelihood of the fitted model and the corresponding component of the log likelihood that would result if each point were fitted exactly.