12.1 - Logistic Regression

Logistic regression models a relationship between predictor variables and a categorical response variable. For example, we could use logistic regression to model the relationship between various measurements of a manufactured specimen (such as dimensions and chemical composition) to predict if a crack greater than 10 mils will occur (a binary variable: either yes or no). Logistic regression helps us estimate a probability of falling into a certain level of the categorical response given a set of predictors. We can choose from three types of logistic regression, depending on the nature of the categorical response variable:

Binary Logistic Regression:

Used when the response is binary (i.e., it has two possible outcomes). The cracking example given above would utilize binary logistic regression. Other examples of binary responses could include passing or failing a test, responding yes or no on a survey, and having high or low blood pressure.

Nominal Logistic Regression:

Used when there are three or more categories with no natural ordering to the levels. Examples of nominal responses could include departments at a business (e.g., marketing, sales, HR), type of search engine used (e.g., Google, Yahoo!, MSN), and color (black, red, blue, orange).

Ordinal Logistic Regression:

Used when there are three or more categories with a natural ordering to the levels, but the ranking of the levels does not necessarily mean the intervals between them are equal. Examples of ordinal responses could be how students rate the effectiveness of a college course (e.g., good, medium, poor), levels of flavors for hot wings, and medical condition (e.g., good, stable, serious, critical).

Particular issues with modelling a categorical response variable include nonnormal error terms, nonconstant error variance, and constraints on the response function (i.e., the response is bounded between 0 and 1). We will investigate ways of dealing with these in the binary logistic regression setting here. Nominal and ordinal logistic regression are not considered in this course.

The multiple binary logistic regression model is the following:

\[\begin{align}\label{logmod} \pi(\textbf{X})&=\frac{\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k})}{1+\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k})}\notag \\ & =\frac{\exp(\textbf{X}\beta)}{1+\exp(\textbf{X}\beta)}\\ & =\frac{1}{1+\exp(-\textbf{X}\beta)}, \end{align}\]

where \(\pi\) denotes a probability and not the irrational number 3.14....

  • \(\pi\) is the probability that an observation is in a specified category of the binary Y variable, generally called the "success probability."
  • Notice that the model describes the probability of an event happening as a function of X variables. For instance, it might provide estimates of the probability that an older person has heart disease.
  • The numerator, \(\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k})\), must be positive, because it is a power of a positive value (\(e\)).
  • The denominator of the model is 1 plus the numerator, so the value of \(\pi\) is always strictly between 0 and 1.
  • With one X variable, the theoretical model for \(\pi\) has an elongated "S" shape (or sigmoidal shape) with asymptotes at 0 and 1, although in sample estimates we may not see this "S" shape if the range of the X variable is limited.

For a sample of size n , the likelihood for a binary logistic regression is given by:

\[\begin{align*} L(\beta;\textbf{y},\textbf{X})&=\prod_{i=1}^{n}\pi_{i}^{y_{i}}(1-\pi_{i})^{1-y_{i}}\\ & =\prod_{i=1}^{n}\biggl(\frac{\exp(\textbf{X}_{i}\beta)}{1+\exp(\textbf{X}_{i}\beta)}\biggr)^{y_{i}}\biggl(\frac{1}{1+\exp(\textbf{X}_{i}\beta)}\biggr)^{1-y_{i}}. \end{align*}\]

This yields the log likelihood:

\[\begin{align*} \ell(\beta)&=\sum_{i=1}^{n}[y_{i}\log(\pi_{i})+(1-y_{i})\log(1-\pi_{i})]\\ & =\sum_{i=1}^{n}[y_{i}\textbf{X}_{i}\beta-\log(1+\exp(\textbf{X}_{i}\beta))]. \end{align*}\]

Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, $\hat{\beta}$.
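To make the estimation step concrete, here is a minimal numpy sketch of the Newton-Raphson update, which for logistic regression coincides with iteratively reweighted least squares. This is illustrative only (the function and variable names are our own); in practice one would rely on statistical software such as Minitab, R, or statsmodels.

import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit binary logistic regression by iteratively reweighted least squares.
    X is an n-by-(k+1) design matrix whose first column is 1s; y contains 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                      # linear predictor X*beta
        pi = 1.0 / (1.0 + np.exp(-eta))     # fitted success probabilities
        w = pi * (1.0 - pi)                 # diagonal of the weight matrix W
        # Newton/IRLS update: beta_new = beta + (X'WX)^{-1} X'(y - pi)
        XtWX = X.T @ (w[:, None] * X)
        step = np.linalg.solve(XtWX, X.T @ (y - pi))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

The design matrix can be built with, for example, np.column_stack([np.ones(len(x)), x]) so that the first coefficient plays the role of \(\beta_0\).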

To illustrate, consider data published on n = 27 leukemia patients. The data (leukemia_remission.txt) have a binary response variable indicating whether leukemia remission occurred (REMISS), with remission coded as 1.

The predictor variables are cellularity of the marrow clot section (CELL), smear differential percentage of blasts (SMEAR), percentage of absolute marrow leukemia cell infiltrate (INFIL), percentage labeling index of the bone marrow leukemia cells (LI), absolute number of blasts in the peripheral blood (BLAST), and the highest temperature prior to start of treatment (TEMP).

The following output shows the estimated logistic regression equation and associated significance tests:

  • Select Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model.
  • Select "REMISS" for the Response (the response event for remission is 1 for this data).
  • Select all the predictors as Continuous predictors.
  • Click Options and choose Deviance or Pearson residuals for diagnostic plots.
  • Click Graphs and select "Residuals versus order."
  • Click Results and change "Display of results" to "Expanded tables."
  • Click Storage and select "Coefficients."

Coefficients

Term        Coef  SE Coef           95% CI  Z-Value  P-Value     VIF
Constant    64.3     75.0  ( -82.7, 211.2)     0.86    0.391
CELL        30.8     52.1  ( -71.4, 133.0)     0.59    0.554   62.46
SMEAR       24.7     61.5  ( -95.9, 145.3)     0.40    0.688  434.42
INFIL      -25.0     65.3  (-152.9, 103.0)    -0.38    0.702  471.10
LI          4.36     2.66  ( -0.85,  9.57)     1.64    0.101    4.43
BLAST      -0.01     2.27  ( -4.45,  4.43)    -0.01    0.996    4.18
TEMP      -100.2     77.8  (-252.6,  52.2)    -1.29    0.198    3.01

The Wald test is the test of significance for individual regression coefficients in logistic regression (recall that we use t -tests in linear regression). For maximum likelihood estimates, the ratio

\[\begin{equation*} Z=\frac{\hat{\beta}_{i}}{\textrm{s.e.}(\hat{\beta}_{i})} \end{equation*}\]

can be used to test $H_{0}: \beta_{i}=0$. The standard normal curve is used to determine the $p$-value of the test. Furthermore, confidence intervals can be constructed as

\[\begin{equation*} \hat{\beta}_{i}\pm z_{1-\alpha/2}\textrm{s.e.}(\hat{\beta}_{i}). \end{equation*}\]

Estimates of the regression coefficients, $\hat{\beta}$, are given in the Coefficients table in the column labeled "Coef." This table also gives coefficient p-values based on Wald tests. The labeling index of the bone marrow leukemia cells (LI) has the smallest p-value and so appears to be closest to a significant predictor of remission occurring. After looking at various subsets of the data, we find that a good model is one which only includes the labeling index as a predictor:

Coefficients

Term       Coef  SE Coef          95% CI  Z-Value  P-Value   VIF
Constant  -3.78     1.38  (-6.48, -1.08)    -2.74    0.006
LI         2.90     1.19  ( 0.57,  5.22)     2.44    0.015  1.00

Regression Equation

P(1) = exp(Y')/(1 + exp(Y'))

Y' = -3.78 + 2.90 LI

Since we only have a single predictor in this model we can create a Binary Fitted Line Plot to visualize the sigmoidal shape of the fitted logistic regression curve:

Binary fitted line plot
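The same curve can be sketched outside Minitab. For example, the following Python snippet (illustrative only) plots the fitted equation using the estimated coefficients -3.78 and 2.90 reported above:

import numpy as np
import matplotlib.pyplot as plt

b0, b1 = -3.78, 2.90                    # fitted coefficients from the output above
li = np.linspace(0, 2, 200)             # LI appears to range between 0 and 2
pi_hat = np.exp(b0 + b1 * li) / (1 + np.exp(b0 + b1 * li))

plt.plot(li, pi_hat)
plt.xlabel("LI")
plt.ylabel("Estimated P(remission)")
plt.title("Fitted binary logistic regression curve")
plt.show()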

Odds, Log Odds, and Odds Ratio

There are algebraically equivalent ways to write the logistic regression model:

The first is

\[\begin{equation}\label{logmod1} \frac{\pi}{1-\pi}=\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k}), \end{equation}\]

which is an equation that describes the odds of being in the current category of interest. By definition, the odds for an event are \(\pi/(1-\pi)\), where \(\pi\) is the probability of the event. For example, if you are at the racetrack and there is an 80% chance that a certain horse will win the race, then the odds of that horse winning are 0.80 / (1 - 0.80) = 4, or 4:1.

The second is

\[\begin{equation}\label{logmod2} \log\biggl(\frac{\pi}{1-\pi}\biggr)=\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k}, \end{equation}\]

which states that the (natural) logarithm of the odds is a linear function of the X variables (and is often called the log odds ). This is also referred to as the logit transformation of the probability of success,  \(\pi\).

The odds ratio (which we will write as $\theta$) between the odds for two sets of predictors (say $\textbf{X}_{(1)}$ and $\textbf{X}_{(2)}$) is given by

\[\begin{equation*} \theta=\frac{(\pi/(1-\pi))|_{\textbf{X}=\textbf{X}_{(1)}}}{(\pi/(1-\pi))|_{\textbf{X}=\textbf{X}_{(2)}}}. \end{equation*}\]

For binary logistic regression, the odds of success are:

\[\begin{equation*} \frac{\pi}{1-\pi}=\exp(\textbf{X}\beta). \end{equation*}\]

By plugging this into the formula for $\theta$ above and setting $\textbf{X}_{(1)}$ equal to $\textbf{X}_{(2)}$ except in one position (i.e., only one predictor differs by one unit), we can determine the relationship between that predictor and the response. The odds ratio can be any nonnegative number. An odds ratio of 1 serves as the baseline for comparison and indicates there is no association between the response and predictor. If the odds ratio is greater than 1, then the odds of success are higher for higher levels of a continuous predictor (or for the indicated level of a factor). In particular, the odds increase multiplicatively by $\exp(\beta_{j})$ for every one-unit increase in $\textbf{X}_{j}$. If the odds ratio is less than 1, then the odds of success are less for higher levels of a continuous predictor (or for the indicated level of a factor). Values farther from 1 represent stronger degrees of association.

For example, when there is just a single predictor, \(X\), the odds of success are:

\[\begin{equation*} \frac{\pi}{1-\pi}=\exp(\beta_0+\beta_1X). \end{equation*}\]

If we increase \(X\) by one unit, the odds ratio is

\[\begin{equation*} \theta=\frac{\exp(\beta_0+\beta_1(X+1))}{\exp(\beta_0+\beta_1X)}=\exp(\beta_1). \end{equation*}\]

To illustrate, the relevant output from the leukemia example is:

Odds Ratios for Continuous Predictors

    Odds Ratio              95% CI
LI     18.1245  (1.7703, 185.5617)

The regression parameter estimate for LI is $2.89726$, so the odds ratio for LI is calculated as $\exp(2.89726)=18.1245$. The 95% confidence interval is calculated as $\exp(2.89726\pm z_{0.975}*1.19)$, where $z_{0.975}=1.960$ is the $97.5^{\textrm{th}}$ percentile from the standard normal distribution. The interpretation of the odds ratio is that for every increase of 1 unit in LI, the estimated odds of leukemia remission are multiplied by 18.1245. However, since LI appears to fall between 0 and 2, it may make more sense to say that for every 0.1 unit increase in LI, the estimated odds of remission are multiplied by $\exp(2.89726\times 0.1)=1.336$. Then

  • At LI=0.9, the estimated odds of leukemia remission is $\exp\{-3.77714+2.89726*0.9\}=0.310$.
  • At LI=0.8, the estimated odds of leukemia remission is $\exp\{-3.77714+2.89726*0.8\}=0.232$.
  • The resulting odds ratio is $\frac{0.310}{0.232}=1.336$, which is the ratio of the odds of remission when LI=0.9 compared to the odds when LI=0.8.

Notice that $1.336\times 0.232=0.310$, which demonstrates the multiplicative effect by $\exp(0.1\hat{\beta_{1}})$ on the odds.
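These calculations are simple enough to verify directly. The following short Python sketch (illustrative, not part of the original output) reproduces the odds and odds ratios above from the reported coefficient estimates:

import numpy as np

b0, b1 = -3.77714, 2.89726            # estimates from the output above

print(np.exp(b1))                     # 18.1245, odds ratio for a 1-unit increase in LI
odds_09 = np.exp(b0 + b1 * 0.9)       # odds of remission at LI = 0.9, about 0.310
odds_08 = np.exp(b0 + b1 * 0.8)       # odds of remission at LI = 0.8, about 0.232
print(odds_09 / odds_08)              # 1.336
print(np.exp(0.1 * b1))               # 1.336, the multiplicative effect of a 0.1-unit increase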

Likelihood Ratio (or Deviance) Test

The likelihood ratio test is used to test the null hypothesis that any subset of the $\beta$'s is equal to 0. The number of $\beta$'s in the full model is \(k+1\), while the number of $\beta$'s in the reduced model is \(r+1\). (Remember the reduced model is the model that results when the $\beta$'s in the null hypothesis are set to 0.) Thus, the number of $\beta$'s being tested in the null hypothesis is \((k+1)-(r+1)=k-r\). Then the likelihood ratio test statistic is given by:

\[\begin{equation*} \Lambda^{*}=-2(\ell(\hat{\beta}^{(0)})-\ell(\hat{\beta})), \end{equation*}\]

where $\ell(\hat{\beta})$ is the log likelihood of the fitted (full) model and $\ell(\hat{\beta}^{(0)})$ is the log likelihood of the (reduced) model specified by the null hypothesis evaluated at the maximum likelihood estimate of that reduced model. This test statistic has a $\chi^{2}$ distribution with \(k-r\) degrees of freedom. Statistical software often presents results for this test in terms of "deviance," which is defined as \(-2\) times log-likelihood. The notation used for the test statistic is typically $G^2$ = deviance (reduced) – deviance (full).

This test procedure is analogous to the general linear F test procedure for multiple linear regression. However, note that when testing a single coefficient, the Wald test and likelihood ratio test will not in general give identical results.

To illustrate, the relevant software output from the leukemia example is:

Deviance Table

Source      DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   1    8.299     8.299        8.30    0.004
LI           1    8.299     8.299        8.30    0.004
Error       25   26.073     1.043
Total       26   34.372

Since there is only a single predictor for this example, this table simply provides information on the likelihood ratio test for LI ( p -value of 0.004), which is similar but not identical to the earlier Wald test result ( p -value of 0.015). The Deviance Table includes the following:

  • The null (reduced) model in this case has no predictors, so the fitted probabilities are simply the sample proportion of successes, \(9/27=0.333333\). The log-likelihood for the null model is \(\ell(\hat{\beta}^{(0)})=-17.1859\), so the deviance for the null model is \(-2\times-17.1859=34.372\), which is shown in the "Total" row in the Deviance Table.
  • The log-likelihood for the fitted (full) model is \(\ell(\hat{\beta})=-13.0365\), so the deviance for the fitted model is \(-2\times-13.0365=26.073\), which is shown in the "Error" row in the Deviance Table.
  • The likelihood ratio test statistic is therefore \(\Lambda^{*}=-2(-17.1859-(-13.0365))=8.299\), which is the same as \(G^2=34.372-26.073=8.299\).
  • The p -value comes from a $\chi^{2}$ distribution with \(2-1=1\) degrees of freedom.
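The same arithmetic can be checked with a few lines of Python using the deviances reported in the table; scipy's chi-square survival function gives the upper-tail p-value. This is only a verification sketch, not part of the Minitab output.

from scipy import stats

deviance_reduced = 34.372     # intercept-only model ("Total" row)
deviance_full = 26.073        # model with LI ("Error" row)

G2 = deviance_reduced - deviance_full          # 8.299
p_value = stats.chi2.sf(G2, df=1)              # upper-tail area of a chi-square(1)
print(G2, p_value)                             # 8.299, approximately 0.004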

When using the likelihood ratio (or deviance) test for more than one regression coefficient, we can first fit the "full" model to find deviance (full), which is shown in the "Error" row in the resulting full model Deviance Table. Then fit the "reduced" model (corresponding to the model that results if the null hypothesis is true) to find deviance (reduced), which is shown in the "Error" row in the resulting reduced model Deviance Table. For example, the relevant Deviance Tables for the Disease Outbreak example on pages 581-582 of Applied Linear Regression Models (4th ed) by Kutner et al are:

Full model:

Source      DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   9   28.322   3.14686       28.32    0.001
Error       88   93.996   1.06813
Total       97  122.318

Reduced model:

Source      DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   4   21.263    5.3159       21.26    0.000
Error       93  101.054    1.0866
Total       97  122.318

Here the full model includes four single-factor predictor terms and five two-factor interaction terms, while the reduced model excludes the interaction terms. The test statistic for testing the interaction terms is \(G^2 = 101.054-93.996 = 7.058\), which is compared to a chi-square distribution with \(10-5=5\) degrees of freedom to find the p -value = 0.216 > 0.05 (meaning the interaction terms are not significant at a 5% significance level).

Alternatively, select the corresponding predictor terms last in the full model and request the software to output Sequential (Type I) Deviances. Then add the corresponding Sequential Deviances in the resulting Deviance Table to calculate \(G^2\). For example, the relevant Deviance Table for the Disease Outbreak example is:

Source          DF  Seq Dev  Seq Mean  Chi-Square  P-Value
Regression       9   28.322    3.1469       28.32    0.001
  Age            1    7.405    7.4050        7.40    0.007
  Middle         1    1.804    1.8040        1.80    0.179
  Lower          1    1.606    1.6064        1.61    0.205
  Sector         1   10.448   10.4481       10.45    0.001
  Age*Middle     1    4.570    4.5697        4.57    0.033
  Age*Lower      1    1.015    1.0152        1.02    0.314
  Age*Sector     1    1.120    1.1202        1.12    0.290
  Middle*Sector  1    0.000    0.0001        0.00    0.993
  Lower*Sector   1    0.353    0.3531        0.35    0.552
Error           88   93.996    1.0681
Total           97  122.318

The test statistic for testing the interaction terms is \(G^2 = 4.570+1.015+1.120+0.000+0.353 = 7.058\), the same as in the first calculation.
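If the models are fit in Python rather than Minitab, the same comparison can be made by fitting the full and reduced models and differencing their deviances. The sketch below uses statsmodels; the file name and column names (disease, age, middle, lower, sector) are hypothetical stand-ins for the disease outbreak data.

import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("disease_outbreak.csv")      # hypothetical file name

full = smf.logit("disease ~ age + middle + lower + sector"
                 " + age:middle + age:lower + age:sector"
                 " + middle:sector + lower:sector", data=df).fit()
reduced = smf.logit("disease ~ age + middle + lower + sector", data=df).fit()

# deviance = -2 * log-likelihood, so G^2 = deviance(reduced) - deviance(full)
G2 = 2 * (full.llf - reduced.llf)
df_diff = full.df_model - reduced.df_model    # 9 - 4 = 5
p_value = stats.chi2.sf(G2, df_diff)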

Goodness-of-Fit Tests

Overall performance of the fitted model can be measured by several different goodness-of-fit tests. Two tests that require replicated data (multiple observations with the same values for all the predictors) are the Pearson chi-square goodness-of-fit test and the deviance goodness-of-fit test (analogous to the multiple linear regression lack-of-fit F-test). Both of these tests have statistics that are approximately chi-square distributed with \(c-k-1\) degrees of freedom, where \(c\) is the number of distinct combinations of the predictor variables. When a test is rejected, there is a statistically significant lack of fit. Otherwise, there is no evidence of lack of fit.

By contrast, the Hosmer-Lemeshow goodness-of-fit test is useful for unreplicated datasets or for datasets that contain just a few replicated observations. For this test the observations are grouped based on their estimated probabilities. The resulting test statistic is approximately chi-square distributed with \(c-2\) degrees of freedom, where \(c\) is the number of groups (generally chosen to be between 5 and 10, depending on the sample size).

Goodness-of-Fit Tests

Test             DF  Chi-Square  P-Value
Deviance         25       26.07    0.404
Pearson          25       23.93    0.523
Hosmer-Lemeshow   7        6.87    0.442

Since there is no replicated data for this example, the deviance and Pearson goodness-of-fit tests are invalid, so the first two rows of this table should be ignored. However, the Hosmer-Lemeshow test does not require replicated data so we can interpret its high p -value as indicating no evidence of lack-of-fit.
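For readers working outside Minitab, a bare-bones version of the Hosmer-Lemeshow calculation can be sketched as follows: group the observations by their fitted probabilities, then compare observed and expected counts in each group. The function name and grouping details are illustrative, and grouping conventions differ between packages, so the result need not match the output above exactly.

import numpy as np
import pandas as pd
from scipy import stats

def hosmer_lemeshow(y, pi_hat, groups=10):
    """Hosmer-Lemeshow chi-square statistic from 0/1 responses and fitted probabilities."""
    data = pd.DataFrame({"y": y, "pi": pi_hat})
    # qcut puts roughly equal numbers of observations in each probability group
    data["group"] = pd.qcut(data["pi"], q=groups, duplicates="drop")
    obs1 = data.groupby("group", observed=True)["y"].sum()      # observed successes
    exp1 = data.groupby("group", observed=True)["pi"].sum()     # expected successes
    n_g = data.groupby("group", observed=True)["y"].count()
    obs0, exp0 = n_g - obs1, n_g - exp1
    chi2 = (((obs1 - exp1) ** 2 / exp1) + ((obs0 - exp0) ** 2 / exp0)).sum()
    df = len(n_g) - 2
    return chi2, stats.chi2.sf(chi2, df)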

The calculation of \(R^{2}\) used in linear regression does not extend directly to logistic regression. One version of \(R^{2}\) used in logistic regression is defined as

\[\begin{equation*} R^{2}=\frac{\ell(\hat{\beta_{0}})-\ell(\hat{\beta})}{\ell(\hat{\beta_{0}})-\ell_{S}(\beta)}, \end{equation*}\]

where $\ell(\hat{\beta_{0}})$ is the log likelihood of the model when only the intercept is included and $\ell_{S}(\beta)$ is the log likelihood of the saturated model (i.e., where a model is fit perfectly to the data). This \(R^{2}\) does go from 0 to 1, with 1 being a perfect fit. With unreplicated data, $\ell_{S}(\beta)=0$, so the formula simplifies to:

\[\begin{equation*} R^{2}=\frac{\ell(\hat{\beta_{0}})-\ell(\hat{\beta})}{\ell(\hat{\beta_{0}})}=1-\frac{\ell(\hat{\beta})}{\ell(\hat{\beta_{0}})}. \end{equation*}\]

Model Summary

Deviance  Deviance
    R-Sq  R-Sq(adj)    AIC
  24.14%     21.23%  30.07

Recall from above that \(\ell(\hat{\beta})=-13.0365\) and \(\ell(\hat{\beta}^{(0)})=-17.1859\), so:

\[\begin{equation*} R^{2}=1-\frac{-13.0365}{-17.1859}=0.2414. \end{equation*}\]

Note that we can obtain the same result by simply using deviances instead of log-likelihoods since the $-2$ factor cancels out:

\[\begin{equation*} R^{2}=1-\frac{26.073}{34.372}=0.2414. \end{equation*}\]
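As a quick check, both versions of the calculation can be reproduced directly from the quantities reported above (a verification sketch only):

ll_null = -17.1859      # log-likelihood of the intercept-only model
ll_fitted = -13.0365    # log-likelihood of the model with LI

r2_from_loglik = 1 - ll_fitted / ll_null                   # 0.2414
r2_from_deviance = 1 - (-2 * ll_fitted) / (-2 * ll_null)   # 0.2414, the -2 factors cancel
print(r2_from_loglik, r2_from_deviance)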

Raw Residual

The raw residual is the difference between the actual response and the estimated probability from the model. The formula for the raw residual is

\[\begin{equation*} r_{i}=y_{i}-\hat{\pi}_{i}. \end{equation*}\]

Pearson Residual

The Pearson residual corrects for the unequal variance in the raw residuals by dividing by the standard deviation. The formula for the Pearson residuals is

\[\begin{equation*} p_{i}=\frac{r_{i}}{\sqrt{\hat{\pi}_{i}(1-\hat{\pi}_{i})}}. \end{equation*}\]

Deviance Residuals

Deviance residuals are also popular because the sum of squares of these residuals is the deviance statistic. The formula for the deviance residual is

\[\begin{equation*} d_{i}=\pm\sqrt{2\biggl[y_{i}\log\biggl(\frac{y_{i}}{\hat{\pi}_{i}}\biggr)+(1-y_{i})\log\biggl(\frac{1-y_{i}}{1-\hat{\pi}_{i}}\biggr)\biggr]}. \end{equation*}\]
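Given the responses and fitted probabilities from any fitted binary logistic model, all three residual types can be computed in a few lines. The numpy sketch below (with an illustrative function name) uses the simplification of the deviance residual that holds when \(y_i\) is 0 or 1:

import numpy as np

def logistic_residuals(y, pi_hat):
    """Raw, Pearson, and deviance residuals for a fitted binary logistic model."""
    raw = y - pi_hat
    pearson = raw / np.sqrt(pi_hat * (1 - pi_hat))
    # for 0/1 responses, one of the two log terms inside the deviance residual vanishes
    dev = np.sign(raw) * np.sqrt(-2 * (y * np.log(pi_hat) + (1 - y) * np.log(1 - pi_hat)))
    return raw, pearson, dev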

Here are the plots of the Pearson residuals and deviance residuals for the leukemia example. There are no alarming patterns in these plots to suggest a major problem with the model.

residual plots for leukemia data

The hat matrix serves a similar purpose as in the case of linear regression – to measure the influence of each observation on the overall fit of the model – but the interpretation is not as clear due to its more complicated form. The hat values (leverages) are given by

\[\begin{equation*} h_{i,i}=\hat{\pi}_{i}(1-\hat{\pi}_{i})\textbf{x}_{i}^{\textrm{T}}(\textbf{X}^{\textrm{T}}\textbf{W}\textbf{X})^{-1}\textbf{x}_{i}, \end{equation*}\]

where W is an $n\times n$ diagonal matrix with the values of $\hat{\pi}_{i}(1-\hat{\pi}_{i})$ for $i=1,\ldots,n$ on the diagonal. As before, we should investigate any observations with $h_{i,i}>3p/n$ or, failing this, any observations with $h_{i,i}>2p/n$ that are very isolated.
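The leverages can be computed directly from the design matrix and fitted probabilities. A small numpy sketch (illustrative names, not tied to any particular package) is:

import numpy as np

def logistic_leverages(X, pi_hat):
    """Diagonal of the hat matrix; X is the n-by-p design matrix including the column of 1s."""
    w = pi_hat * (1 - pi_hat)                        # diagonal of W
    XtWX_inv = np.linalg.inv(X.T @ (w[:, None] * X))
    # h_ii = pi_i (1 - pi_i) * x_i' (X'WX)^{-1} x_i
    return w * np.einsum("ij,jk,ik->i", X, XtWX_inv, X)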

Studentized Residuals

We can also report Studentized versions of some of the earlier residuals. The Studentized Pearson residuals are given by

\[\begin{equation*} sp_{i}=\frac{p_{i}}{\sqrt{1-h_{i,i}}} \end{equation*}\]

and the Studentized deviance residuals are given by

\[\begin{equation*} sd_{i}=\frac{d_{i}}{\sqrt{1-h_{i, i}}}. \end{equation*}\]

Cook's Distances

An extension of Cook's distance for logistic regression measures the overall change in fitted logits due to deleting the $i^{\textrm{th}}$ observation. It is defined by:

\[\begin{equation*} \textrm{C}_{i}=\frac{p_{i}^{2}h _{i,i}}{(k+1)(1-h_{i,i})^{2}}. \end{equation*}\]
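Building on the residual and leverage sketches above, the Studentized residuals and Cook's distances follow directly (again, an illustrative sketch rather than the exact software implementation):

import numpy as np

def logistic_influence(pearson, dev, h, k):
    """Studentized Pearson/deviance residuals and Cook's distances.
    pearson, dev, and h come from the earlier sketches; k is the number of predictors."""
    sp = pearson / np.sqrt(1 - h)
    sd = dev / np.sqrt(1 - h)
    cooks = (pearson ** 2 * h) / ((k + 1) * (1 - h) ** 2)
    return sp, sd, cooks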

Fits and Diagnostics for Unusual Observations

     Observed
Obs  Probability    Fit  SE Fit          95% CI   Resid  Std Resid  Del Resid        HI
  8        0.000  0.849   0.139  (0.403, 0.979)  -1.945      -2.11      -2.19  0.149840

Obs  Cook’s D     DFITS
  8      0.58  -1.08011  R

R  Large residual

The residuals in this output are deviance residuals, so observation 8 has a deviance residual of \(-1.945\), a studentized deviance residual of \(-2.19\), a leverage (h) of \(0.149840\), and a Cook's distance (C) of 0.58.


Logistic Regression in Machine Learning

Logistic regression is a supervised machine learning algorithm used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class. It is a statistical method that analyzes the relationship between a set of predictor variables and a categorical outcome. This article explores the fundamentals of logistic regression, its types, and its implementation.

What is Logistic Regression?

Logistic regression is used for binary classification. It uses the sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1.

For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic function for an input is greater than 0.5 (the threshold value), the input belongs to Class 1; otherwise it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.

Key Points:

  • Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value.
  • The outcome can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, the model gives probabilistic values that lie between 0 and 1.
  • In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
  • The sigmoid function is a mathematical function used to map the predicted values to probabilities.
  • It maps any real value to a value within the range 0 to 1. Since the output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, the curve takes the form of an "S".
  • The S-shaped curve is called the sigmoid function or the logistic function.
  • In logistic regression, we use the concept of a threshold value, which separates predictions of 0 and 1: values above the threshold tend toward 1, and values below the threshold tend toward 0.

On the basis of the categories, Logistic Regression can be classified into three types:

  • Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variables, such as 0 or 1, Pass or Fail, etc.
  • Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”
  • Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent variables, such as “low”, “Medium”, or “High”.

We will explore the assumptions of logistic regression, as understanding them is important to ensure that the model is applied appropriately. The assumptions include:

  • Independent observations: Each observation is independent of the others, meaning there is no correlation between observations.
  • Binary dependent variables: It takes the assumption that the dependent variable must be binary or dichotomous, meaning it can take only two values. For more than two categories SoftMax functions are used.
  • Linearity relationship between independent variables and log odds: The relationship between the independent variables and the log odds of the dependent variable should be linear.
  • No outliers: There should be no outliers in the dataset.
  • Large sample size: The sample size is sufficiently large

Terminologies involved in Logistic Regression

Here are some common terms involved in logistic regression:

  • Independent variables: The input characteristics or predictor factors used to predict the dependent variable.
  • Dependent variable: The target variable in a logistic regression model, which we are trying to predict.
  • Logistic function: The formula used to represent how the independent and dependent variables relate to one another. The logistic function transforms the input variables into a probability value between 0 and 1, which represents the likelihood of the dependent variable being 1 or 0.
  • Odds: The ratio of something occurring to something not occurring. It is different from probability, which is the ratio of something occurring to everything that could possibly occur.
  • Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the odds. In logistic regression, the log odds of the dependent variable are modeled as a linear combination of the independent variables and the intercept.
  • Coefficient: The logistic regression model’s estimated parameters, which show how the independent and dependent variables relate to one another.
  • Intercept: A constant term in the logistic regression model, which represents the log odds when all independent variables are equal to zero.
  • Maximum likelihood estimation: The method used to estimate the coefficients of the logistic regression model, which maximizes the likelihood of observing the data given the model.

The logistic regression model transforms the continuous output of the linear regression function into a categorical output using a sigmoid function, which maps any real-valued combination of the independent variables into a value between 0 and 1. This function is known as the logistic function.

Let the independent input features be:

\[X = \begin{bmatrix} x_{11}  & \ldots & x_{1m}\\ x_{21}  & \ldots & x_{2m} \\  \vdots & \ddots  & \vdots  \\ x_{n1}  & \ldots & x_{nm} \end{bmatrix}\]

and let the dependent variable \(Y\) take only binary values, i.e., 0 or 1:

\[Y = \begin{cases} 0 & \text{ if Class 1} \\ 1 & \text{ if Class 2} \end{cases}\]

Then apply the multi-linear function to the input variables \(X\):

\[z = \left(\sum_{i=1}^{m} w_{i}x_{i}\right) + b\]

Here \(x_i\) is the \(i\)th feature of an observation, \(w = [w_1, w_2, w_3, \cdots, w_m]\) is the vector of weights (coefficients), and \(b\) is the bias term, also known as the intercept. This can be written compactly as a dot product:

\[z = w\cdot X + b\]

Everything discussed above is just linear regression.

Sigmoid Function

Now we pass \(z\) through the sigmoid function to obtain a probability between 0 and 1, i.e., the predicted \(y\):

\[\sigma(z) = \frac{1}{1+e^{-z}}\]

Sigmoid function

As shown in the figure above, the sigmoid function converts continuous input values into probabilities, i.e., values between 0 and 1.

  • \(\sigma(z)\) tends towards 1 as \(z\rightarrow\infty\)
  • \(\sigma(z)\) tends towards 0 as \(z\rightarrow-\infty\)
  • \(\sigma(z)\) is always bounded between 0 and 1

The probability of each class can then be written as:

\[P(y=1) = \sigma(z), \qquad P(y=0) = 1-\sigma(z)\]
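A tiny numpy illustration of these formulas (the input values are arbitrary examples):

import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function mapping any real z to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 4.0])
p_class1 = sigmoid(z)            # P(y = 1)
p_class0 = 1 - p_class1          # P(y = 0)
print(p_class1.round(3))         # [0.018 0.5   0.982]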

Logistic Regression Equation

The odds are the ratio of the probability of something occurring to the probability of it not occurring. This is different from probability, which is the ratio of something occurring to everything that could possibly occur. So the odds are:

\[\frac{p(x)}{1-p(x)}  = e^z\]

Applying the natural log to the odds, the log odds are:

\[\begin{aligned} \log \left[\frac{p(x)}{1-p(x)} \right] &= z \\ \log \left[\frac{p(x)}{1-p(x)} \right] &= w\cdot X +b \\ \frac{p(x)}{1-p(x)} &= e^{w\cdot X +b} \quad \text{(exponentiating both sides)} \\ p(x) &= e^{w\cdot X +b}\,(1-p(x)) \\ p(x) &= e^{w\cdot X +b}-e^{w\cdot X +b}\, p(x) \\ p(x)+e^{w\cdot X +b}\, p(x) &= e^{w\cdot X +b} \\ p(x)\,(1+e^{w\cdot X +b}) &= e^{w\cdot X +b} \\ p(x) &= \frac{e^{w\cdot X +b}}{1+e^{w\cdot X +b}} \end{aligned}\]

The final logistic regression equation is then:

\[p(X;b,w) = \frac{e^{w\cdot X +b}}{1+e^{w\cdot X +b}} = \frac{1}{1+e^{-(w\cdot X +b)}}\]

Likelihood Function for Logistic Regression

The predicted probabilities are:

  • for \(y=1\): \(p(X;b,w) = p(x)\)
  • for \(y=0\): \(1-p(X;b,w) = 1-p(x)\)

so the likelihood of the observed data is:

\[L(b,w) = \prod_{i=1}^{n}p(x_i)^{y_i}\,(1-p(x_i))^{1-y_i}\]

Taking the natural log of both sides gives the log-likelihood:

\[\begin{aligned}\log(L(b,w)) &= \sum_{i=1}^{n} \left[ y_i\log p(x_i) + (1-y_i)\log(1-p(x_i)) \right] \\ &=\sum_{i=1}^{n} \left[ y_i\log p(x_i)+\log(1-p(x_i))-y_i\log(1-p(x_i)) \right] \\ &=\sum_{i=1}^{n} \log(1-p(x_i)) +\sum_{i=1}^{n}y_i\log \frac{p(x_i)}{1-p(x_i)} \\ &=-\sum_{i=1}^{n} \log\left(1+e^{w\cdot x_i+b}\right) +\sum_{i=1}^{n}y_i\,(w\cdot x_i +b) \end{aligned}\]

since \(\log(1-p(x_i))=-\log\left(1+e^{w\cdot x_i+b}\right)\) and \(\log\frac{p(x_i)}{1-p(x_i)}=w\cdot x_i+b\).

Gradient of the log-likelihood function

To find the maximum likelihood estimates, we differentiate the log-likelihood with respect to \(w_j\):

\[\begin{aligned} \frac{\partial \log(L(b,w))}{\partial w_j}&=-\sum_{i=1}^{n}\frac{e^{w\cdot x_i+b}}{1+e^{w\cdot x_i+b}}\, x_{ij} +\sum_{i=1}^{n}y_{i}\,x_{ij} \\&=-\sum_{i=1}^{n}p(x_i;b,w)\,x_{ij}+\sum_{i=1}^{n}y_{i}\,x_{ij} \\&=\sum_{i=1}^{n}\left(y_i -p(x_i;b,w)\right)x_{ij} \end{aligned}\]

Setting these derivatives to zero gives equations with no closed-form solution, so the estimates are found numerically.
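This gradient is all that is needed for a simple gradient-ascent fit. The sketch below is a minimal educational implementation in numpy (the learning rate, iteration count, and function name are arbitrary choices), not the optimizer used by scikit-learn:

import numpy as np

def fit_logistic_gradient_ascent(X, y, lr=0.1, n_iter=5000):
    """Maximize the log-likelihood by gradient ascent, using
    dL/dw_j = sum_i (y_i - p(x_i)) x_ij derived above."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # current predicted probabilities
        grad_w = X.T @ (y - p)                   # gradient with respect to the weights
        grad_b = np.sum(y - p)                   # gradient with respect to the bias
        w += lr * grad_w / n                     # average over n for a stable step size
        b += lr * grad_b / n
    return w, b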

Binomial Logistic regression:  

The target variable can have only 2 possible types, "0" or "1", which may represent "win" vs "loss", "pass" vs "fail", "dead" vs "alive", etc. In this case, the sigmoid function, already discussed above, is used.

First, import the necessary libraries. The following Python code uses the breast cancer dataset to fit a logistic regression model for classification.

# import the necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# split the train and test dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=23)

# LogisticRegression
clf = LogisticRegression(random_state=0)
clf.fit(X_train, y_train)

# Prediction
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print("Logistic Regression model accuracy (in %):", acc * 100)

Logistic Regression model accuracy (in %): 95.6140350877193

Multinomial Logistic Regression:

Target variable can have 3 or more possible types which are not ordered (i.e. types have no quantitative significance) like “disease A” vs “disease B” vs “disease C”.

In this case, the softmax function is used in place of the sigmoid function. Softmax function for K classes will be:

\[\text{softmax}(z_i) =\frac{ e^{z_i}}{\sum_{j=1}^{K}e^{z_{j}}}\]

Here, K represents the number of elements in the vector z, and i, j iterate over all the elements in the vector.

Then the probability for class c will be:

\[P(Y=c \mid \overrightarrow{X}=x) = \frac{e^{w_c \cdot x + b_c}}{\sum_{k=1}^{K}e^{w_k \cdot x + b_k}}\]

In Multinomial Logistic Regression, the output variable can have more than two possible discrete outputs. Consider the Digit Dataset.

from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model, metrics

# load the digit dataset
digits = datasets.load_digits()

# defining feature matrix (X) and response vector (y)
X = digits.data
y = digits.target

# splitting X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

# create logistic regression object
reg = linear_model.LogisticRegression()

# train the model using the training sets
reg.fit(X_train, y_train)

# making predictions on the testing set
y_pred = reg.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
print("Logistic Regression model accuracy(in %):",
      metrics.accuracy_score(y_test, y_pred) * 100)

Logistic Regression model accuracy(in %): 96.52294853963839

We can evaluate the logistic regression model using the following metrics:

  • Accuracy: Accuracy provides the proportion of correctly classified instances. \(\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total}}\)
  • Precision: Precision focuses on the accuracy of positive predictions. \(\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\)
  • Recall (Sensitivity or True Positive Rate): Recall measures the proportion of correctly predicted positive instances among all actual positive instances. \(\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\)
  • F1 Score: F1 score is the harmonic mean of precision and recall. \(F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)
  • Area Under the Receiver Operating Characteristic Curve (AUC-ROC): The ROC curve plots the true positive rate against the false positive rate at various thresholds. AUC-ROC measures the area under this curve, providing an aggregate measure of a model’s performance across different classification thresholds.
  • Area Under the Precision-Recall Curve (AUC-PR): Similar to AUC-ROC, AUC-PR measures the area under the precision-recall curve, providing a summary of a model’s performance across different precision-recall trade-offs.
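Most of these metrics are available directly in scikit-learn. The snippet below assumes the clf, X_test, y_test, and y_pred objects from the breast cancer example earlier in this article:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_prob = clf.predict_proba(X_test)[:, 1]     # predicted probability of the positive class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_prob))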

Logistic regression becomes a classification technique only when a decision threshold is brought into the picture. The setting of the threshold value is a very important aspect of Logistic regression and is dependent on the classification problem itself.

The decision about the threshold value is mainly driven by the values of precision and recall. Ideally, we want both precision and recall to be 1, but this is seldom the case.

In the case of a Precision-Recall tradeoff , we use the following arguments to decide upon the threshold:

  • Low Precision/High Recall: In applications where we want to reduce the number of false negatives without necessarily reducing the number of false positives, we choose a decision value with a low value of precision or a high value of recall. For example, in a cancer diagnosis application, we do not want any affected patient to be classified as not affected, even if that means some patients are wrongly flagged as having cancer. This is because the absence of cancer can be confirmed by further medical tests, but the presence of the disease cannot be detected in a candidate who has already been rejected.
  • High Precision/Low Recall: In applications where we want to reduce the number of false positives without necessarily reducing the number of false negatives, we choose a decision value that has a high value of Precision or a low value of Recall. For example, if we are classifying customers whether they will react positively or negatively to a personalized advertisement, we want to be absolutely sure that the customer will react positively to the advertisement because otherwise, a negative reaction can cause a loss of potential sales from the customer.

The difference between linear regression and logistic regression is that linear regression output is the continuous value that can be anything while logistic regression predicts the probability that an instance belongs to a given class or not.

Logistic Regression – Frequently Asked Questions (FAQs)

What is logistic regression in machine learning?

Logistic regression is a statistical method for developing machine learning models with a binary dependent variable. It is a statistical technique used to describe data and the relationship between one dependent variable and one or more independent variables.

What are the three types of logistic regression?

Logistic regression is classified into three types: binary, multinomial, and ordinal. They differ in execution as well as theory. Binary regression is concerned with two possible outcomes: yes or no. Multinomial logistic regression is used when there are three or more unordered categories, and ordinal logistic regression when the categories are ordered.

Why is logistic regression used for classification problems?

Logistic regression is easier to implement, interpret, and train. It classifies unknown records very quickly. When the dataset is linearly separable, it performs well. Model coefficients can be interpreted as indicators of feature importance.

What distinguishes Logistic Regression from Linear Regression?

While Linear Regression is used to predict continuous outcomes, Logistic Regression is used to predict the likelihood of an observation falling into a specific category. Logistic Regression employs an S-shaped logistic function to map predicted values between 0 and 1.

What role does the logistic function play in Logistic Regression?

Logistic Regression relies on the logistic function to convert the output into a probability score. This score represents the probability that an observation belongs to a particular class. The S-shaped curve assists in thresholding and categorising data into binary outcomes.


LogisticRegression

Logistic Regression (aka logit, MaxEnt) classifier.

In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)

This class implements regularized logistic regression using the ‘liblinear’ library, ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ solvers. Note that regularization is applied by default . It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).

The ‘newton-cg’, ‘sag’, and ‘lbfgs’ solvers support only L2 regularization with primal formulation, or no regularization. The ‘liblinear’ solver supports both L1 and L2 regularization, with a dual formulation only for the L2 penalty. The Elastic-Net regularization is only supported by the ‘saga’ solver.

Read more in the User Guide .

Specify the norm of the penalty:

None : no penalty is added;

'l2' : add a L2 penalty term and it is the default choice;

'l1' : add a L1 penalty term;

'elasticnet' : both L1 and L2 penalty terms are added.

Some penalties may not work with some solvers. See the parameter solver below, to know the compatibility between the penalty and solver.

Added in version 0.19: l1 penalty with SAGA solver (allowing ‘multinomial’ + L1)

Dual (constrained) or primal (regularized, see also this equation ) formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features.

Tolerance for stopping criteria.

Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. In this case, x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic_feature_weight .

Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.

Weights associated with classes in the form {class_label: weight} . If not given, all classes are supposed to have weight one.

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)) .

Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

Added in version 0.17: class_weight=’balanced’

Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. See Glossary for details.

Algorithm to use in the optimization problem. Default is ‘lbfgs’. To choose a solver, you might want to consider the following aspects:

For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones;

For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss;

‘liblinear’ and ‘newton-cholesky’ can only handle binary classification by default. To apply a one-versus-rest scheme for the multiclass setting one can wrap it with the OneVsRestClassifier.

‘newton-cholesky’ is a good choice for n_samples >> n_features , especially with one-hot encoded categorical features with rare categories. Be aware that the memory usage of this solver has a quadratic dependency on n_features because it explicitly computes the Hessian matrix.

The choice of the algorithm depends on the penalty chosen and on (multinomial) multiclass support.

‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing .

Refer to the User Guide for more information regarding LogisticRegression and more specifically the Table summarizing solver/penalty supports.
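For illustration (parameter values chosen arbitrarily), an L1-penalized model needs a solver such as ‘saga’ or ‘liblinear’, while the default ‘lbfgs’ solver is restricted to an L2 penalty or none; scaling the features first helps ‘sag’/‘saga’ converge:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 'saga' supports l1, l2, and elasticnet penalties
l1_model = make_pipeline(StandardScaler(),
                         LogisticRegression(penalty="l1", solver="saga",
                                            C=0.5, max_iter=5000))

# the default 'lbfgs' solver supports only an l2 penalty (or none)
l2_model = LogisticRegression(penalty="l2", solver="lbfgs", C=1.0)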

Added in version 0.17: Stochastic Average Gradient descent solver.

Added in version 0.19: SAGA solver.

Changed in version 0.22: The default solver changed from ‘liblinear’ to ‘lbfgs’ in 0.22.

Added in version 1.2: newton-cholesky solver.

Maximum number of iterations taken for the solvers to converge.

If the option chosen is ‘ovr’, then a binary problem is fit for each label. For ‘multinomial’ the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary . ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary, or if solver=’liblinear’, and otherwise selects ‘multinomial’.

Added in version 0.18: Stochastic Average Gradient descent solver for ‘multinomial’ case.

Changed in version 0.22: Default changed from ‘ovr’ to ‘auto’ in 0.22.

Deprecated since version 1.5: multi_class was deprecated in version 1.5 and will be removed in 1.7. From then on, the recommended ‘multinomial’ will always be used for n_classes >= 3 . Solvers that do not support ‘multinomial’ will raise an error. Use sklearn.multiclass.OneVsRestClassifier(LogisticRegression()) if you still want to use OvR.

For the liblinear and lbfgs solvers set verbose to any positive number for verbosity.

When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution. Useless for liblinear solver. See the Glossary .

Added in version 0.17: warm_start to support lbfgs , newton-cg , sag , saga solvers.

Number of CPU cores used when parallelizing over classes if multi_class=’ovr’”. This parameter is ignored when the solver is set to ‘liblinear’ regardless of whether ‘multi_class’ is specified or not. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1 . Only used if penalty='elasticnet' . Setting l1_ratio=0 is equivalent to using penalty='l2' , while setting l1_ratio=1 is equivalent to using penalty='l1' . For 0 < l1_ratio <1 , the penalty is a combination of L1 and L2.

A list of class labels known to the classifier.

Coefficient of the features in the decision function.

coef_ is of shape (1, n_features) when the given problem is binary. In particular, when multi_class='multinomial' , coef_ corresponds to outcome 1 (True) and -coef_ corresponds to outcome 0 (False).

Intercept (a.k.a. bias) added to the decision function.

If fit_intercept is set to False, the intercept is set to zero. intercept_ is of shape (1,) when the given problem is binary. In particular, when multi_class='multinomial' , intercept_ corresponds to outcome 1 (True) and -intercept_ corresponds to outcome 0 (False).

Number of features seen during fit .

Added in version 0.24.

Names of features seen during fit . Defined only when X has feature names that are all strings.

Added in version 1.0.

Actual number of iterations for all classes. If binary or multinomial, it returns only 1 element. For liblinear solver, only the maximum number of iteration across all classes is given.

Changed in version 0.20: In SciPy <= 1.0.0 the number of lbfgs iterations may exceed max_iter . n_iter_ will now report at most max_iter .

Incrementally trained logistic regression (when given the parameter loss="log_loss" ).

Logistic regression with built-in cross validation.

The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon, to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.

Predict output may not match that of standalone liblinear in certain cases. See differences from liblinear in the narrative documentation.


Predict confidence scores for samples.

The confidence score for a sample is proportional to the signed distance of that sample to the hyperplane.

The data matrix for which we want to get the confidence scores.

Confidence scores per (n_samples, n_classes) combination. In the binary case, confidence score for self.classes_[1] where >0 means this class would be predicted.

Convert coefficient matrix to dense array format.

Converts the coef_ member (back) to a numpy.ndarray. This is the default format of coef_ and is required for fitting, so calling this method is only required on models that have previously been sparsified; otherwise, it is a no-op.

Fitted estimator.

Fit the model according to the given training data.

Training vector, where n_samples is the number of samples and n_features is the number of features.

Target vector relative to X.

Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight.

Added in version 0.17: sample_weight support to LogisticRegression.

The SAGA solver supports both float64 and float32 bit arrays.

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

A MetadataRequest encapsulating routing information.

Get parameters for this estimator.

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Parameter names mapped to their values.

Predict class labels for samples in X.

The data matrix for which we want to get the predictions.

Vector containing the class labels for each sample.

Predict logarithm of probability estimates.

The returned estimates for all classes are ordered by the label of classes.

Vector to be scored, where n_samples is the number of samples and n_features is the number of features.

Returns the log-probability of the sample for each class in the model, where classes are ordered as they are in self.classes_ .

Probability estimates.

For a multi_class problem, if multi_class is set to “multinomial” the softmax function is used to find the predicted probability of each class. Otherwise a one-vs-rest approach is used, i.e. the probability of each class is calculated assuming it to be positive using the logistic function, and these values are normalized across all the classes.

Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_ .

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Test samples.

True labels for X .

Sample weights.

Mean accuracy of self.predict(X) w.r.t. y .
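To see how fit, predict, predict_proba, decision_function, and score fit together in practice, here is a minimal sketch on synthetic data (the dataset and parameter choices are illustrative only, not part of the reference documentation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

print(clf.predict(X_test[:3]))            # hard class labels
print(clf.predict_proba(X_test[:3]))      # probabilities, columns ordered as clf.classes_
print(clf.decision_function(X_test[:3]))  # signed distance to the separating hyperplane
print(clf.score(X_test, y_test))          # mean accuracy of predict(X_test) w.r.t. y_test
```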

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config ). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True : metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

False : metadata is not requested and the meta-estimator will not pass it to fit .

None : metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str : metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default ( sklearn.utils.metadata_routing.UNCHANGED ) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline . Otherwise it has no effect.

Metadata routing for sample_weight parameter in fit .

The updated object.

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline ). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Estimator parameters.

Estimator instance.

Request metadata passed to the score method.

True : metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

False : metadata is not requested and the meta-estimator will not pass it to score .

Metadata routing for sample_weight parameter in score .

Convert coefficient matrix to sparse format.

Converts the coef_ member to a scipy.sparse matrix, which for L1-regularized models can be much more memory- and storage-efficient than the usual numpy.ndarray representation.

The intercept_ member is not converted.

For non-sparse models, i.e. when there are not many zeros in coef_ , this may actually increase memory usage, so use this method with care. A rule of thumb is that the number of zero elements, which can be computed with (coef_ == 0).sum() , must be more than 50% for this to provide significant benefits.

After calling this method, further fitting with the partial_fit method (if any) will not work until you call densify.
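As a rough illustration of that rule of thumb (a sketch on synthetic data; the penalty strength C=0.1 is an arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=100, n_informative=10, random_state=0)

# An L1 penalty drives many coefficients exactly to zero
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

zero_fraction = (clf.coef_ == 0).mean()
print(f"fraction of zero coefficients: {zero_fraction:.2f}")

if zero_fraction > 0.5:   # rule of thumb quoted above
    clf.sparsify()        # coef_ becomes a scipy.sparse matrix
```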

Gallery examples

  • Release Highlights for scikit-learn 1.5
  • Release Highlights for scikit-learn 1.3
  • Release Highlights for scikit-learn 1.1
  • Release Highlights for scikit-learn 1.0
  • Release Highlights for scikit-learn 0.24
  • Release Highlights for scikit-learn 0.23
  • Release Highlights for scikit-learn 0.22
  • Probability Calibration curves
  • Plot classification probability
  • Feature transformations with ensembles of trees
  • Plot class probabilities calculated by the VotingClassifier
  • Model-based and sequential feature selection
  • Recursive feature elimination
  • Recursive feature elimination with cross-validation
  • Comparing various online solvers
  • L1 Penalty and Sparsity in Logistic Regression
  • Logistic Regression 3-class Classifier
  • Logistic function
  • MNIST classification using multinomial logistic + L1
  • Multiclass sparse logistic regression on 20newgroups
  • Plot multinomial and One-vs-Rest Logistic Regression
  • Regularization path of L1- Logistic Regression
  • Displaying Pipelines
  • Displaying estimators and complex pipelines
  • Introducing the set_output API
  • Visualizations with Display Objects
  • Class Likelihood Ratios to measure classification performance
  • Multiclass Receiver Operating Characteristic (ROC)
  • Post-hoc tuning the cut-off point of decision function
  • Post-tuning the decision threshold for cost-sensitive learning
  • Multilabel classification using a classifier chain
  • Restricted Boltzmann Machine features for digit classification
  • Column Transformer with Mixed Types
  • Pipelining: chaining a PCA and a logistic regression
  • Feature discretization
  • Digits Classification Exercise
  • Classification of text documents using sparse features

An Introduction to Data Analysis

15.2 Logistic regression

Suppose \(y \in \{0,1\}^n\) is an \(n\) -dimensional vector of binary outcomes, and \(X\) a predictor matrix for a linear regression model. A Bayesian logistic regression model has the following form:

\[ \begin{align*} \beta & \sim \text{some prior} \\ \xi & = X \beta && \text{[linear predictor]} \\ \eta_i & = \text{logistic}(\xi_i) && \text{[predictor of central tendency]} \\ y_i & \sim \text{Bernoulli}(\eta_i) && \text{[likelihood]} \\ \end{align*} \] The logistic function used as a link function is a function in \(\mathbb{R} \rightarrow [0;1]\) , i.e., from the reals to the unit interval. It is defined as:

\[\text{logistic}(\xi_i) = (1 + \exp(-\xi_i))^{-1}\] Its shape (a sigmoid, or S-shaped curve) is this:

[Figure: the logistic function, an S-shaped curve rising from 0 to 1]

We use the Simon task data as an example application. So far we only tested the first of two hypotheses about the Simon task data, namely the hypothesis relating to reaction times. The second hypothesis which arose in the context of the Simon task refers to the accuracy of answers, i.e., the proportion of “correct” choices:

\[ \text{Accuracy}_{\text{correct},\ \text{congruent}} > \text{Accuracy}_{\text{correct},\ \text{incongruent}} \] Notice that correctness is a binary categorical variable. Therefore, we use logistic regression to test this hypothesis.

Here is how to set up a logistic regression model with brms . The only thing that is new here is that we specify explicitly the likelihood function and the (inverse!) link function. This is done using the syntax family = bernoulli(link = "logit") . For later hypothesis testing we also use proper priors and take samples from the prior as well.

The Bayesian summary statistics of the posterior samples of values for regression coefficients are:

What do these specific numerical estimates for coefficients mean? The mean estimate for the linear predictor \(\xi_\text{cong}\) for the “congruent” condition is roughly 3.204. The mean estimate for the linear predictor \(\xi_\text{inc}\) for the “incongruent” condition is roughly 3.204 − 0.726, so roughly 2.478. The central predictors corresponding to these linear predictors are:

\[ \begin{align*} \eta_\text{cong} & = \text{logistic}(3.204) \approx 0.961 \\ \eta_\text{incon} & = \text{logistic}(2.478) \approx 0.923 \end{align*} \]
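As a quick numerical check of these two values (a small Python sketch; the chapter itself works in R with brms):

```python
import math

def logistic(x):
    # inverse of the logit link: maps the reals to (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(round(logistic(3.204), 3))          # ~0.961 (congruent condition)
print(round(logistic(3.204 - 0.726), 3))  # ~0.923 (incongruent condition)
```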

These central estimates for the latent proportion of “correct” answers in each condition tightly match the empirically observed proportion of “correct” answers in the data:

Testing hypotheses for a logistic regression model works exactly the same way as for a standard regression model. And so, we find very strong support for hypothesis 2, suggesting that (given model and data), there is reason to believe that the accuracy in incongruent trials is lower than in congruent trials.

Notice that the logit function is the inverse of the logistic function. ↩︎


Logistic Regression in Python - A Step-by-Step Guide


In our last article, we learned about the theoretical underpinnings of logistic regression and how it can be used to solve machine learning classification problems.

This tutorial will teach you more about logistic regression machine learning techniques by teaching you how to build logistic regression models in Python.

Table of Contents

You can skip to a specific section of this Python logistic regression tutorial using the table of contents below:

  • The Data Set We Will Be Using in This Tutorial
  • The Imports We Will Be Using in This Tutorial
  • Importing the Data Set Into Our Python Script
  • Learning About Our Data Set With Exploratory Data Analysis
  • The Prevalence of Each Classification Category
  • Survival Rates Between Genders
  • Survival Rates Between Passenger Classes
  • The Age Distribution of Titanic Passengers
  • The Ticket Price Distribution of Titanic Passengers
  • Removing Null Data From Our Data Set
  • Building a Logistic Regression Model
  • Removing Columns With Too Much Missing Data
  • Handling Categorical Data With Dummy Variables
  • Adding Dummy Variables to the pandas DataFrame
  • Removing Unnecessary Columns From the Data Set
  • Creating Training Data and Test Data
  • Training the Logistic Regression Model
  • Making Predictions With Our Logistic Regression Model
  • Measuring the Performance of a Logistic Regression Machine Learning Model
  • The Full Code for This Tutorial
  • Final Thoughts

The Titanic data set is a very famous data set that contains characteristics about the passengers on the Titanic. It is often used as an introductory data set for logistic regression problems.

In this tutorial, we will be using the Titanic data set combined with a Python logistic regression model to predict whether or not a passenger survived the Titanic crash.

The original Titanic data set is publicly available on Kaggle.com , which is a website that hosts data sets and data science competitions.

To make things easier for you as a student in this course, we will be using a semi-cleaned version of the Titanic data set, which will save you time on data cleaning and manipulation.

The cleaned Titanic data set has actually already been made available for you. You can download the data file by clicking the links below:

  • Titanic data

Once this file has been downloaded, open a Jupyter Notebook in the same working directory and we can begin building our logistic regression model.

As before, we will be using multiple open-source software libraries in this tutorial. Here are the imports you will need to run to follow along as I code through our Python logistic regression model:
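The import block itself is not reproduced in this excerpt; a typical set of imports for this tutorial would look something like the following (the exact list is an assumption):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# In a Jupyter Notebook you may also want:
# %matplotlib inline
```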

Next, we will need to import the Titanic data set into our Python script.

We will be using pandas' read_csv method to import our csv files into pandas DataFrames called titanic_data .

Here is the code to do this:
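A sketch of the import step ('titanic_train.csv' is a placeholder for whatever name you saved the downloaded file under):

```python
import pandas as pd

titanic_data = pd.read_csv('titanic_train.csv')
```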

Next, let's investigate what data is actually included in the Titanic data set. There are two main methods to do this (using the titanic_data DataFrame specifically):

  • The titanic_data.head(5) method will print the first 5 rows of the DataFrame. You can substitute 5 with whichever number you'd like.
  • You can also print titanic_data.columns , which will show you the column names.

Running the second command ( titanic_data.columns ) generates the following output:

These are the names of the columns in the DataFrame. Here are brief explanations of each data point:

  • PassengerId : a numerical identifier for every passenger on the Titanic.
  • Survived : a binary identifier that indicates whether or not the passenger survived the Titanic crash. This variable will hold a value of 1 if they survived and 0 if they did not.
  • Pclass : the passenger class of the passenger in question. This can hold a value of 1 , 2 , or 3 , depending on where the passenger was located in the ship.
  • Name : the passenger's name.
  • Sex : male or female.
  • Age : the age (in years) of the passenger.
  • SibSp : the number of siblings and spouses aboard the ship.
  • Parch : the number of parents and children aboard the ship.
  • Ticket : the passenger's ticket number.
  • Fare : how much the passenger paid for their ticket on the Titanic.
  • Cabin : the passenger's cabin number.
  • Embarked : the port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

Next up, we will learn more about our data set by using some basic exploratory data analysis techniques.

When using machine learning techniques to model classification problems, it is always a good idea to have a sense of the ratio between categories. For this specific problem, it's useful to see how many survivors vs. non-survivors exist in our training data.

An easy way to visualize this is using the seaborn countplot . In this example, you could create the appropriate seaborn plot with the following Python code:
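A sketch of that command (assuming the imports above):

```python
sns.countplot(x='Survived', data=titanic_data)
plt.show()
```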

This generates the following plot:

A seaborn countplot

As you can see, we have many more incidences of non-survivors than we do of survivors.

It is also useful to compare survival rates relative to some other data feature. For example, we can compare survival rates between the Male and Female values for Sex using the following Python code:
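For example:

```python
sns.countplot(x='Survived', hue='Sex', data=titanic_data)
plt.show()
```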

A seaborn countplot with a Sex hue

As you can see, passengers with a Sex of Male were much more likely to be non-survivors than passengers with a Sex of Female .

We can perform a similar analysis using the Pclass variable to see which passenger class was the most (and least) likely to have passengers that were survivors.
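The analogous command:

```python
sns.countplot(x='Survived', hue='Pclass', data=titanic_data)
plt.show()
```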

A seaborn countplot with a Pclass hue

The most noticeable observation from this plot is that passengers with a Pclass value of 3 - which indicates the third class, which was the cheapest and least luxurious - were much more likely to die when the Titanic crashed.

One other useful analysis we could perform is investigating the age distribution of Titanic passengers. A histogram is an excellent tool for this.

You can generate a histogram of the Age variable with the following code:
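One way to do this (the bin count is an arbitrary choice):

```python
plt.hist(titanic_data['Age'].dropna(), bins=30)
plt.xlabel('Age')
plt.show()
```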

Note that the dropna() method is necessary since the data set contains several null values.

Here is the histogram that this code generates:

A histogram of age variables from the titanic data set

As you can see, there is a concentration of Titanic passengers with an Age value between 20 and 40 .

The last exploratory data analysis technique that we will use is investigating the distribution of fare prices within the Titanic data set.

You can do this with the following code:
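A sketch of that command (again, the bin count is arbitrary):

```python
plt.hist(titanic_data['Fare'], bins=40)
plt.xlabel('Fare')
plt.show()
```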

A histogram of fare variables from the titanic data set

As you can see, there are three distinct groups of Fare prices within the Titanic data set. This makes sense because there are also three unique values for the Pclass variable. The different Fare groups correspond to the different Pclass categories.

Since the Titanic data set is a real-world data set, it contains some missing data. We will learn how to deal with missing data in the next section.

To start, let's examine where our data set contains missing data. To do this, run the following command:
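The command in question:

```python
titanic_data.isnull()
```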

This will generate a DataFrame of boolean values where the cell contains True if it is a null value and False otherwise. Here is an image of what this looks like:

A DataFrame of boolean values indicating where null data exists

A far more useful method for assessing missing data in this data set is by creating a quick visualization. To do this, we can use the seaborn visualization library. Here is a quick command that you can use to create a heatmap using the seaborn library:
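One possible version of that command (cbar=False simply hides the color bar):

```python
sns.heatmap(titanic_data.isnull(), cbar=False)
plt.show()
```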

Here is the visualization that this generates:

A DataFrame of boolean values indicating where null data exists

In this visualization, the white lines indicate missing values in the dataset. You can see that the Age and Cabin columns contain the majority of the missing data in the Titanic data set.

The Age column in particular contains a small enough amount of missing data that we can fill it in using some form of mathematics. On the other hand, the Cabin data is missing enough data that we could probably remove it from our model entirely.

The process of filling in missing data with average data from the rest of the data set is called imputation . We will now use imputation to fill in the missing data from the Age column.

The most basic form of imputation would be to fill in the missing Age data with the average Age value across the entire data set. However, there are better methods.

We will fill in the missing Age values with the average Age value for the specific Pclass passenger class that the passenger belongs to. To understand why this is useful, consider the following boxplot:

A boxplot of age values stratified by passenger classes

As you can see, the passengers with a Pclass value of 1 (the most expensive passenger class) tend to be the oldest while the passengers with a Pclass value of 3 (the cheapest) tend to be the youngest. This is very logical, so we will use the average Age value within each Pclass group to impute the missing data in our Age column.

The easiest way to perform imputation on a data set like the Titanic data set is by building a custom function. To start, we will need to determine the mean Age value for each Pclass value.

Here is the final function that we will use to impute our missing Age values:
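A sketch of such a function (the name impute_missing_age and the exact structure are illustrative; the tutorial's own version may differ slightly):

```python
def impute_missing_age(row):
    if pd.isnull(row['Age']):
        # Fall back to the mean Age of the passenger's Pclass
        return titanic_data[titanic_data['Pclass'] == row['Pclass']]['Age'].mean()
    return row['Age']
```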

Now that this imputation function is complete, we need to apply it to every row in the titanic_data DataFrame. pandas' apply method is an excellent tool for this:
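Continuing the sketch above:

```python
titanic_data['Age'] = titanic_data[['Age', 'Pclass']].apply(impute_missing_age, axis=1)
```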

Now that we have performed imputation on every row to deal with our missing Age data, let's regenerate our missing-data heatmap:

You will notice there is no longer any missing data in the Age column of our pandas DataFrame!

You might be wondering why we spent so much time dealing with missing data in the Age column specifically. It is because given the impact of Age on survival for most disasters and diseases, it is a variable that is likely to have high predictive value within our data set.

Now that we have an understanding of the structure of this data set and have removed its missing data, let's begin building our logistic regression machine learning model.

It is now time to finish preparing our data for the logistic regression model.

First, let's remove the Cabin column. As we mentioned, the high prevalence of missing data in this column means that it is unwise to impute the missing data, so we will remove it entirely with the following code:
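For example:

```python
titanic_data.drop('Cabin', axis=1, inplace=True)
```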

Next, let's remove any remaining rows that contain missing data with the pandas dropna() method:
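A sketch of that step:

```python
titanic_data.dropna(inplace=True)
```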

The next task we need to handle is dealing with categorical features. Namely, we need to find a way to numerically work with observations that are not naturally numerical.

A great example of this is the Sex column, which has two values: Male and Female . Similarly, the Embarked column contains a single letter which indicates which city the passenger departed from.

To solve this problem, we will create dummy variables . These assign a numerical value to each category of a non-numerical feature.

Fortunately, pandas has a built-in method called get_dummies() that makes it easy to create dummy variables. The get_dummies method does have one issue - it will create a new column for each value in the DataFrame column.

Let's consider an example to help understand this better. If we call the get_dummies() method on the Sex column, we get the following output:

An example of the pandas get_dummies method

As you can see, this creates two new columns: female and male . These columns will both be perfect predictors of each other, since a value of 0 in the female column indicates a value of 1 in the male column, and vice versa.

This is called multicollinearity and it significantly reduces the predictive power of your algorithm. To remove this, we can add the argument drop_first = True to the get_dummies method like this:
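For example:

```python
pd.get_dummies(titanic_data['Sex'], drop_first=True)
```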

Now, let's create dummy variable columns for our Sex and Embarked columns, and assign them to variables called sex and embarked .

There is one important thing to note about the embarked variable defined below. It has two columns: Q and S , but since we've already removed one other column (the C column), neither of the remaining two columns are perfect predictors of each other, so multicollinearity does not exist in the new, modified data set.
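A sketch of those two assignments:

```python
sex = pd.get_dummies(titanic_data['Sex'], drop_first=True)
embarked = pd.get_dummies(titanic_data['Embarked'], drop_first=True)
```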

Next we need to add our sex and embarked columns to the DataFrame.

You can concatenate these data columns into the existing pandas DataFrame with the following code:
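One way to write this:

```python
titanic_data = pd.concat([titanic_data, sex, embarked], axis=1)
```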

Now if you run the command print(titanic_data.columns) , your Jupyter Notebook will generate the following output:

The existence of the male , Q , and S columns shows that our data was concatenated successfully.

This means that we can now drop the original Sex and Embarked columns from the DataFrame. There are also other columns (like Name , PassengerId , Ticket ) that are not predictive of Titanic crash survival rates, so we will remove those as well. The following code handles this for us:
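A sketch of that cleanup:

```python
titanic_data.drop(['Name', 'PassengerId', 'Ticket', 'Sex', 'Embarked'], axis=1, inplace=True)
```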

If you print titanic_data.columns now, your Jupyter Notebook will generate the following output:

The DataFrame now has the following appearance:

The final DataFrame for our logistic regression model

As you can see, every field in this data set is now numeric, which makes it an excellent candidate for a logistic regression machine learning algorithm.

Next, it's time to split our titanic_data into training data and test data. As before, we will use built-in functionality from scikit-learn to do this.

First, we need to divide our data into x values (the data we will be using to make predictions) and y values (the data we are attempting to predict). The following code handles this:
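For example (the names x_data and y_data are illustrative):

```python
y_data = titanic_data['Survived']
x_data = titanic_data.drop('Survived', axis=1)
```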

Next, we need to import the train_test_split function from scikit-learn . The following code executes this import:
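```python
from sklearn.model_selection import train_test_split
```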

Lastly, we can use the train_test_split function combined with list unpacking to generate our training data and test data:
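A sketch of that call (the random_state value is an arbitrary choice added here for reproducibility):

```python
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(
    x_data, y_data, test_size=0.3, random_state=42
)
```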

Note that in this case, the test data is 30% of the original data set as specified with the parameter test_size = 0.3 .

We have now created our training data and test data for our logistic regression model. We will train our model in the next section of this tutorial.

To train our model, we will first need to import the appropriate model from scikit-learn with the following command:
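```python
from sklearn.linear_model import LogisticRegression
```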

Next, we need to create our model by instantiating an instance of the LogisticRegression object:
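```python
model = LogisticRegression()
```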

To train the model, we need to call the fit method on the LogisticRegression object we just created and pass in our x_training_data and y_training_data variables, like this:
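```python
model.fit(x_training_data, y_training_data)
```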

Our model has now been trained. We will begin making predictions using this model in the next section of this tutorial.

Let's make a set of predictions on our test data using the logistic regression model we just created. We will store these predictions in a variable called predictions :
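```python
predictions = model.predict(x_test_data)
```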

Our predictions have been made. Let's examine the accuracy of our model next.

scikit-learn has an excellent built-in module called classification_report that makes it easy to measure the performance of a classification machine learning model. We will use this module to measure the performance of the model that we just created.

First, let's import the module:
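```python
from sklearn.metrics import classification_report
```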

Next, let's use the module to calculate the performance metrics for our logistic regression machine learning model:
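```python
print(classification_report(y_test_data, predictions))
```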

Here is the output of this command:

If you're interested in seeing the raw confusion matrix and calculating the performance metrics manually, you can do this with the following code:
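A sketch of that step:

```python
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test_data, predictions))
```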

This generates the following output:

You can view the full code for this tutorial in this GitHub repository .

In this tutorial, you learned how to build logistic regression machine learning models in Python.

Here is a brief summary of what you learned in this article:

  • Why the Titanic data set is often used for learning machine learning classification techniques
  • How to perform exploratory data analysis when working with a data set for classification machine learning problems
  • How to handle missing data in a pandas DataFrame
  • What imputation means and how you can use it to fill in missing data
  • How to create dummy variables for categorical data in machine learning data sets
  • How to train a logistic regression machine learning model in Python
  • How to make predictions using a logistic regression model in Python
  • How to use scikit-learn 's classification_report to quickly calculate performance metrics for machine learning classification problems

Companion to BER 642: Advanced Regression Methods

Chapter 10 Binary Logistic Regression

10.1 Introduction

Logistic regression is a technique used when the dependent variable is categorical (or nominal). Examples: 1) Consumers make a decision to buy or not to buy, 2) a product may pass or fail quality control, 3) there are good or poor credit risks, and 4) employee may be promoted or not.

Binary logistic regression - determines the impact of multiple independent variables presented simultaneously to predict membership of one or other of the two dependent variable categories.

Since the dependent variable is dichotomous, we cannot predict a numerical value for it, so the usual least-squares criterion of minimizing squared deviations around a line of best fit is inappropriate (deviations from a fitted line are not meaningful for a binary response).

Instead, logistic regression employs binomial probability theory, in which there are only two outcomes to predict: the probability p that the case belongs to one group, versus 1 − p that it belongs to the other.

Logistic regression forms a best fitting equation or function using the maximum likelihood (ML) method, which maximizes the probability of classifying the observed data into the appropriate category given the regression coefficients.

Like multiple regression, logistic regression provides a coefficient ‘b’, which measures each independent variable’s partial contribution to variations in the dependent variable.

The goal is to correctly predict the category of outcome for individual cases using the most parsimonious model.

To accomplish this goal, a model (i.e. an equation) is created that includes all predictor variables that are useful in predicting the response variable.

10.2 The Purpose of Binary Logistic Regression

  • The logistic regression predicts group membership

Since logistic regression calculates the probability of success over the probability of failure, the results of the analysis are in the form of an odds ratio.

Logistic regression determines the impact of multiple independent variables presented simultaneously to predict membership of one or other of the two dependent variable categories.

  • The logistic regression also provides the relationships and strengths among the variables

Assumptions of (Binary) Logistic Regression

Logistic regression does not assume a linear relationship between the dependent and independent variables.

  • Logistic regression assumes linearity of independent variables and log odds of dependent variable.

The independent variables need not be interval, nor normally distributed, nor linearly related, nor of equal variance within each group

  • Homoscedasticity is not required. The error terms (residuals) do not need to be normally distributed.

The dependent variable in logistic regression is not measured on an interval or ratio scale.

  • The dependent variable must be a dichotomous ( 2 categories) for the binary logistic regression.

The categories (groups) as a dependent variable must be mutually exclusive and exhaustive; a case can only be in one group and every case must be a member of one of the groups.

Larger samples are needed than for linear regression because maximum likelihood estimates of the coefficients are large-sample estimates. A minimum of 50 cases per predictor is recommended (Field, 2013).

Hosmer, Lemeshow, and Sturdivant (2013) suggest a minimum sample of 10 observations per independent variable in the model, but caution that 20 observations per variable should be sought if possible.

Leblanc and Fitzgerald (2000) suggest a minimum of 30 observations per independent variable.

10.3 Log Transformation

The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality.

  • Log transformations and square root transformations move skewed distributions closer to normality. So what we are about to do is common.

This logistic transformation of the probability p enables us to create a link with the normal regression equation. The resulting quantity (the logistic transformation of p) is also called the logit of p, or logit(p).

In logistic regression, a logistic transformation of the odds (referred to as the logit) serves as the dependent variable:

\[\log (o d d s)=\operatorname{logit}(P)=\ln \left(\frac{P}{1-P}\right)\] If we take the above dependent variable and add a regression equation for the independent variables, we get a logistic regression:

\[\operatorname{logit}(p)=a+b_{1} x_{1}+b_{2} x_{2}+b_{3} x_{3}+\ldots\] As in least-squares regression, the relationship between logit(P) and X is assumed to be linear.

10.4 Equation

\[P=\frac{\exp \left(a+b_{1} x_{1}+b_{2} x_{2}+b_{3} x_{3}+\ldots\right)}{1+\exp \left(a+b_{1} x_{1}+b_{2} x_{2}+b_{3} x_{3}+\ldots\right)}\] In the equation above:

P = the probability that a case is in a particular category,

exp = the exponential function (base e, approximately 2.72),

a = the constant (or intercept) of the equation and,

b = the coefficient (or slope) of the predictor variables.
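A tiny worked example of this formula (the coefficient and predictor values below are made up purely to show the arithmetic):

```python
import math

a, b1 = -1.5, 0.8   # hypothetical intercept and slope
x1 = 2.0            # hypothetical predictor value

numerator = math.exp(a + b1 * x1)
P = numerator / (1 + numerator)
print(round(P, 3))  # probability of being in the "success" category, here ~0.525
```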

10.5 Hypothesis Test

In logistic regression, two hypotheses are of interest:

the null hypothesis , which is when all the coefficients in the regression equation take the value zero, and

the alternate hypothesis that the model currently under consideration is accurate and differs significantly from the null of zero, i.e., it gives predictions significantly better than the chance or random prediction level implied by the null hypothesis.

10.6 Likelihood Ratio Test for Nested Models

The likelihood ratio test is based on −2LL (the deviance). It tests the significance of the difference between −2LL for the researcher's model with predictors and −2LL for the baseline model with only a constant in it; this difference is called the model chi-square.

Significance at the .05 level or lower means the researcher’s model with the predictors is significantly different from the one with the constant only (all ‘b’ coefficients being zero). It measures the improvement in fit that the explanatory variables make compared to the null model.

Chi square is used to assess significance of this ratio.

10.7 R Lab: Running Binary Logistic Regression Model

10.7.1 Data Explanation (data set: class.sav)

A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution, affect admission into graduate school. The response variable, admit/don’t admit, is a binary variable.

This dataset has a binary response (outcome, dependent) variable called admit, which is equal to 1 if the individual was admitted to graduate school, and 0 otherwise.

There are three predictor variables: GRE, GPA, and rank. We will treat the variables GRE and GPA as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.

10.7.2 Explore the data

We can get basic descriptives for the entire data set by using summary. To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.

Before we run a binary logistic regression, we need to check a two-way contingency table of the categorical outcome and predictors. We want to make sure there are no zero cells.


10.7.3 Running a Logistic Regression Model
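The chapter fits this model in R with glm. As a rough Python counterpart (a sketch that assumes the admissions data is saved as a CSV with columns admit, gre, gpa, and rank; the file name below is a placeholder), the same model can be fit with statsmodels:

```python
import pandas as pd
import statsmodels.formula.api as smf

admissions = pd.read_csv("admissions.csv")  # placeholder file name

# C(rank) treats rank as categorical; rank 1 becomes the reference level
mylogit = smf.logit("admit ~ gre + gpa + C(rank)", data=admissions).fit()
print(mylogit.summary())
```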

In the output above, the first thing we see is the call; this is R reminding us what model we ran, what options we specified, etc.

Next we see the deviance residuals, which are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to assess model fit.

The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. Both gre and gpa are statistically significant, as are the three terms for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.

How to do the interpretation?

For every one unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.

For a one unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.

The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -0.675.

Below the table of coefficients are fit indices, including the null deviance, the residual deviance, and the AIC. Later we show an example of how you can use these values to help assess model fit.

Why are the coefficient values for rank (B) different from the SPSS outputs? In R, glm automatically uses rank 1 as the reference group, whereas in our SPSS example we set rank 4 as the reference group.

We can test for an overall effect of rank using the wald.test function of the aod library. The order in which the coefficients are given in the table of coefficients is the same as the order of the terms in the model. This is important because the wald.test function refers to the coefficients by their order in the model. In the call to wald.test, b supplies the coefficients, Sigma supplies the variance-covariance matrix of the error terms, and Terms tells R which terms in the model are to be tested; in this case, terms 4, 5, and 6 are the three terms for the levels of rank.

The chi-squared test statistic of 20.9, with three degrees of freedom, is associated with a p-value of 0.00011, indicating that the overall effect of rank is statistically significant.

We can also test additional hypotheses about the differences in the coefficients for the different levels of rank. Below we test that the coefficient for rank=2 is equal to the coefficient for rank=3. The first line of code below creates a vector l that defines the test we want to perform. In this case, we want to test the difference (subtraction) of the terms for rank=2 and rank=3 (i.e., the 4th and 5th terms in the model). To contrast these two terms, we multiply one of them by 1, and the other by -1. The other terms in the model are not involved in the test, so they are multiplied by 0. The second line of code below uses L=l to tell R that we wish to base the test on the vector l (rather than using the Terms option as we did above).

The chi-squared test statistic of 5.5 with 1 degree of freedom is associated with a p-value of 0.019, indicating that the difference between the coefficient for rank=2 and the coefficient for rank=3 is statistically significant.

You can also exponentiate the coefficients and interpret them as odds-ratios. R will do this computation for you. To get the exponentiated coefficients, you tell R that you want to exponentiate (exp), and that the object you want to exponentiate is called coefficients and it is part of mylogit (coef(mylogit)). We can use the same logic to get odds ratios and their confidence intervals, by exponentiating the confidence intervals from before. To put it all in one table, we use cbind to bind the coefficients and confidence intervals column-wise.
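Continuing the statsmodels sketch from above, the analogous computation would be:

```python
import numpy as np
import pandas as pd

odds_ratios = np.exp(mylogit.params)   # exponentiated coefficients
odds_ci = np.exp(mylogit.conf_int())   # exponentiated 95% confidence limits
print(pd.concat([odds_ratios, odds_ci], axis=1))
```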

Now we can say that for a one unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23.

For more information on interpreting odds ratios see our FAQ page: How do I interpret odds ratios in logistic regression? Link:

Note that while R produces it, the odds ratio for the intercept is not generally interpreted.

You can also use predicted probabilities to help you understand the model. Predicted probabilities can be computed for both categorical and continuous predictor variables. In order to create predicted probabilities, we first need to create a new data frame with the values we want the independent variables to take on.

We will start by calculating the predicted probability of admission at each value of rank, holding gre and gpa at their means.

These objects must have the same names as the variables in your logistic regression above (e.g. in this example the mean for gre must be named gre). Now that we have the data frame we want to use to calculate the predicted probabilities, we can tell R to create them. The first line of code below is quite compact, so we will break it apart to discuss what the various components do. newdata1$rankP tells R that we want to create a new variable called rankP in the data frame newdata1; the rest of the command tells R that the values of rankP should be predictions made using the predict() function, based on the analysis mylogit, with values of the predictor variables coming from newdata1, and with the type of prediction being a predicted probability (type="response"). The second line of code lists the values in the data frame newdata1. Although not particularly pretty, this is a table of predicted probabilities.

In the above output we see that the predicted probability of being accepted into a graduate program is 0.52 for students from the highest prestige undergraduate institutions (rank=1), and 0.18 for students from the lowest ranked institutions (rank=4), holding gre and gpa at their means.

Now, we are going to do something that does not exist in our SPSS section.

The code to generate the predicted probabilities (the first line below) is the same as before, except we are also going to ask for standard errors so we can plot a confidence interval. We get the estimates on the link scale and back transform both the predicted values and confidence limits into probabilities.

It can also be helpful to use graphs of predicted probabilities to understand and/or present the model. We will use the ggplot2 package for graphing.

[Figure: ggplot2 plot of predicted admission probabilities with confidence bands]

We may also wish to see measures of how well our model fits. This can be particularly useful when comparing competing models. The output produced by summary(mylogit) included indices of fit (shown below the coefficients), including the null deviance, the residual deviance, and the AIC. One measure of model fit is the significance of the overall model. This test asks whether the model with predictors fits significantly better than a model with just an intercept (i.e., a null model). The test statistic is the difference between the residual deviance for the model with predictors and the residual deviance for the null model. The test statistic is distributed chi-squared with degrees of freedom equal to the difference in degrees of freedom between the current and the null model (i.e., the number of predictor variables in the model). To find the difference in deviance for the two models (i.e., the test statistic), we subtract the residual deviance from the null deviance.
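With the statsmodels fit from the earlier sketch, the same overall test can be computed as follows:

```python
from scipy import stats

# Likelihood-ratio test against the intercept-only (null) model:
# 2 * (llf - llnull) equals null deviance minus residual deviance
lr_stat = 2 * (mylogit.llf - mylogit.llnull)
df_diff = mylogit.df_model            # number of predictor terms in the model
p_value = stats.chi2.sf(lr_stat, df_diff)
print(lr_stat, df_diff, p_value)
```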

10.8 Things to consider

Empty cells or small cells: You should check for empty or small cells by doing a crosstab between categorical predictors and the outcome variable. If a cell has very few cases (a small cell), the model may become unstable or it might not run at all.

Separation or quasi-separation (also called perfect prediction), a condition in which the outcome does not vary at some levels of the independent variables. See our page FAQ: What is complete or quasi-complete separation in logistic/probit regression and how do we deal with them? for information on models with perfect prediction. Link

Sample size: Both logit and probit models require more cases than OLS regression because they use maximum likelihood estimation techniques. It is sometimes possible to estimate models for binary outcomes in datasets with only a small number of cases using exact logistic regression. It is also important to keep in mind that when the outcome is rare, even if the overall dataset is large, it can be difficult to estimate a logit model.

Pseudo-R-squared: Many different measures of pseudo-R-squared exist. They all attempt to provide information similar to that provided by R-squared in OLS regression; however, none of them can be interpreted exactly as R-squared in OLS regression is interpreted. For a discussion of various pseudo-R-squareds see Long and Freese (2006) or our FAQ page What are pseudo R-squareds? Link

Diagnostics: The diagnostics for logistic regression are different from those for OLS regression. For a discussion of model diagnostics for logistic regression, see Hosmer and Lemeshow (2000, Chapter 5). Note that diagnostics done for logistic regression are similar to those done for probit regression.

10.9 Supplementary Learning Materials

Agresti, A. (1996). An Introduction to Categorical Data Analysis. Wiley & Sons, NY.

Burns, R. P. & Burns R. (2008). Business research methods & statistics using SPSS. SAGE Publications.

Field, A (2013). Discovering statistics using IBM SPSS statistics (4th ed.). Los Angeles, CA: Sage Publications

Data files from Link1 , Link2 , & Link3 .

Statology


Understanding the Null Hypothesis for Logistic Regression

Logistic regression is a type of regression model we can use to understand the relationship between one or more predictor variables and a response variable when the response variable is binary.

If we only have one predictor variable and one response variable, we can use simple logistic regression , which uses the following formula to estimate the relationship between the variables:

\[\log\left[\frac{p(X)}{1-p(X)}\right] = \beta_0 + \beta_1 X\]

The formula on the right side of the equation predicts the log odds of the response variable taking on a value of 1.

Simple logistic regression uses the following null and alternative hypotheses:

  • \(H_0: \beta_1 = 0\)
  • \(H_A: \beta_1 \neq 0\)

The null hypothesis states that the coefficient \(\beta_1\) is equal to zero. In other words, there is no statistically significant relationship between the predictor variable, x, and the response variable, y.

The alternative hypothesis states that \(\beta_1\) is not equal to zero. In other words, there is a statistically significant relationship between x and y.

If we have multiple predictor variables and one response variable, we can use multiple logistic regression , which uses the following formula to estimate the relationship between the variables:

\[\log\left[\frac{p(X)}{1-p(X)}\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k\]

Multiple logistic regression uses the following null and alternative hypotheses:

  • \(H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0\)
  • \(H_A:\) at least one \(\beta_j \neq 0\)

The null hypothesis states that all coefficients in the model are equal to zero. In other words, none of the predictor variables have a statistically significant relationship with the response variable, y.

The alternative hypothesis states that not every coefficient is simultaneously equal to zero.

The following examples show how to decide to reject or fail to reject the null hypothesis in both simple logistic regression and multiple logistic regression models.

Example 1: Simple Logistic Regression

Suppose a professor would like to use the number of hours studied to predict the exam score that students will receive in his class. He collects data for 20 students and fits a simple logistic regression model.

We can use the following code in R to fit a simple logistic regression model:

To determine if there is a statistically significant relationship between hours studied and exam score, we need to analyze the overall Chi-Square value of the model and the corresponding p-value.

We can use the following formula to calculate the overall Chi-Square value of the model:

X² = Null deviance – Residual deviance, with degrees of freedom equal to Null df – Residual df

The p-value turns out to be 0.2717286 .

Since this p-value is not less than .05, we fail to reject the null hypothesis. In other words, there is not a statistically significant relationship between hours studied and exam score received.
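To make the calculation concrete, here is a small Python sketch (the deviance values and degrees of freedom below are placeholders, not the numbers from the professor's model):

```python
from scipy import stats

null_deviance, resid_deviance = 27.3, 24.0   # placeholder values
null_df, resid_df = 19, 18                   # placeholder degrees of freedom

chi_sq = null_deviance - resid_deviance      # overall model chi-square
df = null_df - resid_df
print(stats.chi2.sf(chi_sq, df))             # p-value for the overall model
```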

Example 2: Multiple Logistic Regression

Suppose a professor would like to use the number of hours studied and the number of prep exams taken to predict the exam score that students will receive in his class. He collects data for 20 students and fits a multiple logistic regression model.

We can use the following code in R to fit a multiple logistic regression model:

The p-value for the overall Chi-Square statistic of the model turns out to be 0.01971255 .

Since this p-value is less than .05, we reject the null hypothesis. In other words, there is a statistically significant relationship between the combination of hours studied and prep exams taken and final exam score received.

Additional Resources

The following tutorials offer additional information about logistic regression:

  • Introduction to Logistic Regression
  • How to Report Logistic Regression Results
  • Logistic Regression vs. Linear Regression: The Key Differences


