Using Logistic Regression in Machine Learning with Python

Arya Peruma
4 min read · Aug 11, 2021

In machine learning and statistics, logistic regression is a tool used frequently to model the probability that a certain class or event occurs: win or lose, pass or fail, and so on. Logistic regression is often confused with linear regression, but the two are distinct. Linear regression predicts trends in data by fitting a "line of best fit" and measuring the errors in its predictions, while logistic regression predicts which of two possible outcomes is more likely. In this article, we will explore this in depth, using tumour data to predict whether a patient has cancer or not.

Introduction to Keras in Python

Recapping the primary step when conducting a model analysis: it is always important to import the libraries and modules that give us access to the features we need. The standard imports are Matplotlib, pandas, NumPy and seaborn, along with scikit-learn, which lets us split our data into training and testing sets and preprocess it.

On the 7th and 8th lines of the import block, we bring in TensorFlow and Keras. Keras is the high-level API of TensorFlow, which is why we import it through TensorFlow.
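The original import block is not shown here, but based on the libraries named above it could look something like the following sketch (the exact order and aliases are assumptions):

```python
# Standard data-science imports described in the text
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Lines 7 and 8: TensorFlow and its high-level Keras API
import tensorflow as tf
from tensorflow import keras
```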

Our Data

The data being explored is a set consisting of patients’ tumour data. Depending on the tumour size, our data will tell us if the patient is susceptible to cancer or not. Below is the data of the first 10 patients in the dataset. There are 101 patients in total.
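Loading and previewing a dataset of this shape might look like the sketch below. The column names and the synthetic values are assumptions for illustration, not the article's actual data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: 101 patients,
# a tumour size feature and a 0/1 cancer label.
rng = np.random.default_rng(0)
size = rng.uniform(0.5, 5.0, 101).round(2)
cancer = (size > 2.5).astype(int)  # toy labelling rule, for illustration only
df = pd.DataFrame({"Tumour_Size": size, "Cancer": cancer})

print(df.head(10))   # first 10 patients
print(len(df))       # 101 patients in total
```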

Features and Labels

Nearly every supervised machine learning problem is framed in terms of features and labels: the x values of a dataset are the features, while the y values are the labels. This carries over to logistic regression, where both features and labels are split into training and testing sets.

With this data, the features are the tumour size while the labels are the cancer values, depicting if there is cancer or not with 0s and 1s.
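Splitting those features and labels into training and testing sets, and scaling the features, could be done as follows. The tiny dataset and the 75/25 split ratio here are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features (x): tumour sizes; labels (y): 0 = no cancer, 1 = cancer.
x = np.array([[1.2], [3.4], [0.8], [4.1], [2.9], [1.5], [3.8], [0.6]])
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Scale features on the training set only, then apply the same
# transform to the test set, so no test information leaks in.
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
```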

Creating the Model

To use logistic regression to predict whether a patient has cancer, a model needs to be created. To create the model, we need to add an activation to it. We do not add this in linear regression; however, we need it for logistic regression to apply the sigmoid function after the linear transformation, squashing the output into a probability between 0 and 1.
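A minimal Keras model of this kind, assuming a single tumour-size feature, could be defined like this (the layer layout is an illustrative sketch, not necessarily the article's exact model):

```python
from tensorflow import keras

# One dense unit with a sigmoid activation: output = sigmoid(w * x + b).
# The sigmoid turns the linear transformation into a probability in (0, 1).
model = keras.Sequential([
    keras.layers.Input(shape=(1,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
```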

Once the model is created, we compile it in Python, which amounts to telling our code what the error (loss) is going to be. For this problem we set it to binary cross-entropy, which compares each predicted probability to the actual class output of 0 or 1, as we saw for the labels in our tumour data.
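To see what binary cross-entropy actually measures, it can be computed by hand with NumPy (the probabilities and labels here are made up for illustration; in Keras the loss is selected with model.compile(loss="binary_crossentropy", ...)):

```python
import numpy as np

# Binary cross-entropy compares predicted probabilities with the true
# 0/1 labels: loss = -mean(y*log(p) + (1-y)*log(1-p)).
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0.1, 0.9, 0.8, 0.2])  # probabilities from the sigmoid

bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(round(bce, 4))  # → 0.1643
```

Confident predictions that match the labels (like 0.9 for a true 1) contribute little loss; confident wrong predictions are penalized heavily.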

When training this model, we call model.fit, passing in our scaled x training values and the y training values so the model is fitted to them. The epochs argument is how many full passes of gradient descent we make over the training data, while verbose controls how much information is printed about the computations happening in the background.
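Put together, compiling and fitting could look like the following sketch. The optimizer, the epoch count, and the toy data are placeholders, not values taken from the article:

```python
import numpy as np
from tensorflow import keras

# Toy scaled tumour sizes and 0/1 cancer labels (placeholders for the real data).
x_train_scaled = np.array([[-1.0], [-0.5], [0.5], [1.0]])
y_train = np.array([0, 0, 1, 1])

model = keras.Sequential([
    keras.layers.Input(shape=(1,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])

# epochs = number of full gradient-descent passes over the training data;
# verbose=0 silences the per-epoch progress output.
history = model.fit(x_train_scaled, y_train, epochs=50, verbose=0)
```

The returned history object records the loss after each epoch, which is handy for the error-minimization plots discussed next.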

Error Minimization

When talking about error minimization, we need to ensure we run enough iterations to get the most accurate result. J_list stores the error after each iteration. If we complete enough iterations, the resulting curve flattens out at the end, meaning the error is barely changing anymore.
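The idea behind J_list can be reproduced with a plain NumPy gradient-descent loop. The data below is synthetic and the learning rate is an assumption; the point is to watch the error shrink and then flatten:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic scaled feature and 0/1 labels.
x = np.array([-1.5, -0.7, -0.2, 0.3, 0.9, 1.6])
y = np.array([0, 0, 0, 1, 1, 1])

w, b, lr = 0.0, 0.0, 0.5
J_list = []  # error after each iteration

for _ in range(200):
    p = sigmoid(w * x + b)
    # Binary cross-entropy error for the current parameters
    J = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    J_list.append(J)
    # Gradient-descent update for the weight and bias
    w -= lr * np.mean((p - y) * x)
    b -= lr * np.mean(p - y)

# With enough iterations, consecutive errors barely change (the curve flattens).
print(J_list[0], J_list[-1])
```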

Recall, Precision, and Accuracy

Testing Data
Training Data

Comparing the recall, precision, and accuracy of the testing and training data: precision is much higher on the training set, recall is better on the test set, and accuracy is higher on the training set.
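Recall, precision, and accuracy can all be computed directly from predictions and labels. The numbers below are illustrative, not the article's results:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 1])  # model predictions (illustrative)

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

precision = tp / (tp + fp)        # of predicted positives, how many were right
recall = tp / (tp + fn)           # of actual positives, how many were found
accuracy = np.mean(y_pred == y_true)

print(precision, recall, accuracy)  # → 0.6 0.75 0.625
```

For a cancer screening problem, recall is often the metric to watch: a false negative (a missed cancer) is usually costlier than a false positive.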

Logistic regression is a helpful tool in data science to narrow down probabilities and create accurate predictions through a developed model, as explained in this article. If you would like to review the concepts explained in this article more thoroughly, check out my video below on this topic.


Arya Peruma

Founder of Coding For Young Minds, Software Developer, and Machine Learning Researcher.