Churn user analysis using Apache Spark

Why are we losing our existing customers?

Sparkify is a music app that offers a paid premium membership along with a free subscription. The company providing the service holds a log of the interactions between each user and the app. This database is a treasure to the marketing team.

With the right analysis of the log data, we can see how our users interact with the app. Although the app’s free tier is a great investment for the company, we would like to focus our attention on the behaviour patterns of the users who bought the premium subscription.

Studying the patterns of interaction between premium subscription users and our app could help us in two ways:

In customer-oriented domains, it’s difficult to measure success if you don’t measure the inevitable failures, too. While you might aim for 100% of customers to stick with your company from day one until the end of time, that is unrealistic. That’s where customer churn comes in.

A churned customer is an individual who has stopped using your company’s product or service.

The data is composed of a log of all events or actions taken by the users during the app usage. Within every instance in the dataset, the following features are recorded:

The study aims to identify potential churn users from the provided database. Given this aim, it is reasonable to train a supervised learning model to identify potential churn users based on features engineered from the user-log database.

The tool used for the analysis is Apache Spark, chosen for its strong capabilities in dealing with large datasets.

We have to define a metric to measure our model’s success in capturing potential churn users. Our performance metric is a classification metric. However, in our case the number of churned users in the dataset is rather small compared to the number of active users, so accuracy is not a suitable metric. I have chosen the F1 score to measure the performance of our model.
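
As a minimal sketch of how this metric can be computed with Spark ML (assuming a predictions dataframe produced by some fitted model, with a Churn label column and the default prediction column):

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# F1 balances precision and recall, making it more informative than
# accuracy on an imbalanced churn/active split.
evaluator = MulticlassClassificationEvaluator(
    labelCol="Churn", predictionCol="prediction", metricName="f1")

# `predictions` is assumed to come from a fitted classifier's transform().
f1_score = evaluator.evaluate(predictions)
print(f"F1 score: {f1_score:.3f}")
```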

Before engaging with the learning process using the dataset, a cleaning process is performed. In this process, three main issues needed to be resolved:

The three fixes are coded as a function named clean, which takes the original dataframe and returns a clean dataframe with the three issues corrected. The number of rows in the original dataframe was 286500; the number of rows in the resultant dataframe is 278154. The number of features in the original dataframe was 18; now, it is 20.
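
As an illustrative sketch only, assuming the issues included events with a blank userId and a raw millisecond ts timestamp to expand into readable time features (which could account for the added columns), such a function might look like:

```python
from pyspark.sql import functions as F

def clean(df):
    """Return a cleaned copy of the raw event-log dataframe (sketch)."""
    # Drop events that cannot be attributed to a registered user.
    df = df.filter(F.col("userId") != "")
    # Expand the millisecond timestamp into human-readable features.
    ts_seconds = F.from_unixtime(F.col("ts") / 1000)
    df = df.withColumn("hour", F.hour(ts_seconds))
    df = df.withColumn("date", F.to_date(ts_seconds))
    return df
```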

2. Exploring

In this step, a deeper understanding of the features and of the possible reasons a user might stop using the service is gained. However, within the dataframe, there is no obvious label indicating whether a user is considered churned or not. Hence, a new feature needs to be created to define whether each user has churned.

Before labelling users as churned or not, we have to mark the specific interactions that signal churn. In our case, interactions with pages like “Cancellation Confirmation” or “Submit Downgrade” are labelled as churn interactions.

I have created a column within the dataframe labelling instances in the user log as churn/non-churn based on interactions with the churn pages, aggregated with the maximum value for each user. As a result, all log instances of a user are labelled as churn if that user has interacted with these pages; otherwise, the user is labelled as non-churn.
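
A minimal sketch of this labelling step, assuming the event-log dataframe df carries userId and page columns:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Pages whose visit marks a churn interaction.
CHURN_PAGES = ["Cancellation Confirmation", "Submit Downgrade"]

# Flag individual churn events, then propagate the flag to every
# event of the same user by taking the maximum over a per-user window.
flagged = df.withColumn(
    "churn_event",
    F.when(F.col("page").isin(CHURN_PAGES), 1).otherwise(0))
user_window = Window.partitionBy("userId")
df = flagged.withColumn("Churn", F.max("churn_event").over(user_window))
```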

After defining the churn column, a few exploratory analyses are produced to examine the potential churn users.

As we can see from the chart below, choosing an appropriate metric is necessary because the dataset is imbalanced. The chart also shows the rate at which users leave the service, an alarming figure for the company’s marketing team.

From the chart below, churned and active users use the app’s pages in rather similar proportions, except for the number of interactions with the “Roll Advert” page. Frequent interaction with the advert page could reflect an unpleasant experience while using the app.

% of events by Subscription status

To train a model to identify potential churn users, the dataset has to be user-oriented. A user-oriented dataset presents each instance (row) as a user, outlining the respective features that can be used to induce a supervised statistical model for predicting potential churn users.

In the feature engineering process, a user-oriented dataframe is created, with various features built using Spark, as sketched below.
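
Purely as an illustration, assuming the labelled event log df from the previous step, user-level features might be aggregated like this (the feature names here are hypothetical, not the project’s actual list):

```python
from pyspark.sql import functions as F

# Hypothetical user-level aggregates built from the event log.
features = df.groupBy("userId").agg(
    F.count("page").alias("n_events"),
    F.countDistinct("sessionId").alias("n_sessions"),
    F.sum(F.when(F.col("page") == "Thumbs Up", 1).otherwise(0))
     .alias("n_thumbs_up"),
    F.sum(F.when(F.col("page") == "Roll Advert", 1).otherwise(0))
     .alias("n_adverts"),
    # Cast the label to double, as Spark ML estimators expect.
    F.max("Churn").cast("double").alias("Churn"),
)
```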

Moreover, using a VectorAssembler, the previously created features are combined into a single vector column. The assembled dataframe is then split into training and testing sets to prepare for modelling, with “Churn” as the target column.

Lastly, a StandardScaler is initialized, fitted on the training set, and used to transform both the training and testing sets.
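
A sketch of these steps, reusing the hypothetical feature columns from the previous sketch; the split ratio and seed are assumptions:

```python
from pyspark.ml.feature import VectorAssembler, StandardScaler

feature_cols = ["n_events", "n_sessions", "n_thumbs_up", "n_adverts"]

# Combine the engineered features into a single vector column.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
assembled = assembler.transform(features)

# Split before scaling so the scaler is fitted on training data only.
train, test = assembled.randomSplit([0.8, 0.2], seed=42)

scaler = StandardScaler(inputCol="raw_features", outputCol="features")
scaler_model = scaler.fit(train)
train = scaler_model.transform(train)
test = scaler_model.transform(test)
```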

At this point, training and testing sets are ready for modelling.

In this step, three supervised learning models are trained and tested for predicting potential churn users.
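
A sketch of how the three candidates (logistic regression, a linear support vector machine, and Gradient Boosted Trees, matching the results below) might be trained and compared; the column names follow the earlier sketches:

```python
from pyspark.ml.classification import (
    GBTClassifier, LinearSVC, LogisticRegression)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol="Churn", predictionCol="prediction", metricName="f1")

candidates = {
    "Logistic Regression": LogisticRegression(
        labelCol="Churn", featuresCol="features"),
    "Support Vector Machine": LinearSVC(
        labelCol="Churn", featuresCol="features"),
    "Gradient Boosted Trees": GBTClassifier(
        labelCol="Churn", featuresCol="features"),
}

# Fit each candidate on the training set and score it on the test set.
for name, model in candidates.items():
    predictions = model.fit(train).transform(test)
    print(name, evaluator.evaluate(predictions))
```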

The three models have been trained and tested using two techniques:

Logistic regression model results
Support Vector Machine model results

Although the logistic regression classifier scored better in the initial trials, the Gradient Boosted Trees model offered better chances of improvement through refinement.

After further trials, the Gradient Boosted Trees proved to be the best-performing model on the training and testing sets, with an F1 score of 0.65.

After choosing Gradient Boosted Trees as our model for predicting potential churn users, further refinement is performed to improve its performance. Tuning the hyperparameters raised the F1 score slightly; however, most of the gain in the F1 score came from tuning the data preprocessing methodology performed in the previous step.

In the first model, the number of iterations was left at the default value of 10; changing it to 20 resulted in a higher F1 score. It is possible that an even higher number of iterations could improve performance further; however, the computation required would exceed the available resources.
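
A sketch of the refined model, reusing the data splits and evaluator from the earlier sketches:

```python
from pyspark.ml.classification import GBTClassifier

# Raise the number of boosting iterations from 10 to 20, as described
# above; going higher was too expensive for the available resources.
gbt = GBTClassifier(labelCol="Churn", featuresCol="features", maxIter=20)
gbt_model = gbt.fit(train)
print(evaluator.evaluate(gbt_model.transform(test)))
```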

Gradient Boosted Trees Model refinement

The first result of the project is a model for predicting potential churn users. The resulting model is only a starting point in the process of defining churn.

The evaluation of model robustness was performed in two ways:

As noted in the refinement section, other models scored better than Gradient Boosted Trees in the first trial; however, the greater chance of improving this model’s performance is what led to choosing it.

The second result of the project is an understanding of the relative importance of our engineered features in predicting whether a user is a potential churner.

The most important features in predicting potential churn users are:

This insight into the interaction patterns of churned users could be a valuable recommendation for management. It aligns with our analytical hypothesis that the quality and timing of adverts could be a deciding factor in whether a user leaves the service.
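
The importances themselves can be read from the fitted model; a sketch continuing the earlier hypothetical feature columns:

```python
# featureImportances is a vector aligned with the assembler's input
# columns (the hypothetical feature_cols from the earlier sketch).
ranked = sorted(
    zip(feature_cols, gbt_model.featureImportances.toArray()),
    key=lambda pair: pair[1], reverse=True)
for name, weight in ranked:
    print(f"{name}: {weight:.3f}")
```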

The aim of the project was to analyse the behaviour of users who have stopped using our service (“churn users”) and to train a supervised machine learning model to help predict potential churn users.

Several steps have been taken. Firstly, analysis of the provided dataset helped us hypothesise the reasons why a user would leave our service.

Secondly, several features were created from the dataset to help induce a supervised statistical model for predicting churned users.

The main obstacle I faced in obtaining better model performance was the availability of computing resources. Spark is a great tool for big-data computing, and I would recommend using the Amazon EMR cluster service on AWS. This would give more room to tune the hyperparameters further and get closer to better performance in predicting potential churn users.
