Predicting customer satisfaction using Deep Learning



In 2020, a major bank asked us for help. This bank had the best digital customer service in the industry: every day it answered thousands of customer requests in real time via chat, WhatsApp, and other channels. However, it had a problem. At the end of each conversation, customers were asked how satisfied they were with the service, but only 15% answered the question, and only on those responses could the bank base loyalty or sales actions.
The remaining 85% did not respond, so no concrete actions could be triggered for them.
From thousands of saved conversations that had a concrete answer, we trained custom detection models to predict customer satisfaction using hundreds of variables generated during a conversation.
Our models combine different data sources to make the predictions: on the one hand, we generate clusters of the conversations; on the other, we use the numerical data associated with a conversation, such as waiting times, conversation duration, and number of messages.
Using an API provided by Deep Talk, the Bank added to its contact center system a metric that predicts customer satisfaction for each conversation handled by its executives, even when the customer does not answer the survey.

The Dataset

Customers evaluate conversations on a star scale that ranges either from one to five or from one to seven stars, with more stars indicating higher satisfaction. We divide the conversations into two groups depending on their star rating: conversations with 1 or 2 stars receive a target value of 0 (conversations with a bad rating), and conversations with 4 or 5 stars (6 or 7 on the seven-star scale) receive a target value of 1 (conversations with a good rating). We discard conversations with any other rating.
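The labeling rule above can be sketched as a small helper. This is an illustrative function, not the production code; the name `label_rating` and the `scale_max` parameter are our own.

```python
def label_rating(stars, scale_max=5):
    """Map a star rating to a binary satisfaction target.

    Returns 0 for a bad rating (1 or 2 stars), 1 for a good rating
    (the top two values of the scale), and None for ratings in
    between, which are discarded.
    """
    if stars in (1, 2):
        return 0
    if stars in (scale_max - 1, scale_max):
        return 1
    return None  # neutral ratings are dropped

# Examples on both rating scales:
print(label_rating(1))               # 0 (bad)
print(label_rating(5))               # 1 (good)
print(label_rating(3))               # None (discarded)
print(label_rating(6, scale_max=7))  # 1 (good on the 7-star scale)
```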

We also filter out conversations in which the client did not reply to the executive's first response, as in these cases no real conversation took place.

We are left with around 120,000 conversations, of which 80% fall into the positive group, making the dataset fairly unbalanced. To build the model, we split these conversations into a training, validation, and test set. Here, we create a balanced validation and test set, so that we can use prediction accuracy as a reliable metric to evaluate our models. When we balance the dataset, we perform a per-tag balancing to respect the origin of all conversations (project tags such as “Company x”, “password recovery”, …). Hence, our final test dataset is balanced both per tag and overall.
The imbalance of the training dataset is left as a hyperparameter: during training, we build models with different levels of imbalance to see how imbalance affects model performance.
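The per-tag balancing can be sketched as a simple downsampling step. This is a minimal illustration assuming each conversation is a dict with `tag` and `target` keys (an invented schema, not our production data model).

```python
import random
from collections import defaultdict

def balance_per_tag(conversations, seed=0):
    """Downsample so that, within each project tag, the number of
    positive and negative conversations is equal. The result is then
    also balanced in its totality."""
    rng = random.Random(seed)
    by_tag = defaultdict(lambda: {0: [], 1: []})
    for conv in conversations:
        by_tag[conv["tag"]][conv["target"]].append(conv)
    balanced = []
    for groups in by_tag.values():
        n = min(len(groups[0]), len(groups[1]))
        balanced += rng.sample(groups[0], n) + rng.sample(groups[1], n)
    return balanced

# Tiny demo: tag A has 3 positive / 1 negative, tag B has 2 / 2.
convs = ([{"tag": "A", "target": 1}] * 3 + [{"tag": "A", "target": 0}]
         + [{"tag": "B", "target": 1}] * 2 + [{"tag": "B", "target": 0}] * 2)
print(len(balance_per_tag(convs)))  # 6: tag A keeps 1+1, tag B keeps 2+2
```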

For each conversation, we calculate various metadata features (such as average user response time, duration of the conversation, number of tokens used, …) and use Google’s NLP API to additionally collect sentiment scores for the conversations. These features are used alongside the pure text of the conversation when building the model. In the following, we treat sentiment features and other metadata features the same way and simply refer to them all as metadata features.


We construct speaker-dependent text features, meaning that we build separate features for text coming from the user and from the client. To turn the text into numerical features we use the tf-idf statistic (term frequency — inverse document frequency), which tries to reflect the importance of a token within its context. The tokenization hyperparameters (n-gram range, preprocessing method, tokenization method, etc.) are varied during training and optimized on the validation dataset.
We use SVD (singular value decomposition) to reduce the dimensionality of the resulting features. Again, the number of SVD components is a hyperparameter set using the validation dataset.
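The text pipeline (tf-idf followed by truncated SVD) can be sketched with scikit-learn. The hyperparameter values here are placeholders; in practice they were tuned on the validation set, and the tiny demo corpus is invented.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def build_text_pipeline(ngram_range=(1, 3), n_components=200):
    """tf-idf features reduced by truncated SVD; one such pipeline is
    built per speaker type (client and user)."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=ngram_range),
        TruncatedSVD(n_components=n_components, random_state=0),
    )

# Tiny demo corpus (client-side messages); 2 components only because
# the corpus is so small.
client_texts = ["no puedo entrar", "muchas gracias", "pero no funciona", "solo eso"]
pipe = build_text_pipeline(n_components=2)
features = pipe.fit_transform(client_texts)
print(features.shape)  # (4, 2): one dense feature vector per conversation
```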

The metadata features are scaled into the [0, 1] range; no further preprocessing is applied. Missing values are replaced with the per-feature median. During training, we use different subsets of metadata features to keep the number of features limited and avoid over-fitting.
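The metadata preprocessing amounts to median imputation followed by min-max scaling, which scikit-learn expresses directly. The feature values below are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Columns: [avg response time (s), number of messages] (invented data).
metadata = np.array([
    [30.0, 12.0],
    [np.nan, 4.0],   # missing response time -> replaced by the median (60.0)
    [90.0, 20.0],
])
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    MinMaxScaler(feature_range=(0, 1)),
)
scaled = pipe.fit_transform(metadata)
print(scaled.min(), scaled.max())  # 0.0 1.0
```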

We stack the text and metadata features into one array and train a final classifier on it. Here we have focused on a logistic regression classifier.
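The final step (stacking both feature blocks and fitting logistic regression) looks roughly as follows. The data is synthetic and the feature counts are arbitrary; this is a sketch of the structure, not the deployed model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
text_features = rng.normal(size=(n, 8))       # e.g. SVD components of tf-idf
metadata_features = rng.uniform(size=(n, 3))  # scaled metadata

# Stack column-wise so each row is one conversation's full feature vector.
X = np.hstack([text_features, metadata_features])
# Synthetic target: a linear rule over one text and one metadata feature.
y = (text_features[:, 0] + 0.3 * metadata_features[:, 0] > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape, clf.score(X, y))
```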


The best accuracy we achieve using combined metadata and text features is 74.9% on the balanced test set. Since we do not use sentiment features in deployment (that would require calling Google’s NLP API at prediction time), the currently deployed model relies purely on text features. It has a slightly lower test set accuracy of 74.2% and constructs 200 text features per speaker type (client and user) from a (1, 3) n-gram range.

In general, we find that the derived text features are of much higher importance for the model than the metadata features. However, metadata features increase the accuracy by roughly 0.5 to 1 percentage point.

Examples of tokens (in Spanish, the language of the conversations) that are most helpful for the model’s decision are: “muchas gracias”, “nada mas”, “solo eso”, “pero no”, “no tengo”, “no puedo”, “invitamos communique través”, “solo para asistencia”, “sucursales”, “acepta cargo”.

Metadata Analysis

Even though the metadata features do not strongly affect the final accuracy of the satisfaction prediction, it is important to understand how customer satisfaction correlates with them. To quantify their significance, we performed an analysis that identifies when a conversation is twice as likely to receive a negative evaluation. Of all conversations, only 20% received a negative evaluation; we looked for regions in the metadata space in which this rate doubles. Such an analysis will be very useful in the future if we want to predict customer satisfaction live, during a conversation. We briefly summarise our findings here by listing the metadata features that correlate most strongly with a higher probability of a conversation receiving a negative rating.

  • average client sentiment in the last third of the conversation
  • length of last user segment
  • total number of segments
  • time per message
  • average response time to user
  • number of tokens used by the client
  • average response time to client

We assume that these features in particular help to increase the accuracy of the satisfaction prediction; however, we have not yet analysed this in detail.

Summary and Outlook

We have put into production a model that predicts customer satisfaction with an accuracy of 74.2% once a conversation is closed. This helps the Bank obtain an improved report of customer satisfaction from past conversations: only about 28% of all the Bank’s conversations receive a rating, and the model helps the Bank better understand how users perceived the other 72%. Beyond this reporting application, the model can help the Bank monitor satisfaction in the future, because it provides feedback even when a conversation is not evaluated.
