Detecting Suicide Risk in Chat Applications



Worldwide, suicide is the second leading cause of death among 15 to 29 year olds. LGBTI minorities in particular suffer an increased risk of suicide because of discrimination against and marginalisation of their expression of identity. Todo Mejora is a Chilean organisation that offers professional support from psychologists to teenagers of the LGBTI group through a mobile chat application, helping them to overcome the problems they face. The increasing use of this “Hora Segura” called for a system that can automatically detect urgent, suicide-related messages so that high-risk cases can be attended more rapidly. In this report we show how we create such an emergency detection system, which we treat as a binary classification problem. Our best performing model achieves a recall of 0.80 and a precision of 0.61, resulting in an F1 score of 0.69 on the test set.


Approximately one million people commit suicide each year. In the age group of 15 to 29 year olds, suicide is the second leading cause of death. It is estimated that these numbers will continue to grow over the next few years. [1]

In Chile, about five people commit suicide every day, with the highest numbers occurring in the age group of 25 to 44 year olds. Men are four times more likely to take their lives than women. In the age group of 10 to 14 year olds, the suicide rate doubled between the years 2000 and 2008. [2]

Todo Mejora [3] is a Chilean foundation whose mission is to offer support to children and teenagers that suffer suicidal thoughts arising from discrimination and bullying against their sexual orientation or expression of identity.

There exists consistent evidence that the LGBTI group has an increased risk of suicide, see e.g. [4].

In collaboration with Let’s Talk [5], Todo Mejora offers a mobile phone application connecting its users directly with volunteering psychologists during the “Hora Segura”. Here, those affected can talk openly about their problems and find professional help.

As the number of users surpasses the number of available psychologists and as it is not yet possible to offer support at all times, it may happen that urgent cases of high-risk individuals stay unattended for some time. We want to use Natural Language Processing (NLP) to detect urgent, suicide-related messages automatically. Such an emergency detection system may then trigger the sending of notifications to psychologists to bring their attention to the most serious cases.

In this report we summarise this emergency detection system, which we treat as a binary classification problem. First, we describe the collection of annotated data for the task. We then show how we combine text feature extraction methods, such as word2vec and Latent Semantic Analysis, with classification algorithms to build a classification pipeline. We conclude by summarising our results and outlining ways to improve this system in the future.


We want to identify messages of high suicide risk in written conversations between a user and a psychologist (agent). The system will analyse conversations in real time and evaluate the suicide risk with every incoming message. The dataset that we collect should therefore be close to the message level. However, we need to account for two characteristics of chat applications.

Figure 1. Definition of segment and segment pair within a conversation.

Chat members often split their line of thought into multiple messages. Hence, we should not evaluate each message individually, but group consecutive messages of a user. We call this a “segment”.

Also, messages often include a reaction to the previous message of the other participant. In this specific application, the agent will often pose a question to the user, who will then reply to it in their next messages. Hence, the previous segment of the agent is important to put the user’s segment into context.

For these two reasons, we will annotate the pair of agent segment and user segment rather than individual messages. We call this unit a “segment pair”. See figure 1 for a visualisation.

The first segment of the user in each conversation forms an exception to this, as no context exists. As the opening messages of a conversation are of very high importance, we also include these isolated user segments in our annotations.
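As an illustration, the grouping of raw messages into segments and segment pairs can be sketched as follows. The `(sender, text)` tuple format and the role names are assumptions for this sketch, not the application’s actual schema; note that the user’s first segment is paired with an empty context, matching the exception described above.

```python
from itertools import groupby

def build_segment_pairs(messages):
    """Group consecutive messages of the same sender into segments, then
    pair each user segment with the preceding agent segment (if any).
    `messages` is a list of (sender, text) tuples, sender being "agent"
    or "user" (hypothetical field names)."""
    segments = [
        (sender, " ".join(text for _, text in group))
        for sender, group in groupby(messages, key=lambda m: m[0])
    ]
    pairs, prev_agent = [], ""
    for sender, text in segments:
        if sender == "agent":
            prev_agent = text
        else:
            # A user segment: attach the agent context; the first user
            # segment of a conversation gets an empty context.
            pairs.append((prev_agent, text))
    return pairs
```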

To build the dataset, we first create all segment pairs of all conversations in our database. As annotating all segment pairs is not feasible, we filter them by requiring that each pair contain at least one keyword from a list of keywords that are strongly correlated with suicide risk.
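A minimal sketch of this filtering step, using a hypothetical keyword list (the real list was curated with the psychologists and is not reproduced here):

```python
# Hypothetical keyword stems; the actual curated list is not shown.
KEYWORDS = ["suicid", "morir", "matarme", "cortarme"]

def contains_keyword(text, keywords=KEYWORDS):
    """Case-insensitive substring match, so inflected forms such as
    'suicidarme' also match the stem 'suicid'."""
    text = text.lower()
    return any(kw in text for kw in keywords)

def filter_segment_pairs(pairs, keywords=KEYWORDS):
    """Keep a (agent_text, user_text) pair if either side hits a keyword."""
    return [(a, u) for a, u in pairs
            if contains_keyword(a, keywords) or contains_keyword(u, keywords)]
```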

We upload all selected segment pairs to the data labeling platform “Collaboratron” [6], where we ask our volunteers to decide whether the given example shows high risk of suicide and needs urgent attention.

We thus treat emergency detection as a binary classification task.

We have collected 1795 annotated segment pairs, out of which 612 are high-risk and need urgent attention. From these we form a training set with 1079 examples and validation and test sets with 358 examples each. In each of these sets, 34.1% of the examples are emergencies.
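The split above can be reproduced with a simple stratified sampler; the random seed and tie-breaking by rounding are assumptions of this sketch, not details from the report.

```python
import random

def stratified_split(examples, labels, sizes=(1079, 358, 358), seed=0):
    """Stratified train/validation/test split preserving the ~34.1%
    emergency rate in each subset. Returns a list of subsets, each a
    list of (example, label) tuples."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    total = len(labels)
    splits, p0, n0 = [], 0, 0
    for size in sizes:
        # Keep the positive proportion of the full dataset in each subset.
        n_pos = round(size * len(pos) / total)
        idx = pos[p0:p0 + n_pos] + neg[n0:n0 + size - n_pos]
        splits.append([(examples[i], labels[i]) for i in idx])
        p0 += n_pos
        n0 += size - n_pos
    return splits
```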

Besides this set of annotated segment pairs, we are going to use segments from all conversations of Todo Mejora in our database. This unsupervised dataset consists of 600,000 segments (agent and user segments) and will be used to train text feature extractors, such as word2vec and doc2vec.

In this section we have presented the relevant datasets. In the next section we will explain the details of the classification pipelines.


Having presented the dataset in the previous section, we will now show how we use the annotated data to build a classification pipeline.

To turn text into numerical feature vectors, we will use three different approaches: Latent Semantic Analysis (LSA) [7], word2vec [8] and doc2vec [9]. The obtained feature vectors will be passed to a classification algorithm; here, we will test logistic regression and support vector machines. We use the gensim package for the implementations of word2vec and doc2vec; for all other algorithms we use Python’s scikit-learn library.
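As an example of how such a pipeline can be assembled in scikit-learn, the LSA variant might look like this. TF-IDF followed by a truncated SVD is the common scikit-learn realisation of LSA; the component count and solver settings here are assumptions, not the report’s tuned values.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

def lsa_pipeline(n_components=100):
    """LSA features (TF-IDF + truncated SVD) feeding a logistic
    regression classifier, assembled as one scikit-learn pipeline."""
    return make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=n_components, random_state=0),
        LogisticRegression(max_iter=1000),
    )
```

The same skeleton applies to the word2vec and doc2vec variants by swapping the first two steps for a custom feature-extraction transformer.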

LSA and doc2vec will automatically return one feature vector for each segment pair. As word2vec works on a word-by-word basis, we will sum all word vectors of a segment pair and normalise the resulting vector to unit norm.
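The word2vec aggregation step can be sketched as follows; treating `embeddings` as a plain token-to-vector mapping and skipping out-of-vocabulary tokens are assumptions of this sketch.

```python
import numpy as np

def segment_vector(tokens, embeddings, dim=100):
    """Sum the word vectors of all tokens in a segment pair and scale
    the result to unit norm. Tokens without an embedding are skipped
    (an assumption; the report does not specify OOV handling)."""
    vec = np.zeros(dim)
    for tok in tokens:
        if tok in embeddings:
            vec += embeddings[tok]
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```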

As there are two participants in the conversation, we can create feature vectors for the two roles independently and stack them to train the classifier. We will compare this approach to extracting one feature vector for the concatenated text of the two participants. We refer to these two approaches as “independent user texts” and “concatenated user texts”, respectively.
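The two approaches only differ in where the feature extractor is applied; schematically (with `extract` standing in for any of the three extractors, a placeholder for this sketch):

```python
import numpy as np

def independent_features(agent_text, user_text, extract):
    """'Independent user texts': one feature vector per role, stacked,
    which doubles the feature dimensionality."""
    return np.hstack([extract(agent_text), extract(user_text)])

def concatenated_features(agent_text, user_text, extract):
    """'Concatenated user texts': one feature vector for the joined text."""
    return extract(agent_text + " " + user_text)
```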

We will combine each of the text feature extractors with logistic regression or a support vector machine as classifier. The text features will originate either from the concatenated agent and user texts, or will be the stack of the independent features from agent and user text.

First, we train the text feature extractors on the unsupervised dataset of all conversation segments. We then use the extracted features of the supervised dataset to train the final classifier. We perform hyperparameter optimisation on the validation set.
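The validation-set selection step can be sketched as a small grid search; the grid of regularisation strengths below is an assumption, as the report does not list the searched values.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def select_model(X_train, y_train, X_val, y_val, Cs=(0.01, 0.1, 1.0, 10.0)):
    """Fit one logistic regression per candidate C on the training set
    and keep the model with the highest F1 score on the validation set."""
    best_f1, best_clf = -1.0, None
    for C in Cs:
        clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        f1 = f1_score(y_val, clf.predict(X_val))
        if f1 > best_f1:
            best_f1, best_clf = f1, clf
    return best_clf
```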

For comparison, we also test performance using word embeddings previously trained on the Spanish Billion Word Corpus (SBWC) and the Spanish Unannotated Corpora (SUC) [10, 11, 12]. We tested embeddings with dimensionalities of 100 and 300.


We are mostly interested in detecting emergencies in the messages and call the messages that were annotated as emergencies the “positive” class. A true positive is thus a correctly identified emergency. We use precision, recall and F1 score on the validation and test sets to compare the different models. We have:

precision = TP / (TP + FP),

recall = TP / (TP + FN),

F1 = 2 · precision · recall / (precision + recall),

where TP, FP, and FN indicate the number of true positives, false positives and false negatives, respectively.
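The three metrics can be computed directly from these definitions; a minimal sketch (defining all metrics as 0 when their denominator is 0, which is a convention this report does not state):

```python
def prf1(y_true, y_pred):
    """Precision, recall and F1 for the positive (emergency) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```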


We present the results on validation and test set in table 1 in terms of the metrics presented in the previous section.

Out of the feature extractors we find word2vec to achieve the best validation and test set performances. This holds true for both cases of concatenated and independent user texts. LSA is clearly outperformed by the two other feature extractors.

The test set performance is generally lower than the validation set performance, indicating that some overfitting exists. This effect is strongest when using LSA as feature extractor and sizeable for the pre-trained word2vec model. When using custom trained word2vec or doc2vec the effect becomes smallest if SVM is used as classifier. In general, collecting more data and increasing the validation and test set sizes should further mitigate the effect.

We tested various pre-trained word embeddings and list the best performing model in table 1. Even though our dataset is small, we found a word2vec model of 300 dimensions to perform best [10]. As one might expect, its overfitting increases dramatically when using concatenated user texts.

The unsupervised dataset used to fit our custom word and document embeddings is relatively small when compared to, e.g., the SBWC or SUC. However, as the application is very specific, the custom models still outperform word embeddings trained on significantly larger text collections.

Overall, we do not see a clear advantage when using independent user texts. Nevertheless, the best performing model, which also shows strong consistency between validation and test performance, uses word2vec to extract independent user text features and classifies them with an SVM. It achieves a recall of 0.80 on the test set (it identifies 80% of all emergencies) and a precision of 0.61 (61% of all detected emergencies were annotated as emergencies).

Discussion and Conclusions

In this report we built a system to detect urgent, high-risk messages in a chat application connecting its users to psychologists. With a group of volunteering psychologists, we categorised segment pairs of conversations into high-risk messages that require urgent attention and normal messages. We annotated 1795 examples out of which 612 are emergencies. We were thus facing an imbalanced classification problem.

The data was used to build a binary classification pipeline to detect emergencies in the chat application in real time. The best model was found to be a combination of a word2vec feature extractor with an attached Support Vector Machine as classifier. It achieved a precision of 0.61, a recall of 0.80 and an F1 score of 0.69 on the test set.

We investigated whether performance increases if we treat texts sent by the psychologist or the user independently from each other. On the current dataset, we could not find a significant improvement when using independent user texts, but we will continue testing this approach as the dataset grows.

We also tested classification performance when using pre-trained word embeddings. We found a 300 dimensional word2vec model to perform best [10], but with an F1 score lower than our custom trained text feature extractors.

The emergency detector was put into production and now attaches automated tags to conversations, helping the agents to quickly review tagged cases and prioritise the truly urgent ones. The system could also send notifications to psychologists if an urgent case appears outside the hours of attendance.

The detector can be improved in various ways. On the model building side we have not yet completed a comprehensive review of all possible models. For example, we would like to tune pre-trained deep learning models, such as ULMFiT, which have shown great results even with small datasets.

Besides increasing the size of the dataset, more work could also go into engineering features from the text. For example, if a user message contains references to a specific ideation of suicide, it is very likely that urgent attention is needed. A specifically designed feature that detects these references could be crucial to increase performance further.

Furthermore, we do not yet take into account the chat history of a specific user, which may contain important details about the person, such as past suicide attempts, experiences of (cyber)bullying, or heavy alcohol use.

Other features, such as the time the message was sent may also carry valuable information.

We believe that the system developed in this report is a great first step to ensure attendance of urgent, suicide-related conversations.


[1] World Health Organization, 2019, retrieved 29 September from

[2] Ministerio de Salud. Plan Nacional Salud Mental, 2017, retrieved 29 September from


[4] Suicide and Suicide Risk in Lesbian, Gay, Bisexual, and Transgender Populations: Review and Recommendations, Ann P. Haas, Mickey Eliason, Vickie M. Mays et al., 2011, Journal of Homosexuality, 58:1, 10–51, DOI: 10.1080/00918369.2011.534038



[7] Thomas K. Landauer, Peter W. Foltz and Darrell Laham, An Introduction to Latent Semantic Analysis, 1998, Discourse Processes, 25(2–3): 259–284

[8] Tomas Mikolov and Kai Chen and Greg Corrado and Jeffrey Dean, Efficient Estimation of Word Representations in Vector Space, 2013, arXiv:1301.3781

[9] Andrew M. Dai, Christopher Olah and Quoc V. Le, Document Embedding with Paragraph Vectors, 2015, arXiv:1507.07998

[10] Cristian Cardellino: Spanish Billion Words Corpus and Embeddings (March 2016),


[12] P. Bojanowski, E. Grave, A. Joulin and T. Mikolov, Enriching Word Vectors with Subword Information, 2017, Transactions of the Association for Computational Linguistics, 5: 135–146