Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish

Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus, redundant classification is required to validate the assigned tag. Even though, in recent years, the number of emotion-tagged text datasets in Spanish has been growing, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish include a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss kappa interrater agreement measure. Error analysis is performed by using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundreds of thousands of tagged comments and it does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone and, as such, the dataset is suitable for training classification algorithms in the sentiment analysis field.


Introduction
Understanding emotions is key for human intelligence emulation and, thus, for the advancement of artificial intelligence. In addition, the opportunity to capture sentiments has gained interest in both the scientific community and the business world, which has led to the emerging fields of affective computing and sentiment analysis [1].
Affective computing [2] is a field of cognitive computing and artificial intelligence, whose objective is to develop systems that are able to recognize, interpret, process, and simulate human emotions. Sentiment analysis (SA) is a suitcase research problem that requires tackling many natural language processing (NLP) tasks [3]. It contains three layers. The first one is a syntactic layer that aims at pre-processing texts and includes tasks such as part-of-speech tagging, lemmatization, and microtext normalization. The second one is a semantic layer that aims at deconstructing the normalized text from the previous layer into concepts, resolving entities, and filtering neutral content to improve sentiment classification accuracy. The tasks in this layer are, among others, concept extraction, word sense disambiguation, and subjectivity detection [4]. The last one is the pragmatics layer, focused on extracting meaning from both the sentence structure and the semantics obtained from the previous layers; it includes tasks such as polarity detection, aspect recognition, sarcasm detection [5], and personality recognition [6]. Medhat et al. proposed another definition for SA [7], stating that SA can be considered as a classification process with three primary classification levels: document level, sentence level, and aspect level, in which the goal is to detect whether a positive or negative opinion or sentiment is expressed, generally known as polarity detection. Cambria et al. [8] introduced the concept of Sentic Computing as a multi-disciplinary approach to SA, in which both computer and social sciences are combined to better recognize, interpret, and process opinions and sentiments on the Web.
According to Medhat et al. [7], datasets used in SA represent an essential issue. In that study, the authors also point out that the main sources of data are from product reviews; however, SA is applied in other domains as well. The purpose of applying SA techniques may vary depending on the end-user. For example, companies are interested in better understanding their customers and competitors to improve their market share [9]. Buyers, on the other hand, would like to make better purchasing decisions by taking advantage of the opinions of other buyers [10]. Politicians usually are very interested in knowing the public opinion in a timely and accurate manner, enabling better decision-making [11]. Even in medicine, there are SA applications, for example, to identify mentions of personal intake of medicines in tweets [12].
With the advent of Web 2.0, the availability of data sources has increased considerably. Through social media, millions of users worldwide interact with others, sharing their comments and experiences. Moreover, some social media sites also allow users to input additional information along with the text, such as emoticons, thumbs up/down, scores, categories, or some raw emotions.
As stated by Wang et al. [13], there are many social media tools for carrying out sentiment analysis, but they focus on finding the aggregate-level sentiment, such as sentiment polarity. Nevertheless, the authors propose that, if finer-grained sentiment analysis can be achieved, it will yield more specific and more actionable results with detailed negative emotion subcategories such as anger, sadness, and anxiety or positive emotion subcategories such as happiness and excitement.
The terms sentiment and emotion are widely used but usually confused or misinterpreted [14] and have often been used interchangeably; however, sentiments are differentiated from emotions by the duration in which they are experienced. Wang et al. [15] stated that, while sentiments reflect feelings and attitudes, emotions provide a more refined characterization of the sentiments involved. Emotion sensing drills deeper to reveal the exact emotions expressed in the text. In their study, the authors hold that whatever emotion-sensing methodology is used, having a proper categorization model for emotions is always very important. In this sense, the study reviews many of the existing emotion models by considering the view of psychologists, as well as perspectives from social science, computing science, and engineering. The different models surveyed in the study vary in the number of emotions they recognize; some consist of six primary emotions, while others identify up to 24. Additionally, models can be divided between categorical and dimensional.
Among categorical models, Ekman's model of emotions [16] stands out. This model is based on the argument that there are six distinctive facial expressions (plus neutral): anger, fear, disgust, joy, sadness, and surprise. On the other hand, two-, three-, and four-dimensional models can be identified. Two-dimensional models are characterized by valence/arousal. Three-dimensional models incorporate an additional dimension, which varies according to the model in question. Lastly, there are several four-dimensional models, such as The Hourglass of Emotions model [17], which considers sensitivity, aptitude, pleasantness, and attention as dimensions. This model is an affective categorization model, primarily inspired by Plutchik's studies on human emotions, and is a biologically inspired as well as psychologically motivated emotion categorization model.
For the present paper, Ekman's categorical model [16] is used, since it is one of the most widely adopted models for affect recognition [17], and because Ekman's basic emotions are somewhat related to the Facebook reactions (LOVE, SAD, ANGRY, WOW, and HAHA).
In this work, a comment and a reaction produced by the same user in response to a given post are linked, because the authors assume that a topic may trigger, but usually not express, an emotion, whereas a comment usually conveys the emotion felt by the reader of that topic. Even though, in this case, the link between some of the reactions and Ekman's basic emotions could seem straightforward, it should be noted that the tagging process is not performed in a controlled environment, and the people who tagged the content are not trained for this specific task. In addition, it is presumed that there may be a significant level of noise in the comment-tag association, produced mainly by the presence of trolls, interaction between users, and the editing or deletion of comments and reactions.
The usefulness of basic emotion datasets depends on the reliability of the emotions assigned to the content. The ultimate goal of the users of this kind of dataset is to predict basic emotions, not Facebook reactions. In this regard, it is necessary to establish the strength of the link between those reactions and basic emotions. The use of the reactions in this manner could be seen as a form of distant supervision (DS) [18], in which data are tagged automatically or semiautomatically, using some safe signals already present as proxies. This approach allows building a larger dataset by eliminating the need for extensive manual tagging. Some other studies [19][20][21][22][23][24][25][26][27] have already used Facebook reactions but, unlike this work, they linked the reaction to the topic from which it stemmed.
Furthermore, social network data are usually noisy. They contain many issues, such as casual language, spelling errors, and troll activity. The latter is particularly damaging for the construction of basic-emotion datasets, because trolls usually post and repeat their comments and reactions regardless of the topic discussed, and they usually interact ironically or provocatively. Because of these known issues, there is a need to establish the quality of the datasets constructed from social network data. One way to achieve this is to measure the agreement among raters, regarding the reactions, for a small sample of the dataset. In this study, Fleiss kappa [28] will be used, as it is one of the most widely adopted interrater agreement measures.
This work is focused on the Spanish language because, to the best of the authors' knowledge, there are no studies that build and measure the quality of a distantly supervised tagged dataset in this language by comparing it with full manual tagging. The goal is to provide a valuable dataset that can be used in future studies.
In summary, in this paper, a SA issue, which is the generation of emotion-tagged datasets for the Spanish language, is addressed. The dataset presented is built by applying DS on Facebook comments and reactions, and it is validated using the Fleiss kappa interrater agreement measure and the Sentic Computing tool BabelSenticNet.
The remainder of this paper is organized as follows. "Related Work" reviews the literature on DS, Spanish datasets, and interrater agreement measurement. "Dataset Compilation and Filtering Process" presents the dataset along with the compilation and filtering process. "Experimental Setup" describes the validation process performed. In "Results and Discussion", the results are shown. Lastly, "Conclusions and Future Work" are discussed.

Related Work
Tagged datasets are a key ingredient for developing machine-learning text-based classifiers. Mercado et al. [29] stated that in any automatic text analysis, it is essential that there are adequate datasets available so that the data mining and machine-learning approaches can obtain reliable and informative results. Moreover, according to Lo et al. [30], most of the effort has been made in creating resources for formal languages, used in official communication, while, with the popularity of social media, informal linguistic variants are becoming widespread, and those variants require different considerations for their analysis. The mentioned study focuses on multilingual SA; out of all the approaches, lexicons, tools, and corpora listed, only a few focus on variants of the Spanish language. One exception to the above is SenticNet [31], a concept level knowledge base for sentiment analysis that supports 40 languages, including Spanish, using the tool BabelSenticNet [32].
Nevertheless, most resources available are for the English language. Justo et al. [33] stated that the majority of research in disciplines like SA addresses English, even though 48% of Internet resources are written in other languages. This results in the need for creating resources in other languages as well [30,34]. However, as mentioned before, manual tagging is one of the most time-consuming tasks in the creation of emotional datasets. To overcome this issue, many studies have constructed datasets by using DS [18]. In this model, an already existing noisy label is linked to the content to build a tagged dataset automatically.
According to Roth et al. [35], DS allows creating large amounts of training data at a low cost. As the data obtained are inherently noisy, the most challenging problem is improving their quality by reducing the amount of noise.
DS was applied in the work of Go et al. [36], in which emoticons were used as labels to automatically classify a dataset of tweets into one of three categories, which were positive, negative, and neutral. The latter was discarded, and then Support Vector Machines (SVM), Maximum Entropy (ME), and Naive Bayes (NB) classifiers were trained and tested only with positive and negative tweets. The best accuracy reported by this study, 82.7%, was achieved using a combination of unigram and bigram features with ME and NB.
Bandhakavi et al. [37] used labeled (blogs, news headlines) and weakly labeled (tweets) emotion text to generate an emotion lexicon that jointly modeled both the emotionality and neutrality of documents at word level. Pool and Nissim [19] used Facebook reactions in a DS fashion to train an SVM for emotion detection. Nevertheless, they linked reactions to the original post, which is the most widely adopted association in studies that use Facebook reactions [20][21][22][23][24][25][26][27], rather than associating them to the comment, which is proposed in the present paper. In addition, they did not measure the reliability of the automatic tags.
DS is also beneficial when working with low resource languages, as presented in the work of Refaee [38], in which several experiments with distantly supervised datasets in Arabic were conducted. The author concluded that, for subjectivity classification, DS (emoticon and lexicon-based) outperforms fully supervised methods in this language. However, for sentiment classification (positive vs negative), dataset size had to be expanded and hashtags should also be used as labels. The author also states that the results of DS can be language-dependent and that this type of experiments should be conducted for every language.
In the work of Suttles et al. [39], a dataset of tweets was collected, and then, hashtags, emoticons, and emojis in the tweets were used to tag the dataset in a DS fashion. Having multiple ways to tag the dataset allowed performing cross-validation of the tagging process using the χ² goodness-of-fit test. The authors concluded that, with minor exceptions, there was consensus between the tags. The tagged dataset was then used to train machine-learning classifiers that obtained accuracies between 75 and 91% (tested with manually tagged tweets).
Felbo et al. [40] collected a very large dataset of tweets and used emojis as noisy labels to train a Long Short-Term Memory (LSTM) model. Emojis were stripped from the text, and the model was trained to learn which emoji was removed.
Since the dataset compiled and analyzed in the present work is in Spanish, literature was reviewed for articles that work with this specific language using DS for automatic basic emotion tagging. However, to the best of the authors' knowledge, most papers use polar tags (or a similar variant including neutral and several degrees of positive and negative). In this way, in the research of Moctezuma et al. [41], a dataset of 18 million tweets in Spanish was classified into positive and negative using Spanish affective lexicons. Martín et al. [42] mapped the rating attached to the comments of a touristic website also into a polar classification. Sandoval-Almazan et al. [27] measured the impact of Facebook posts in political campaigns by collecting Spanish posts together with some statistics, like the number of comments, shares, and reactions. However, in that work, the analysis was carried out only based on the emoticons included in the text of the comments, without analyzing the emotion that the text itself might reflect.
As mentioned, most papers that carry out sentiment classification in Spanish rely on datasets built with manual tagging. Such is the case for most datasets provided or presented at the workshop Taller de Análisis de Sentimientos en Español (TASS) [43], and also for the dataset used at the IberLef 2019 competition [44], which compiled tweets in different variants of Spanish. In total, the latter dataset contains around 15,000 tweets in five Spanish variants. Even though the number of samples is significant, the dataset was classified manually, so it surely took a lot of time and resources to compile. Datasets of this size have been easily compiled using DS in other languages [36,40,45,46,47,48,49].
An important aspect when working with DS is the validation of the data sources. As mentioned before, since the requirement of manual tagging is removed, datasets that rely on noisy labels tend to be particularly large. This considerable size may be a problem when validating the data. To solve this, two main strategies are usually adopted.
The first strategy is the χ² goodness-of-fit test. To implement this test, at least two different kinds of labels are used to cross-validate data. As mentioned in [39], this has been done using emojis, emoticons, and hashtags. The agreement among the tags was well above chance for the majority of the classes. That was also the case in [38] for validating sentiment labels obtained with different approaches.
The second strategy is the Fleiss kappa interrater agreement measure [28]. This is useful for measuring agreement among a fixed number of raters over categorical data. The formula to calculate this measure can be seen in equation (1), where $\bar{P}$ is the probability of agreement among raters, and $\bar{P}_e$ is the probability of agreement by chance.
$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} \qquad (1)$$
The value of $\kappa$ ranges between −1 (perfect disagreement) and 1 (perfect agreement). Carletta [50] established $\kappa > 0.80$ as good reliability, with $0.67 < \kappa < 0.80$ allowing tentative conclusions to be drawn. However, the author also hints that discourse and dialog phenomena may be more complicated than other types of analysis (such as subject classification on newspaper articles). Hearst [51] suggested that this hint implies that a lower reliability requirement may be justified for this kind of study. Moreover, it should also be noted that these conclusions were drawn before the era of social networks, and subsequent studies were more permissive with reliability requirements.
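As a minimal sketch of how the measure in equation (1) can be computed, the following Python function takes a matrix of counts in which each row is an item and each column a category. The function name and the input layout are assumptions made for illustration, and the sketch assumes that every item received the same number of ratings, as the measure requires.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss kappa for a matrix where counts[i][j] is the number of
    raters who assigned item i to category j (illustrative layout)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    # Undefined (division by zero) when all ratings fall in one category.
    return (p_bar - p_e) / (1 - p_e)
```

For instance, two items rated by three raters with full agreement, `fleiss_kappa([[3, 0], [0, 3]])`, yield 1, the upper end of the scale.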
Cohen's kappa metric [52], a predecessor to the Fleiss kappa metric but limited to two raters, was used in [38] to measure the agreement in the annotations of two newly collected tweet datasets in Arabic. The average kappa was 0.786, indicating substantial agreement. The classes used were positive, negative, neutral, mixed, uncertain, and skip. The last two classes, while useful, tend to improve agreement measure results, as the most challenging content usually falls into them.
In the work of Gambino and Calvo [53], a dataset of 3,572 Twitter messages in Spanish was compiled. Then, each tweet of the dataset was classified into one of six basic emotions (love, joy, surprise, anger, sadness, and fear) by four annotators. After the annotation process was completed, the resulting agreement measure was 0.49, indicating moderate agreement.
In SemEval 2019 [54], a dataset of textual dialogs was built for one of the tasks. It consisted of 38,424 dialogs that were manually tagged into four different classes (angry, happy, sad, and others) by seven human raters each. The Fleiss kappa score obtained for the tagged data was 0.59, also indicating moderate agreement. In this case, the class "others" may have helped to improve the final agreement score.
As seen, although Carletta [50] established challenging agreement requirements, most researchers carrying out emotion classification on social networks textual data consider that a moderate agreement in the Fleiss kappa scale is acceptable. This will also be the case in the present study. In addition, although many studies measure the quality of the manual tagging using the Fleiss kappa metric, no studies, in the Spanish language at least, compare the reliability of the datasets tagged using DS versus the ones manually tagged.

Dataset Compilation and Filtering Process
Comments and reactions were collected from Facebook, since it is one of the most widely used social networks, with more than 2,449 million active users worldwide as of January 2020 [55].
Those comments and reactions were taken from the interactions of many different Facebook users that posted across 13 widely read news portals in Argentina, namely, Clarín, La Nación, Página 12, El Cronista, Ámbito Financiero, Todo Noticias, Crónica, CNN en Español, C5N, Agencia Télam, Diario Deportivo Olé, Teleshow, and Infobae. These news portals cover different types of news, and they were chosen due to the variety of topics they cover and because they are among the most widely consumed in Argentina [56]. Each comment-reaction tuple reflects the user interaction with a particular post, by reacting to the post and writing a comment.
Comments posted during a period of 4 years were compiled. This aspect is addressed in "Comment Compilation". However, not all of the collected comments ended up in the final dataset, since they went through a selection and filtering process. Figure 1 shows the entire process, from compilation to the final filtering of comments considered as useful. Each step of the process, namely comment compilation, comment tokenization, filtering by token count, filtering by language, and troll filtering, is explained in detail in the following sections. The final structure of the dataset is described in "Dataset Description".

Comment Compilation
The extraction of comments, reactions, and posts was performed using the Facebook Graph API [57]. This tool allows setting the interval for data retrieval, so the extraction dates were set from 1 January 2016 to the end of December 2019, i.e., a 4-year extraction period. Then, the results were stored in a relational database.
Storing the data in a database made the selection process more straightforward, since not all of the collected comments were valid. All comments and reactions in posts made by the news portals mentioned above were collected, but not all of them are significant to this study. Comments associated with the reaction LIKE were deliberately excluded from the study, since people generally use it to indicate that they saw the post. Moreover, each Facebook user can interact multiple times with a particular post by writing more than one comment. Therefore, only the first (oldest) comment is kept, as it is considered the purest or most significant expression of the emotion. Before this selection, the number of compiled comments was 20,996,169. After keeping only those related to the reactions LOVE, HAHA, ANGRY, and SAD, and then selecting only the oldest contribution from each user (in the case of users posting more than one comment in the same post), the number of comments dropped to 1,716,413.
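The selection rules above (discard LIKE, keep only LOVE, HAHA, ANGRY, and SAD, and keep the oldest comment per user and post) can be sketched as follows. The record layout and field names are hypothetical, since the schema of the relational database is not described here; the sketch only illustrates the logic.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable

KEPT_REACTIONS = {"LOVE", "HAHA", "ANGRY", "SAD"}


@dataclass(frozen=True)
class Interaction:
    """One comment plus the reaction its author left on the same post.
    All field names are illustrative assumptions."""
    post_id: str
    user_id: str
    created_at: str  # ISO 8601 timestamp, so text order is time order
    comment: str
    reaction: str


def select_first_comments(rows: Iterable[Interaction]) -> list[Interaction]:
    """Keep, per (post, user), only the oldest comment with a kept reaction."""
    oldest: dict[tuple[str, str], Interaction] = {}
    for row in rows:
        if row.reaction not in KEPT_REACTIONS:
            continue  # drops LIKE and any other reaction outside the four
        key = (row.post_id, row.user_id)
        if key not in oldest or row.created_at < oldest[key].created_at:
            oldest[key] = row
    return list(oldest.values())
```

In a relational database the same selection would more likely be expressed as a grouped query; the in-memory version is shown only because it is self-contained.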

Comment Tokenization
Each comment consists of tokens, and each token is a sequence of characters. Special characters like white spaces and line breaks separate one token from another. Because of this, a token can be a well-formed word, an emoji, a link, or another kind of composition. Therefore, the first step is the tokenization of the comments. The TweetTokenizer class, from the NLTK library [58], was used for this step. For the rest of the filtering process, links, signs, non-printable characters, and Spanish stop words were not counted as tokens of a comment. For example, consider the following comment: "Señor Olé usted es diabolico." Here, TweetTokenizer obtains six tokens: "Señor", "Olé", "usted", "es", "diabolico", and ".", but only four of those tokens are considered valid, because the token "." is a punctuation mark and the token "es" is a Spanish stop word.
After tokenization, comments left with zero valid tokens (that is, comments consisting only of links, signs, non-printable characters, and Spanish stop words) were excluded as well. After this, the number of useful comments dropped to 1,674,912.

Filtering by Token Count
Not all collected comments are significant and can be linked to an emotion. For example, comments with two or fewer valid tokens are, in general, just the name and surname of another Facebook user (when a user tags a friend, for instance). For this reason, a first filter was applied in order to remove all comments with two or fewer valid tokens. For example, this filter removes comments like "Mirá http://www.eldestapeweb.com/le-entregaron-un-segundo-negocio-la-empresa-del-jefe-del-pami-n37687", since it has only one valid token ("Mirá"); or "Claudia Cocco", which consists of only two valid tokens, and is an example of a tagged user.
Consequently, comments tokenized as described in "Comment Tokenization" and containing fewer than three valid tokens were excluded from the dataset. After applying this filter, the number of useful comments dropped to 1,261,783.
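The valid-token count and the resulting filter can be illustrated with the sketch below. To keep it self-contained, a simple regular expression stands in for NLTK's TweetTokenizer and a tiny hand-written stop-word set stands in for a full Spanish stop-word list; both are assumptions, as are the function names, and emojis are treated as signs here, which the text does not specify.

```python
import re

# Illustrative subset only; a real run would use a complete Spanish list.
STOP_WORDS = {"es", "de", "la", "el", "y", "que", "en", "un", "una"}
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # words, or single non-space signs
MIN_VALID_TOKENS = 3


def valid_tokens(comment: str) -> list:
    """Tokens that count toward the filter: no links, signs, or stop words."""
    text = URL_RE.sub(" ", comment)  # links are never valid tokens
    return [
        tok
        for tok in TOKEN_RE.findall(text)
        if any(ch.isalnum() for ch in tok) and tok.lower() not in STOP_WORDS
    ]


def keep_comment(comment: str) -> bool:
    """True when the comment has at least three valid tokens."""
    return len(valid_tokens(comment)) >= MIN_VALID_TOKENS
```

With these stand-ins, the examples from the text behave as described: "Señor Olé usted es diabolico." has four valid tokens and is kept, while "Claudia Cocco" has two and is removed.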

Filtering by Language
Even though almost all users that interact with the selected news portals comment in Spanish, there are a few comments in other languages. Thus, another filter was applied in order to remove non-Spanish comments. For this process, "Python Bindings to CLD2" library [59], or simply CLD2, was used. The CLD2 language detection process was applied to all remaining comments after the application of the previously described filters, yielding a total of 1,035,045 comments that were written in Spanish.
Since language detection is a complex process that can produce false positives, an extra validation step was performed using Googletrans [60]. Because the Google Translate API is not free and the number of free daily requests is limited, a relatively small sample of 1,400 comments, randomly selected from the previous set of Spanish comments, was used to cross-check the results. All the comments analyzed with Googletrans were recognized as written in Spanish, which further supports the results obtained with the CLD2 library.

Troll Filtering
Trolling is an interpersonal antisocial behavior prominent within Internet culture across the world, and Facebook, with more than two billion active users worldwide, has become the Internet's biggest playground for engaging in antisocial behaviors, mainly trolling. Trolling behavior includes starting aggressive comments and posting inflammatory, malicious messages in online comment sections to deliberately provoke, disrupt, and upset others [61].
Those comments and interactions are undesired for this study, since they do not necessarily reflect an emotion or a reaction to a particular topic. Trolls write comments in several posts, frequently the same comment, independently of the topic of the post. Hence, the troll filtering process consists of identifying all the comments that could potentially have been posted by trolls and excluding them from the dataset.
The process first identified all the comments that appear more than once and then counted the number of appearances. It revealed that there were 14,488 comments in the dataset that appeared at least twice. If the entirety of comments collected initially is considered (20,996,169), this number goes up to 237,309 repeated comments. The 14,488 repeated comments, which represent about 1.4% of the dataset, were excluded. After applying this filter, the number of useful comments dropped to 1,020,557.
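A minimal sketch of this repetition-based filter is shown below. How two comments were judged to be "the same" is not stated, so the normalization used here (case folding and whitespace collapsing) is an assumption, as is the choice to drop every copy of a repeated comment rather than keep one.

```python
from collections import Counter
from typing import List


def _normalize(text: str) -> str:
    """Assumed equality criterion: ignore case and repeated whitespace."""
    return " ".join(text.split()).lower()


def drop_repeated_comments(comments: List[str]) -> List[str]:
    """Remove every comment whose normalized text occurs more than once."""
    counts = Counter(_normalize(c) for c in comments)
    return [c for c in comments if counts[_normalize(c)] == 1]
```

Replacing `_normalize` with the identity function would give an exact-match variant of the same filter.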

Dataset Description
After the selection and filtering process, the dataset, included as Electronic supplementary material 1, was built. Its main characteristics are described below:
• Filename: "facebook_automatically_tagged_dataset.
Table 1 shows the maximum, minimum, and average token and character counts in post titles, subtitles, and comments. Table 2 shows the same information, but segmented by reaction.

Dataset Title, Subtitle and Comment Frequency
Another important characteristic of the dataset is the frequency of the instances in terms of the number of tokens and characters in them. Assuming these counts approximately follow a normal distribution, and by using the empirical rule, only the instances within a window of two standard deviations around the mean value were considered, in order to produce more comprehensible histograms. Figures 2, 3, and 4 show the frequency of instances in terms of tokens for titles, subtitles, and comments, respectively. Figures 5, 6, and 7 show the same frequencies at the character level.
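The trimming applied before plotting can be sketched as below. The width of the window is read here as the mean plus or minus k standard deviations with k = 2, which is only one possible reading of the description; the function name and the use of the population standard deviation are likewise assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def within_k_sd(values: Sequence[float], k: float = 2.0) -> List[float]:
    """Keep the values lying within mean +/- k standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed)
    low, high = mu - k * sigma, mu + k * sigma
    return [v for v in values if low <= v <= high]
```

Applied to token or character counts, the returned list is what would be passed to the histogram routine, so that a few very long instances do not stretch the axis.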

Vocabulary Overlapping Level in Comments from Different Classes
Another relevant characteristic of this dataset is how much overlap there is among the tokens of the different classes, i.e., the reactions related to the comments. Table 3 shows the overlapping level considering unique tokens for each reaction.
For example, this table shows that 27% unique tokens of

Most Common Unique Comment Tokens in Each Class
For the last metric extracted from the dataset, the frequency of each unique term in each reaction was identified, i.e., those terms that are linked only to a specific reaction. The following figures show the word clouds for these terms, considering each reaction. Figure 8 corresponds to the word cloud for comment terms associated with the HAHA reaction, Fig. 9 corresponds to the ANGRY reaction, Fig. 10 represents the SAD reaction and, finally, Fig. 11 shows the LOVE reaction.
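The two vocabulary statistics described in this section and the previous one can be sketched together. The exact definition behind the overlap table is not given in the text, so the sketch assumes one plausible reading: for a pair of reactions (a, b), the share of the distinct tokens of a that also occur in b. The function name and the input layout (a list of comment tokens per reaction) are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def vocabulary_stats(
    tokens_by_reaction: Dict[str, List[str]],
) -> Tuple[Dict[str, Dict[str, float]], Dict[str, Counter]]:
    """Return (overlap, exclusive) for the given per-reaction token lists."""
    vocab = {r: set(toks) for r, toks in tokens_by_reaction.items()}

    # overlap[a][b]: share of a's distinct tokens that also appear in b.
    overlap = {
        a: {
            b: (len(vocab[a] & vocab[b]) / len(vocab[a])) if vocab[a] else 0.0
            for b in vocab
            if b != a
        }
        for a in vocab
    }

    # exclusive[a]: frequency of tokens that appear under reaction a only,
    # i.e. the terms a per-reaction word cloud would be drawn from.
    exclusive = {}
    for a, toks in tokens_by_reaction.items():
        others = set().union(*(vocab[b] for b in vocab if b != a))
        exclusive[a] = Counter(t for t in toks if t not in others)
    return overlap, exclusive
```

Calling `exclusive["HAHA"].most_common(50)` on the result would list the most frequent terms linked only to the HAHA reaction.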

Experimental Setup
From the 1,020,557 remaining elements, a random sample of quadruples (title, subtitle, comment, reaction) was selected in order to estimate the value of the desired parameter, that is, the Fleiss kappa agreement measure [28] for this particular dataset. To determine the sample size, the finite population determination formula was used. This is shown in equation (2), where n is the size of the resulting sample, N is the population size, Z is the statistical parameter depending on the confidence level, e is the margin of error, p is the probability of success, and q = (1 − p).
$$n = \frac{N Z^2 p q}{e^2 (N - 1) + Z^2 p q} \qquad (2)$$
As there were no known proportions from p and q for this dataset, their values were set as 0.5. The confidence level was set at 95% and the maximum allowable error, at 5.23%. These last two parameters were the best possible with the resources available, but still considered acceptable for this work. The resulting sample size was 247.
The sample was split into 10 sets of 24 or 25 quadruples. Each set was then used to build a Google Form [62] with one classification task per quadruple. Set size was determined experimentally, since it was observed that larger sets yielded poorer agreement results; this could be due to human rater loss of concentration.
Hsueh et al. [63] stated that, to carry out the manual tagging phase, it is appropriate to involve experts in the task. Following this suggestion, psychologists were asked to carry out the manual classification task. Participants were shown a news title, a subtitle, and a comment, and then were asked to select which reaction, among four possible options (ANGRY, SAD, HAHA, and LOVE), the comment conveyed. Participants were allowed to select a second choice, but this was optional. Around 25 psychologists took part in this classification task. Each comment was reviewed by at least three individuals, as annotation quality can be improved through cross-validation and verification by several annotators [30]. After the review, the Fleiss kappa agreement measure was calculated globally, and then, each reaction was considered versus the others, i.e., considering one reaction as a category and the remaining three as another category. Another way to measure the agreement, as done in [54], was to consider the most voted class for each comment as the valid label. Then, the Fleiss kappa agreement measure was calculated between the original reaction and the reaction that received the most rater votes for each comment. In the case of a draw for most voted reaction, the optional secondary response and the original tag were used, if necessary, to break the tie.
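The voting step can be sketched as follows. The order in which the tie-breakers are applied (secondary responses first, then the original tag) is an assumption, as are the function name and the choice to return None when the tie cannot be resolved.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence


def voted_label(
    primary: Sequence[str],
    secondary: Iterable[Optional[str]] = (),
    original: Optional[str] = None,
) -> Optional[str]:
    """Most voted reaction for one comment, with assumed tie-breaking."""
    if not primary:
        raise ValueError("at least one primary vote is required")

    def leaders(votes: Counter) -> list:
        top = max(votes.values())
        return [label for label, n in votes.items() if n == top]

    votes = Counter(primary)
    tied = leaders(votes)
    if len(tied) == 1:
        return tied[0]

    # Tie-breaker 1 (assumed order): count optional secondary responses.
    votes.update(s for s in secondary if s is not None)
    tied = leaders(votes)
    if len(tied) == 1:
        return tied[0]

    # Tie-breaker 2: fall back to the original tag if it is among the leaders.
    return original if original in tied else None
```

For example, `voted_label(["ANGRY", "SAD", "HAHA"], original="SAD")` resolves the three-way draw in favor of the original tag.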
Finally, to gain some insight about challenging cases, comments were analyzed using BabelSenticNet [32] to extract the main concepts and the overall polarity of each class. A global

Results and Discussion
In this section, the main results obtained after the manual tagging process are described. These results are included as Electronic supplementary material 2. Overall results are presented in Table 4 and show that agreement is moderate. The global Fleiss kappa score is 0.49 and, if individual reactions are considered, agreement is highest for LOVE and lowest for SAD.
As can be seen in Table 5, if the original reaction in the dataset is considered as another reviewer, the global Fleiss kappa score drops to 0.4426, which is still within the moderate agreement zone. The individual reaction with the highest value is still LOVE, but the lowest value is now shared by ANGRY and SAD.
In the results presented in Table 6, to give more weight to the original reaction and to filter possible manual classification outliers, the manually classified reaction for every comment was decided by a vote. Fleiss kappa was then calculated between the most voted reaction for every sample and the original dataset reaction.
The second measure presented in Table 6 also considers the secondary reaction (if selected) as a vote. In the last two measures of the same table, if the voting is tied and the original reaction is among the most voted ones, then the voting result is set to the original reaction. As can be seen, all measures are also within the moderate agreement zone.
To visualize where the disagreements between the original tag and the human raters were, a confusion matrix was built; the result is presented in Fig. 12. As can be seen, ANGRY was the most accurately predicted reaction but also the one with the most false positives. Every reaction was confused with ANGRY, HAHA being the worst case. The remaining three reactions did not present classification problems among themselves.
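The voting variants and the confusion counts described above can be sketched as follows. This is a hedged illustration, not the procedure used to produce Table 6 or Fig. 12: the function names, the flags use_secondary and tie_to_original, the toy comments, and the choice to leave unresolved ties undecided (None) are all assumptions.

```python
from collections import Counter


def voted_label(primary, secondary=(), original=None,
                use_secondary=False, tie_to_original=False):
    """Return the most voted reaction, or None if a tie cannot be resolved."""
    votes = Counter(primary)
    if use_secondary:
        # Optional second choices count as extra votes; None means "not selected".
        votes.update(s for s in secondary if s is not None)
    top = max(votes.values())
    winners = [label for label, n in votes.items() if n == top]
    if len(winners) == 1:
        return winners[0]
    if tie_to_original and original in winners:
        return original
    # Assumption: ties that the original tag cannot break are left undecided.
    return None


def confusion_counts(pairs):
    """Count (original tag, voted label) pairs, skipping undecided votes."""
    return Counter((orig, voted) for orig, voted in pairs if voted is not None)


# Toy comments (invented for illustration).
comments = [
    {"original": "HAHA", "primary": ["ANGRY", "ANGRY", "HAHA"],
     "secondary": [None, "HAHA", None]},
    {"original": "LOVE", "primary": ["LOVE", "ANGRY", "SAD"],
     "secondary": [None, None, "LOVE"]},
    {"original": "SAD", "primary": ["SAD", "SAD", "ANGRY"], "secondary": []},
]

primary_only = [voted_label(c["primary"]) for c in comments]
with_tiebreak = [
    voted_label(c["primary"], c["secondary"], c["original"],
                use_secondary=True, tie_to_original=True)
    for c in comments
]
confusion = confusion_counts(zip((c["original"] for c in comments), primary_only))
```

In the first toy comment, counting the secondary choice turns a 2 to 1 majority for ANGRY into a tie, which the original tag then resolves in favor of HAHA; this shows how the variants that favor the original reaction can change the voted label.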
As can be seen in the confusion matrix, most errors were between ANGRY and the rest of the reactions. To analyze the potential cause of these misclassifications, the Spanish version of BabelSenticNet [32] was used to extract and evaluate the polarity of the concepts mentioned in the comments.
The results are presented in Figs. 13, 14, 15, and 16, which show a concept cloud for every reaction. The color of the concepts in the cloud indicates their polarity. The more intense the red, the more negative the concept is; the more intense the green, the more positive the concept is; gray concepts are close to neutral.
In addition to this, for every reaction, the average polarity of all concepts was also calculated. The reaction with the most negative average polarity is HAHA (-0.00725), followed by ANGRY (0.02047), SAD (0.03534), and LOVE (0.0989). This could explain why psychologists misclassified many of the comments tagged with HAHA as ANGRY. On the other hand, the reaction with the most positive average polarity, LOVE, was the least confused with ANGRY. Table 7 presents some of the most common misclassifications detected, which are HAHA, SAD, and LOVE classified as ANGRY. The other regions of the confusion matrix did not present a relevant number of misclassifications.
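The per-reaction average polarity mentioned above can be sketched as a simple mean over concept scores. The lexicon below is a hypothetical stand-in with invented values, not BabelSenticNet data or its API; the sketch also assumes that every extracted concept occurrence counts once and that concepts missing from the lexicon are ignored.

```python
from statistics import mean


def average_polarity(concepts_by_reaction, polarity):
    """Mean polarity of the concepts extracted for each reaction."""
    result = {}
    for reaction, concepts in concepts_by_reaction.items():
        # Assumption: concepts without a lexicon entry are skipped.
        scores = [polarity[c] for c in concepts if c in polarity]
        result[reaction] = mean(scores) if scores else None
    return result


# Hypothetical lexicon: the concepts and scores are invented for illustration.
toy_polarity = {"impuesto": -0.4, "droga": -0.8, "justicia": 0.6, "pueblo": 0.2}

toy_concepts = {
    "ANGRY": ["impuesto", "droga", "desconocido"],  # last concept is not in the lexicon
    "LOVE": ["justicia", "pueblo"],
}

averages = average_polarity(toy_concepts, toy_polarity)
# Reactions ordered from most negative to most positive average polarity.
ranking = sorted(averages, key=averages.get)
```

Sorting the resulting averages reproduces the kind of ordering reported above, from the most negative reaction to the most positive one.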
In some of the comments tagged as HAHA but classified as ANGRY, the author of the comment probably considers that the person mentioned in the topic has little or no credibility, and everything that person does or says makes the commenter laugh. Still, these comments contain some words with a very negative connotation that may have misled the human reviewers. This is the case of comments 4 and 8 in Table 7. Other comments misclassified in this way were sarcastic (such as comment 1). As regards the comments tagged as LOVE but classified as ANGRY, the person who posted the comment agrees with the event reported in the news but, in doing so, also criticizes something or someone. Examples of this behavior can be seen in comments 2, 7, and 9 in Table 7.
SAD and ANGRY comments are difficult to distinguish. An example of this is comment 3 in Table 7. A hint to distinguish them may be that SAD comments are written in a more respectful way than ANGRY comments.
Finally, comments 4, 5, and 6 in Table 7 present obvious hints (highlighted in bold) about what the actual reaction is. The misclassification in those cases could be due to the lack of concentration of the human reviewers. This may mean that the questionnaires should probably be shorter.
Table 7 (fragment)
Comment: Si quieren trabajar qué paguen impuestos como todo comerciantes x eso ellos venden mas barato x que usan espacio publico y no pagan impuesto y de paso se traen la droga para vender en Argentina como nadie controla nada en este pais. Estamos fritos con esta gente
Translation: If they want to work, they must pay taxes like all shop owners, they offer cheaper prices because they use public space and do not pay taxes, and they also bring drugs to sell in Argentina, since there are no controls for anything in this country. We are hopeless with these people
8. Original reaction: HAHA. Rater classification: 2/3 as ANGRY
Title: Marcha contra ajuste de planes sociales
Translation: March against the adjustment of social welfare plans
Comment: El día que hagan un reclamo legítimo capaz que el pueblo los acompañe
Translation: The day they make a legitimate claim maybe the people will march with them
9. Original reaction: LOVE. Rater classification: 2/3 as ANGRY
Title: Comienza el juicio a Lázaro Báez por la ruta del dinero K
Translation: Lazaro Báez's trial begins on K money route
Comment: Tiene que ser rápido las pruebas son contundentes hay pruebas de sobra no sé qué tanto tienen que estudiar
Translation: It has to be fast, the evidence is conclusive, there is plenty of evidence, I don't know what it is they have to analyze so much

Conclusions and Future Work
In this paper, a new dataset of news, comments, and emotional reactions was presented. The dataset consists of 1,020,557 comments, each one tied to a news article (title and subtitle) and a specific reaction (the true value class). The number of entries is significantly larger than in other manually tagged sentiment datasets that have been built for the Spanish language [44]; such a size can be easily achieved by using noisy labels as content tags. However, no studies, at least for the Spanish language, compare the reliability of those tags with that of manual tags, nor has any study linked tags directly to the comment instead of linking them to the originating news article. As seen in the previous sections, emotionally tagged, distantly supervised datasets can be automatically collected from social media articles. The agreement measured between the human raters is very similar to the one between human raters and the original tag; every measure presented is within the moderate agreement zone, which other authors [53,54] considered suitable for sentiment classification training.
Although the agreement measure is slightly lower than in fully manually classified datasets, larger datasets can be built by using the guidelines presented in this article, as little or no manual tagging is required.
Filtering out duplicate comments and trolls improves the agreement measures presented. Many of the social media users who post duplicate comments are trolls whose input cannot be trusted.
The ANGRY reaction presented a significant number of false positives; the authors assume that this may be a consequence of unfiltered troll activity, so refining the troll filtering process may help mitigate this issue. This confusion could also be caused by misinterpreted sarcasm in the comments. Therefore, the presence of sarcastic comments in the dataset should be explored further, and sarcasm detection could be performed following the recommendations of Majumder et al. [5]. In addition, a more detailed polarity analysis by class could be performed by applying Sentic patterns [64] along with BabelSenticNet [32].
The next step of this research is to train machine-learning algorithms that can predict the emotion using a comment as input and can also explain the prediction [65]. As seen in SemEval 2019 [54], contextual information can be used to improve classification accuracy in textual dialogs; this could also be the case for interactions in social media, as responses to news articles are a form of communication or dialog. The use of semantic information should also be explored, as it may help improve classification accuracy [66].