The problem of unwanted e-mail (spam) has been growing for years. Several methods have been proposed to deal with it, including blacklists of known spammers, handcrafted rules, and machine learning techniques.
In this paper we investigate the performance of the k Nearest Neighbours (k-NN) method in spam detection tasks. To this end, a number of different document encodings were tested.
Moreover, we study how reducing the vocabulary size affects this task. In the experimental design, different values of k were considered, and results were analyzed on a public mailing list and on personal e-mail collections. The experiments showed that results obtained on public mailing lists tend to be overly optimistic and should not be considered representative of those expected for personal user accounts.
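
As an illustration of the kind of classifier studied here, the following is a minimal sketch of k-NN spam detection. It assumes a term-frequency bag-of-words encoding, cosine similarity, and majority voting; these choices, the function names (`encode`, `cosine`, `knn_classify`), and the toy messages are assumptions made for the example and are not necessarily the encodings or data used in the experiments.

```python
import math
from collections import Counter

def encode(text, vocabulary=None):
    """Encode a message as term-frequency counts (one possible encoding).

    If a vocabulary is given, tokens outside it are dropped, which mimics
    vocabulary size reduction.
    """
    tokens = text.lower().split()
    if vocabulary is not None:
        tokens = [t for t in tokens if t in vocabulary]
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_classify(message, training, k=3, vocabulary=None):
    """Label a message by majority vote among its k most similar neighbours."""
    query = encode(message, vocabulary)
    scored = [(cosine(query, encode(text, vocabulary)), label)
              for text, label in training]
    # Sort by similarity only, most similar first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy collection, for illustration only.
training = [
    ("cheap pills buy now", "spam"),
    ("win money now click here", "spam"),
    ("free offer buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the attached report", "ham"),
    ("lunch on monday with the team", "ham"),
]

if __name__ == "__main__":
    for k in (1, 3):
        print(k, knn_classify("buy cheap pills now", training, k=k))
```

Varying `k` and passing a restricted `vocabulary` set correspond, in this sketch, to the two experimental factors mentioned above: the number of neighbours and the vocabulary size.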