Statistics Classroom - Government of Macao Special Administrative Region Statistics and Census Service

Spam filtering approaches fall into two broad categories: one uses computer techniques to detect spam, and the other builds a statistical spam filtering model*. For now, let's focus on the latter approach. First, collect a sizable number of normal emails and spam emails. Next, use text mining to identify the features that distinguish the two types of emails, e.g. the frequency of specified words and symbols, the ratio of symbols to words, sentence length, the use of upper and lower case letters (e.g. in English), etc. Then set up a model based on these features, e.g. if the word qian "錢" (money) occurs at least a certain number of times, the probability that the email is spam increases. To determine whether or not an incoming email is spam, the email is scored using this model, and its calculated probabilities are compared with the preset default values.
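The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the model used by any particular mail system: it uses word frequencies as the only feature, a naive Bayes model with add-one smoothing as one possible statistical model, and a cut-off of 0.5 as the preset value. The function names (tokenize, train, spam_probability, is_spam) and the threshold are hypothetical and chosen for this example.

```python
import math
import re
from collections import Counter

THRESHOLD = 0.5  # assumed cut-off; a real filter would tune this value


def tokenize(text):
    # Lowercased English word tokens only; Chinese text would need word segmentation.
    return re.findall(r"[a-z']+", text.lower())


def train(spam_emails, normal_emails):
    # Count how often each word occurs in each type of email.
    # Assumes both lists are non-empty.
    spam_counts = Counter(w for e in spam_emails for w in tokenize(e))
    normal_counts = Counter(w for e in normal_emails for w in tokenize(e))
    return {
        "spam": spam_counts,
        "normal": normal_counts,
        "vocab": set(spam_counts) | set(normal_counts),
        "prior_spam": len(spam_emails) / (len(spam_emails) + len(normal_emails)),
    }


def spam_probability(model, text):
    vocab_size = len(model["vocab"])
    spam_total = sum(model["spam"].values())
    normal_total = sum(model["normal"].values())
    log_spam = math.log(model["prior_spam"])
    log_normal = math.log(1 - model["prior_spam"])
    for word in tokenize(text):
        if word not in model["vocab"]:
            continue  # words never seen in training carry no evidence
        # Add-one smoothing keeps unseen word/class pairs from giving zero probability.
        log_spam += math.log((model["spam"][word] + 1) / (spam_total + vocab_size))
        log_normal += math.log((model["normal"][word] + 1) / (normal_total + vocab_size))
    # Turn the two log scores into a probability between 0 and 1.
    # Note: very long emails could overflow math.exp; this sketch does not guard against that.
    return 1 / (1 + math.exp(log_normal - log_spam))


def is_spam(model, text, threshold=THRESHOLD):
    # Score the incoming email and compare it with the preset value.
    return spam_probability(model, text) > threshold


if __name__ == "__main__":
    model = train(
        ["Win money now", "Free money offer!!!"],
        ["Meeting agenda for Monday", "Lunch at noon tomorrow"],
    )
    print(is_spam(model, "free money"))
    print(is_spam(model, "meeting tomorrow"))
```

The other features mentioned above, such as the ratio of symbols to words or sentence length, could be added as further inputs to the model. They are left out here to keep the sketch short.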