Last time I promised to write about tokenisation and training. In between I classified the existing comments on my blog as spam or ham. This turned out to be easier than I thought, because I had deleted all the spam before I turned on my test spam plugin, which marks all comments with 0 (neutral) instead of NULL (unclassified).
step 4: tokenisation
I start by creating a new script based on the one from last time. Once I get a classifier running I will merge the two and start using feedback from the classifier when updating the training data, so that I only have to manually touch comments that the classifier gets wrong. For now it's easier to keep the two scripts separate, though. I also decide to use only the comment text for now. I'll replicate my approach for the other fields later. If the classifier is as good as I hope it is, the comment text should be a good indicator anyway. Like last time, I process each row, throwing away the parts I don't want. This time I produce a pair of (Int, String) to hold the spam score and the comment. The code looks like this:
tuplify [id,spam,name,email,url,comment] = (read spam :: Int, comment)
The type annotation is needed for now but will disappear later when the type inferencer has more context. Now I can start on the tokeniser. This means regex. Yes, it's annoying and incomplete, but it's also the easiest solution. I start with a simple
import Text.Regex (splitRegex, mkRegex)
tokenise = splitRegex (mkRegex " ")
This doesn't work very well, since splitting on single spaces leaves punctuation glued to the words, so I use Paul Graham's definition from A Plan for Spam. It's a little outdated, but still good enough for my purposes, so I will stick to his description pretty closely. My implementation of Graham's tokeniser looks like this:
tokenise =
  splitRegex (mkRegex "[^a-zA-Z0123456789$'-]") & map strip & filter (/="")
  where strip = dropWhile (=='\n')
The definition of strip is hokey, relying on the fact that newlines can only appear at the beginning of a string because of the way that the regex engine interprets my splitter regex. I thought that some utility library I had installed had a real strip, but this turns out not to be the case, and the only definition on Hackage is part of a gigantic utility library that I don't want to chance installing right now. Anyway, with a tokeniser defined, I add it to tuplify:
tuplify [id,spam,name,email,url,comment] = (read spam :: Int, tokenise comment)
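As an aside, the same token rule can be sketched without regex, using only base. This is an illustration rather than the code used in this post: the names isTokenChar and tokeniseSketch are made up here, and it treats newlines like any other separator, so it needs no strip at all. That may differ slightly from what the regex engine does with newlines.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper, isDigit)

-- Characters that stay inside a token: ASCII letters, digits,
-- dollar signs, apostrophes and dashes (the same set as the regex).
isTokenChar :: Char -> Bool
isTokenChar c =
  isAsciiLower c || isAsciiUpper c || isDigit c || c `elem` "$'-"

-- Skip any run of separator characters, take the next run of token
-- characters, and repeat until the input is exhausted.
tokeniseSketch :: String -> [String]
tokeniseSketch s =
  case break (not . isTokenChar) (dropWhile (not . isTokenChar) s) of
    ("", _)       -> []
    (token, rest) -> token : tokeniseSketch rest
```

Because separators are skipped before each token is read, empty tokens never appear and the filter (/="") step becomes unnecessary in this variant.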
Then I divide the comments into spam and ham (first throwing out any still on the fence at 0):
import Data.List (partition)
readCSV =
  parseCSVFromFile "evo_training.csv"
  >>= fromCSV
    & map tuplify
    & filter (fst & (/=0))
    & partition (fst & (>0))
    & return
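To see the filter and partition steps in isolation, here is a small sketch on already-tokenised rows, written with ordinary function composition instead of my & operator. The name splitCorpus is hypothetical and only serves the example.

```haskell
import Data.List (partition)

-- Rows are (score, tokens). Drop the undecided rows (score 0), then
-- split the rest into (spam, ham): positive scores on the left,
-- negative scores on the right.
splitCorpus :: [(Int, [String])] -> ([(Int, [String])], [(Int, [String])])
splitCorpus = partition ((> 0) . fst) . filter ((/= 0) . fst)
```

The scores are still attached to each row at this point, which is why the next step has to drop them with snd.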
Then I drop the scores and count all the spam/ham words into a histogram:
readCSV =
  parseCSVFromFile "evo_training.csv"
  >>= fromCSV
    & map tuplify
    & filter (fst & (/=0))
    & partition (fst & (>0))
    & both (concatMap snd & histogram)
    & return
histogram is already defined in my utility library (so are both and & in case you were wondering):
histogram :: (Ord a) => [a] -> Map.Map a Int
histogram l = foldl' (\ m x -> Map.insertWith' (+) x 1 m) Map.empty l

both f (x,y) = (f x,f y)

infixl 9 &
f & g = g . f -- also called >>> in Control.Arrow
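The following sketch puts the three helpers together on a tiny hand-made corpus. It is an approximation of my utility library, not a copy: it assumes a recent containers package and uses Data.Map.Strict.insertWith in the role of insertWith', and wordCounts is a name invented for the example.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Count how often each element occurs in a list.
histogram :: Ord a => [a] -> Map.Map a Int
histogram = foldl' (\m x -> Map.insertWith (+) x 1 m) Map.empty

-- Apply the same function to both halves of a pair.
both :: (a -> b) -> (a, a) -> (b, b)
both f (x, y) = (f x, f y)

-- Left-to-right function composition, as defined in the post.
-- Note that this is not Data.Function's (&), which is reverse application.
infixl 9 &
(&) :: (a -> b) -> (b -> c) -> a -> c
f & g = g . f

-- Drop the scores, pool all tokens of each corpus, and count them.
wordCounts :: ([(Int, [String])], [(Int, [String])])
           -> (Map.Map String Int, Map.Map String Int)
wordCounts = both (concatMap snd & histogram)
```

Given two spam comments ["buy","now"] and ["buy"] and one ham comment ["hello","now"], the spam histogram maps "buy" to 2 and "now" to 1, while the ham histogram maps "hello" and "now" to 1 each.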
step 5: scoring
Now all I need to do is come up with a single score for each word from the two histograms. Paul Graham defines this in terms of the usage rates in both spam and ham. His definition seems a little wonky to me, but I copy it for now because it at least worked for him. For each word w in the two histograms I just produced, rs = ws / S and rh = 2 * wh / H, where ws and wh are the counts of w in the spam and ham histograms and S and H are the numbers of spam and ham e-mails (comments, in my case). Both rates are capped at 1. The probability of spam is then rs / (rs + rh). I need to track an additional piece of information: the number of each kind of e-mail. This information is available after the partition step and disappears after the concatMap step, where the words from all e-mails are combined. So I refactor readCSV to call a new function score after the partition:
readCSV =
  parseCSVFromFile "evo_training.csv" >>= fromCSV
    & map tuplify
    & filter (fst & (/=0))
    & partition (fst & (>0))
    & score
    & return
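Before wiring the formula into score, it helps to check it on one word by hand. The sketch below is a standalone illustration with made-up names (spamProbability, spamScore); it includes the cap at 1 on both rates and the rescaling from a probability to the range -100 to 100 that my scores use.

```haskell
-- Graham-style spam probability for a single word.
--   ws, wh      : occurrences of the word in the spam and ham corpora
--   nSpam, nHam : number of spam and ham comments (assumed non-zero)
spamProbability :: Double -> Double -> Double -> Double -> Double
spamProbability ws wh nSpam nHam = rs / (rs + rh)
  where
    rs = min 1.0 (ws / nSpam)       -- usage rate in spam, capped at 1
    rh = min 1.0 (2 * wh / nHam)    -- ham counts double, capped at 1

-- Rescale the probability from [0, 1] to a score in [-100, 100].
spamScore :: Double -> Double -> Double -> Double -> Double
spamScore ws wh nSpam nHam = (spamProbability ws wh nSpam nHam - 0.5) * 200
```

For a word seen 6 times in 8 spam comments and once in 8 ham comments, rs = 0.75 and rh = 0.25, so the probability is 0.75 and the score is 50.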
In score I destructure the pair of lists, measure their lengths, then call the concatMap step. Then I create a nested function score' which has access to the lengths:
score (spam,ham) = score' $ both (concatMap snd & histogram) (spam,ham)
  where bigS = fromIntegral $ length spam
        bigH = fromIntegral $ length ham
        score' (spam,ham) = undefined
score' turns out big and ugly, because I have to account for words that appear in both corpora, only in the ham corpus, and only in the spam corpus. (Words that appear in neither will be handled on the classifier side.) To make this work I chop the map keys into three sets, then call assignScore with a different function for each of them. spamonly and hamonly are simple, while bothScore uses the equation I gave above.
score (spam,ham) = score' $ both (concatMap snd & histogram) (spam,ham)
  where bigS = fromIntegral $ length spam
        bigH = fromIntegral $ length ham
        score' (spam,ham) =
          let spamkeys = Map.keysSet spam
              hamkeys = Map.keysSet ham
              spamonly = spamkeys `Set.difference` hamkeys
              hamonly = hamkeys `Set.difference` spamkeys
              spamham = hamkeys `Set.intersection` spamkeys
              assignScore f set =
                Map.fromList $ Set.toList $ Set.map (\ k -> (k, f k)) set
              bothScore k = ((rs / (rs + rh)) - 0.5) * 200
                where rs = min 1.0 ((fromIntegral (spam Map.! k)) / bigS)
                      rh = min 1.0 (2 * (ham Map.! k |> fromIntegral) / bigH)
          in Map.unions [assignScore (const 100) spamonly,
                         assignScore (const (-100)) hamonly,
                         assignScore bothScore spamham]
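A possible simplification, sketched here under assumptions and not what the script above does: if a missing word is treated as a count of 0, the formula already yields 100 for spam-only words (rh = 0) and -100 for ham-only words (rs = 0), so the three-way split of the keys is not strictly needed. The name scoreSketch is hypothetical, and the sketch assumes both corpora are non-empty so that neither division is by zero.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- bigS, bigH : number of spam and ham comments (assumed non-zero)
-- spam, ham  : word histograms of the two corpora
scoreSketch :: Double -> Double
            -> Map.Map String Int -> Map.Map String Int
            -> Map.Map String Double
scoreSketch bigS bigH spam ham = Map.fromSet wordScore allWords
  where
    -- every word that occurs in at least one corpus
    allWords = Map.keysSet spam `Set.union` Map.keysSet ham
    -- a word missing from a histogram counts as 0 occurrences
    count m k = fromIntegral (Map.findWithDefault 0 k m)
    wordScore k = (rs / (rs + rh) - 0.5) * 200
      where
        rs = min 1.0 (count spam k / bigS)
        rh = min 1.0 (2 * count ham k / bigH)
```

Every word in allWords occurs at least once somewhere, so rs + rh is never zero. The explicit three-way version has the advantage of still working when one corpus is empty, which this sketch does not handle.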
step 6: write CSV
I need to turn each row back into a list of strings so that I can write it to disk as CSV. That's easy, although it seems like boilerplate:
columnify (word,score) = [word, show score]
Now main is simple:
main = readCSV >>= map columnify & printCSV & writeFile "evo_classifier.csv"
This produces the final code:
import Text.CSV (parseCSVFromFile, printCSV)
import Data.List (partition)
import qualified Data.Map as Map
import qualified Data.Set as Set
import Text.Regex (splitRegex, mkRegex)
import Util -- my utility library: (&), (|>), both, histogram

main = readCSV >>= map columnify & printCSV & writeFile "evo_classifier.csv"

-- read the training data and produce a list of (word, score) pairs
readCSV =
  parseCSVFromFile "evo_training.csv" >>= fromCSV
    & map tuplify
    & filter (fst & (/=0))
    & partition (fst & (>0))
    & score
    & Map.toList
    & return

-- keep only well-formed rows of six columns
fromCSV (Left parseError) = error (show parseError)
fromCSV (Right rows) = filter (length & (==6)) rows

tuplify [id,spam,name,email,url,comment] = (read spam :: Int, tokenise comment)

columnify (word,score) = [word, show score]

tokenise =
  splitRegex (mkRegex "[^a-zA-Z0123456789$'-]") & map strip & filter (/="")
  where strip = dropWhile (=='\n')

-- score every word from -100 (ham) to 100 (spam)
score (spam,ham) = score' $ both (concatMap snd & histogram) (spam,ham)
  where bigS = fromIntegral $ length spam
        bigH = fromIntegral $ length ham
        score' (spam,ham) =
          let spamkeys = Map.keysSet spam
              hamkeys = Map.keysSet ham
              spamonly = spamkeys `Set.difference` hamkeys
              hamonly = hamkeys `Set.difference` spamkeys
              spamham = hamkeys `Set.intersection` spamkeys
              assignScore f set =
                Map.fromList $ Set.toList $ Set.map (\ k -> (k, f k)) set
              bothScore k = ((rs / (rs + rh)) - 0.5) * 200
                where rs = min 1.0 ((fromIntegral (spam Map.! k)) / bigS)
                      rh = min 1.0 (2 * (ham Map.! k |> fromIntegral) / bigH)
          in Map.unions [assignScore (const 100) spamonly,
                         assignScore (const (-100)) hamonly,
                         assignScore bothScore spamham]
Next time: I'll create an antispam plugin skeleton, split comments into tokens, and look up the tokens in the training database.