Learn You A Haskell For Great Good

by sandersn 28. May 2011 08:30

 

If you visit learnyouahaskell.com, you'll see that Learn You a Haskell For Great Good originated as a Haskell tutorial when the author decided he wanted to make sure he really knew Haskell. Haskell tutorials are a dime a dozen, especially monad tutorials, but that's exactly what it is; an introductory Haskell tutorial followed by a monad tutorial. It has the distinction that it is very well explained and much more complete than the average tutorial, as well as now existing in dead tree form. But it is not like Real World Haskell, which explains concepts in the order a practising coder would need to get a useful program off the ground. Learn You A Haskell For Great Good assumes that you are a practising coder in an imperative language, but generally explains concepts in order from simplest to most complex.

 

That's cool if you want a summary of the language, and if you're approaching Haskell as an enthusiast rather than a practitioner then you will probably like it. The focus is on expanding your mind, not (for example) finding the right tradeoffs between safe and succinct. It's arguable that this is the right approach for Haskell, which was traditionally a research playground, and the that the people behind Real World Haskell are engaging in wishful thinking when they try to make Haskell a production language.

 

Personally I prefer the practical approach (I've even used Haskell for scripting (!) before), but the tutorial approach has its advantages; it's a lot more thorough, and it covers cool new stuff in favour of things like database libraries. Even after 2-3 years of Haskell use, I learned interesting new things about Haskell's kinds, applicative functors and zippers.

 

The book is divided into two tuturials: basic Haskell and monads. The basic Haskell section is not too surprising--it covers the syntax and semantics of the basic constructs of the language. I think it could have done with a bit more emphasis on solving problems recursively, because the primary difficulty of learning Haskell is not the syntax. Really, it assumes that you are an intelligent person who is willing to work through new concepts yourself. If you don't already know at least one of: Lisp*, Erlang or ML*, then for additional help, you might want look at the Little Schemer, which does a great job of teaching recursive thinking without much language overhead.

 

*Popular members of the Lisp family are Scheme, Common Lisp and Clojure. Popular members of the ML family are F#, Caml and SML.

 

The other half of the book is a monad tutorial. It's tied with Real World Haskell for the best tutorial I've read. Instead of starting with ugly code and cleaning it up, Learn You a Haskell starts with functors, then progresses through applicative functors, monoids and on to monads. Although I suspect that this order does not recapitulate the history of these libraries, it makes a lot of sense. You get to see how functors generalise `map` on lists, then how applicative functors allow arbitrary functions to be lifted and applied in a functor (as well as providing left-to-right evaluation, which is something I missed when I was learning monads), finally ending with the way that monads allow chaining of functions to produce an imperative style.

 

I don't know if this is how the web site presented it--it takes a lot of space, but I think that it's worth it. If you have bounced off monad tutorials in the past, this one may work for you. You'll do a lot of reading, but you'll also end up with a wide, solid base for understanding monad usage.

 

Besides, if you get bored, there's always the fact that Learn You a Haskell is Officially Funny. However, the humour quickly fades into the background. This is no _why's Poignant Guide--the humour is closer to the occasional cracks in my ancient Learn Visual J++ 1.1 book. Well, I guess the drawing are pretty cool. I can't believe Capcom let them include a cartoony sketch of Quickman, but it's AWESOME. (There's also a Wiggler from Mario used to demonstrate head/tail.) Unfortunately, the drawings too fade into the background as they become less topical, or at least start referring to pop culture that I haven't heard of. Overall, it reminds me a lot of O'Reilly-quality writing. That's good but (and here I'm biased) not too surprising coming from the extraordinarily literate Haskell community.

 

Although the writing is good, it's hard for me to estimate how good the book is at teaching, since I already know Haskell. Fortunately, there is a chapter on zippers tucked in at the end, so I can use that to estimate for the rest of the book. Here, I easily acquired an intuitive understanding of what zippers are and how they work--the examples and analogies were good. But I didn't feel like I got a concrete understanding of their implementations, or when I'd need to implement my own. I'll need to actually play with zippers myself to understand them properly.

 

Bottom line: I didn't think too much of the basic Haskell tutorial, but I liked the monad tutorial. It's worthwhile if you have the patience to read the whole thing. Compared to the other two Haskell books I own, Haskell: The Craft of Functional Programming  (Thompson) and Real World Haskell (O'Sullivan, Goerzen & Stewart), Learn You a Haskell For Great Good fills the niche for smart programmers who already know an imperative language and want to learn Haskell for its own sake, covering concepts thoroughly whether or not they will be practically useful. The result is a more focussed book that covers less.

Tags: , , , ,

Books | Language | Programming

Fing 0.2 module-level overview

by sandersn 12. December 2010 02:42

It's been about six months since I blogged about Fing. I've only touched the code a couple of times since then, so I will restart the series by describing the module organisation as it stands today, in the process of developing version 0.2

Fortunately, I wrote all this down six months ago on the fing github wiki, so if you want the details, including module contents, look there. This is a module-level overview.

Util

Utilities. That usually means: utilities that ship with Haskell but not with F#.

Types

The basic types, plus basic operations on them: fold, map, string formatting.


Everything below this line depends on Util and Types (except for Opt).

ParsedTypes

string → internal type representation. For example, user entered types.

FSharpTypes

F# Powerpack Metadata → internal type representation.

CSharpTypes

C# Reflection representation → internal type representation. This is not implemented yet; maybe it should be called DotNetTypes instead.

Search

Search for functions matching a given type.


Everything below this line depends on everything above this line (mostly).

Fing

The public interface: string → list<result> and related operations.

Opt

Command line parsing. I can't believe this isn't part of the standard library.


Program depends on Fing and Opt

Program

Command line interface. To be augmented later by other kinds of interfaces, probably an ASP one next.

Most of my time so far has been refactoring and working on the woefully incomplete tests. Earlier I made a conscious decision to push out a number of features without testing so that I'd have something people could use. Now I'm coming back through and writing the missing tests.

My approach was to build a decent database of types for testing. Most of my tests then assert that a property holds about all the types. This is a lot like QuickCheck, but without the random generation. I personally think it serves me better, because I manually generated all the complex cases of the grammar without recursion on the obvious cases. I could be missing stupid-simple cases, of course. (Also I didn't want to mess with installing F#'s port of QuickCheck—without typeclasses, I assume that it's kind of klunky and painful.)

For example, my newest test asserts that, for all types, if you shuffle the order of type variables, then run Types.index, you will get the original type back. This holds because (1) Types.index overwrites type variables with ones in canonical order and (2) all my test cases were manually created with type variables in canonical order. Actually (2) is a bad idea; it's OK for parser testing but is overly simple when you start testing semantics.

The more I think about it, the more I think I may need something like QuickCheck, at least some hacky fuzzing of my existing, over-regular test cases. That should probably be the next major step in the testing of the code, even if F# Quickcheck is painful to install.

Tags: , , , ,

Language | Programming

Writing a spam classifier for b2evolution (part 4)

by sandersn 4. November 2010 02:17

Part 4 isn't a full post--just a couple of notes:

Good: The spam filtering works a little better than my pessimistic assessment from part 3. It figured out that bulletin boards are commonly hijacked, so index.php and member.php are both good indications of spam. Also, .de sites fall on the spam side of things for some reason. Sorry, Germany!

The numbers are still pretty low, so I just changed the deletion limit from 100 to 5. If you post and your comment is rejected, I'm sorry. The surefire way to get past the filter is to (1) leave out your URL or (2) register with the blog, which shows your user URL on the comment instead of the comment URL; the comment URL is then set to NULL.

Bad: URLs are hard to train on because they tend to looklikethis, which is differentfromthis, even though both strings contain 'this'. One way to get around this is to break the URLs into words. I checked around the internet for something like this and didn't find it, but I may have just been looking for the wrong thing. I know machine learning approaches for this are common in Chinese and Emad worked on a similar vowel-insertion problem for Arabic. Another way would be a simplified version of the unsupervised grammar induction that Menno van Zaanen does.

Two problems with this idea: (1) I've got a little bored with this already, and (2) I'd rather restart work on Fing. F# is a lot more fun to hack in than PHP, and I can't see how to make a word-breaking classifier simple enough to make it worthwhile in PHP.

Tags: , ,

Programming

Writing a spam classifier for b2evolution (part 3)

by sandersn 30. October 2010 19:21

step 7: create plugin skeleton for b2evolution.

I actually did this several weeks ago, cobbling together a skeleton from the default antispam plugin and the b2evo plugin documentation. So I'm going to skip the details, show the skeleton, and work from there.

<?php
class bayes_antispam_plugin extends Plugin
{
  var $name = 'Bayesian Antispam';
  var $code = '';
  var $priority = 60;
  var $version = '0.1-dev';
  var $author = 'Nathan Sanders';
  var $group = 'antispam';
  var $number_of_installs = 1;
  function GetSpamKarmaForComment(array &$params)
  {
    $comment = $params['Comment'];
    if($comment->get_author_email()=='not.my.real.address@yahoo.com') {
      return -100;
    } else {
      return 0;
    }
  }
}
?>

I don't know PHP, but I guess those configuration parameters are static fields. Maybe they override some defaults, since we're using implementation inheritance here. Yay PHP, it's caught up to late 90s software engineering practise. Anyway. Like late 90s Java, PHP has an equivalent of Javadoc. I start from the b2evo API documentation for the basic_antispam_plugin and poke around until I learn that I need to override GetSpamKarmaForComment, which takes an associative array reference called params, which contains a Comment object with a number of getters. The function returns a score from -100 (ham) to 100 (spam). I decide that commenters using my yahoo address get a free pass for now so I can test the plugin. I upload and install the plugin, then create a comment. Sure enough, it gets a -100 score.

step 8: detour: prototype classifier

At this point I realise three things: (1) I am about to start writing real PHP (2) I have a bunch of new spams since last time and most important (3) I don't have any idea of how well the classifier is going to work. I end up taking out 10 spams and 10 hams out of the data set for evaluation purposes, then writing the classifier in Haskell first. This turns out to be a good decision because it's a lot easier to prototype locally. I'll blog about the code later, but the result is that I find out that URLs are a better indicator of spam than the comment text. At a high-level, this makes sense because the URL field is the where the spammers deposit their payload. However, the bad performance of the comment text makes me wonder if I should be using more sophisticated features or maybe a more sophisticated classifier. I think the third option, insufficient training data, is the most likely, though.

step 9: add classifier database structure

I override GetDbLayout so b2evo will prompt me to set up the correct database structure. I include tables for message content and author url. I still hope to get message content working better.

  function GetDbLayout() {
    $prefix = Plugin::get_sql_table();
    return array("CREATE TABLE IF NOT EXISTS `${prefix}content` ( `word` MEDIUMTEXT NOT NULL, `score` FLOAT( 8, 4 ) NOT NULL DEFAULT  '0.0', PRIMARY KEY (  `word`(256) ) ) ENGINE = MYISAM;",
                 "CREATE TABLE IF NOT EXISTS `${prefix}author_url` ( `word` MEDIUMTEXT NOT NULL, `score` FLOAT( 8, 4 ) NOT NULL DEFAULT  '0.0', PRIMARY KEY (  `word`(256) ) ) ENGINE = MYISAM;");
  }

I also have to bump the version number to 0.2-dev so that b2evo will notice the change.

var $version = '0.2-dev';

step 10: write classifier

The classifier turns out to be easy. Although my PHP is really clumsy, I don't want to spend time learning the full database API. Or PHP, for that matter. So I modify some example code.

  function GetSpamScore($sample, $table) {
    $prefix = Plugin::get_sql_table();
    $ws = preg_split("/[^a-zA-Z0-9$']/", $sample);
    if(count($ws) == 0) {
      return -40.0;
    }
    $total = 0.0;
    foreach($ws as $i => $w) {
      $escaped_w = mysql_real_escape_string($w);
      $r = mysql_query("SELECT score FROM ${prefix}${table} WHERE word='${escaped_w}'");
      $count = 0;
      while($row = mysql_fetch_assoc($r)) {
        $n = $row['score'];
        $count++;
        break; // only want the first one
      }
      if($count == 0) {
        $total += -20.0;
      } else {
        $total += $n;
      }
    }
    $total /= count($ws);
    return $total;
  }
  function GetSpamKarmaForComment(array &$params)
  {
   $comment = $params['Comment'];
    return (bayes_antispam_plugin::GetSpamScore($comment->get_author_url(),
                                                "author_url")
            + bayes_antispam_plugin::GetSpamScore($comment->get_content(),
                                                  "content")) / 2;
  }

As you can see, all I do is look up each word in the database and find the average spam score. There's not too much to say, except that I hope mysql_real_escape_string is the real real string escaper. Unfortunately, the real-world performance is pretty bad. The results are scaled way too small and, worse, the classification isn't very good either. Plus, between writing part 2 and part 3, the spam comments turn from a stream to a deluge. Instead I've got comments turned off right now and am looking at my options now that spam is a real problem. I think the best two are addition of a custom CAPTCHA ("enter the word 'orange'") or switching to a blog engine with built-in spam filtration.

Tags: , ,

Programming

Writing a spam classifier for b2evolution (part 2)

by sandersn 29. October 2010 11:57

Last time I promised to write about tokenisation and training. In between I classified the existing comments on my blog as spam or ham. This turned out to be easier than I thought, because I had deleted all the spam before the time that I turned on my test spam plugin, which marks all comments with 0--neutral--instead of NULL--unclassified.

step 4: tokenisation

I start by creating a new script based on the one from last time. Once I get a classifier running I will merge the two and start using feedback from the classifier when updating the training--I'll only have to manually touch comments that the classifier gets wrong. For now it's easier to keep the two scripts separate, though. I also decide to only use comment text for now. I'll replicate my approach to the other fields later. If the classifier is as good as I hope it is, the comment text should be a good indicator anyway. Like last time, I process each row, throwing away the parts I don't want. This time I produce a pair of (Int, String) to hold the comment and its spam score. The code looks like this:

tuplify [id,spam,name,email,url,comment] = (read spam :: Int, comment)

The type annotation is needed for now but will disappear later when the type inferencer has more context. Now I can start on the tokeniser. This means regex. Yes, it's annoying and incomplete, but it's also also the easiest solution. I start with a simple

import Text.Regex (splitRegex, mkRegex)
tokenise = splitRegex (mkRegex " ")

This doesn't work very well, so I use Paul Graham's definition from A Plan for Spam. It's a little out-dated, but still good enough for my purposes, so I will stick to his description pretty closely. My implementation of Graham's tokeniser looks like this:

tokenise =
  splitRegex (mkRegex "[^a-zA-Z0123456789$'-]") & map strip & filter (/="")
    where strip = dropWhile (=='\\n')

The definition of strip is hoky, relying on the fact that newlines can only appear at the beginning of a string because of the way that the regex engines inteprets my splitter regex. I thought that some utility library I had installed had a real strip, but this turns out not to be the case, and the only definition on Hackage is part of a gigantic utility library that I don't want to chance installing right now. Anyway, with a tokeniser defined, I add it to tuplify:

tuplify [id,spam,name,email,url,comment] = (read spam :: Int, tokenise comment)

Then I divide the comments into spam and ham (first throwing out any still on the fence at 0):

import Data.List (partition)
readCSV =
  parseCSVFromFile "evo_training.csv"
  >>= fromCSV
  & map tuplify
  & filter (fst & (/=0))
  & partition (fst & (>0))
  & return

Then I drop the scores and count all the spam/ham words into a histogram:

readCSV =
  parseCSVFromFile "evo_training.csv"
  >>= fromCSV
  & map tuplify
  & filter (fst & (/=0))
  & partition (fst & (>0))
  & both (concatMap snd & histogram)
  & return

histogram is already defined in my utility library (so are both and & in case you were wondering):

histogram :: (Ord a) => [a] -> Map.Map a Int
histogram l = foldl' (\\ m x -> Map.insertWith' (+) x 1 m) Map.empty l
both f (x,y) = (f x,f y)
infixl 9 &
f & g = g . f -- also called >>> in Control.Arrow

step 5: scoring

Now all I need to do is come up with a single score for both: Paul Graham defines this in terms of the usage rates in both spam and ham. His definition seems a little wonky to me, but I copy it for now because it at least worked for him. For each word w in the two histograms I just produced, rs = ws / spam-emails and rh = wh / ham-emails * 2. The probability of spam is then rs / (rs + rh). I need to track an additional piece of information: the number of each kind of e-mail. This information is available after the partition step and disappears after the concatMap step, where the words from all e-mails are combined. So I refactor readCSV to call a new function score after the partition:

readCSV =
  parseCSVFromFile "evo_training.csv" >>= fromCSV
  & map tuplify
  & filter (fst & (/=0))
  & partition (fst & (>0))
  & score
  & return

In score I destructure the pair of lists, measure their length, then call the concatMap step. Then I create a nested function score' which has access to the lengths:

score (spam,ham) = score' $ both (concatMap snd & histogram) (spam,ham)
  where bigS = fromIntegral $ length spam
        bigH = fromIntegral $ length ham
        score' (spam,ham) = undefined

Score' turns out big and ugly, because I have to account for words that happen in both corpura, only the ham corpus and only in the spam corpus. (Words that appear in neither will be handled on the classifier side.) To make this work I chop the map keys into three sets, then call assignScore with a different function for each of them. spamonly and hamonly are simple, while bothScore uses the equation I gave above.

score (spam,ham) = score' $ both (concatMap snd & histogram) (spam,ham)
  where bigS = fromIntegral $ length spam
        bigH = fromIntegral $ length ham
        score' (spam,ham) =
          let spamkeys = Map.keysSet spam
              hamkeys = Map.keysSet ham
              spamonly = spamkeys `Set.difference` hamkeys
              hamonly = hamkeys `Set.difference` spamkeys
              spamham = hamkeys `Set.intersection` spamkeys
              assignScore f set =
                Map.fromList $ Set.toList $ Set.map (\\ k -> (k, f k)) set
              bothScore k = ((rs / (rs + rh)) - 0.5) * 200
                where rs = min 1.0 ((fromIntegral (spam Map.! k)) / bigS)
                      rh = min 1.0 (2 * (ham Map.! k |> fromIntegral) / bigH)
          in Map.unions [assignScore (const 100) spamonly,
                         assignScore (const (-100)) hamonly,
                         assignScore bothScore spamham]

step 6: write CSV

I need to turn each row back into a list of strings so that I can write it to disk as CSV. That's easy, although it seems like boilerplate:

columnify (word,score) = [word, show score]

Now main is simple:

main = readCSV >>= map columnify & printCSV & writeFile "evo_classifier.csv"

This produces the final code:

import Text.CSV (parseCSVFromFile, printCSV)
import Data.List (partition)
import qualified Data.Map as Map
import qualified Data.Set as Set
import Text.Regex (splitRegex, mkRegex)
import Util
main = readCSV >>= map columnify & printCSV & writeFile "evo_classifier.csv"
readCSV =
  parseCSVFromFile "evo_training.csv" >>= fromCSV
  & map tuplify
  & filter (fst & (/=0))
  & partition (fst & (>0))
  & score
  & Map.toList
  & return
fromCSV (Left parseError) = error (show parseError)
fromCSV (Right rows) = filter (length & (==6)) rows
tuplify [id,spam,name,email,url,comment] = (read spam :: Int, tokenise comment)
columnify (word,score) = [word, show score]
tokenise =
  splitRegex (mkRegex "[^a-zA-Z0123456789$'-]") & map strip & filter (/="")
    where strip = dropWhile (=='\\n')
score (spam,ham) = score' $ both (concatMap snd & histogram) (spam,ham)
  where bigS = fromIntegral $ length spam
        bigH = fromIntegral $ length ham
        score' (spam,ham) =
          let spamkeys = Map.keysSet spam
              hamkeys = Map.keysSet ham
              spamonly = spamkeys `Set.difference` hamkeys
              hamonly = hamkeys `Set.difference` spamkeys
              spamham = hamkeys `Set.intersection` spamkeys
              assignScore f set =
                Map.fromList $ Set.toList $ Set.map (\\ k -> (k, f k)) set
              bothScore k = ((rs / (rs + rh)) - 0.5) * 200
                where rs = min 1.0 ((fromIntegral (spam Map.! k)) / bigS)
                      rh = min 1.0 (2 * (ham Map.! k |> fromIntegral) / bigH)
          in Map.unions [assignScore (const 100) spamonly,
                         assignScore (const (-100)) hamonly,
                         assignScore bothScore spamham]

Next time: I'll create a antispam plugin skeleton, split comments into tokens, and look up the tokens in the training database.

Tags: , ,

Programming

Writing a spam classifier for b2evolution (part 1)

by sandersn 16. October 2010 15:07

My blog is swamped with spam, and b2evolution has a spam plugin framework, but no plugins for classifying spam! It ships with a lame unintelligent one that doesn't even use the comment text. So I am going to write one myself. This involves two parts: one-time training and server-side classification. I intend for the training to be the most complicated, because the classification will need to be written in PHP.

Step 1: download the existing comments for training.

I use CSV, because the output of CSV parsers (when they work) is fairly simple.

Step 2: remove/add columns and tokenise

At this point, I have two options: the built-in csv library of Python or Text.CSV on Hackage. Cabal is still working after six months of no updates (shocking, I know) so installing Text.CSV is no problem. I decide to experiment with both. Both require a little trickery: neither understand the non-standard \\" escaping used by the phpMyAdmin SQL-to-CSV dump. The resolution is fairly simple for both. In Python, add escapechar:

rows = list(csv.reader(open("evo_comments.csv"),
                       delimiter=',', 
                       escapechar='\\\\'))

Text.CSV doesn't have any options; it just implements RFC 1480. Welcome to Haskell, where we value correctness over flexibility! (And both over simplicity, but not in this case). Here the fix is to replace \\" with "":

$ cp evo_comments.csv evo_comments2.csv
$ emacs evo_comments2.csv
M-S-5: \\" -> ""

Since I prefer Haskell for data processing, I'll continue with Text.CSV. I start with this skeleton code so I can do some exploring at the REPL:

import Text.CSV (parseCSVFromFile)
import Util
main = return . process =<< parseCSVFromFile "evo_comments2.csv"
process (Left parseError) = error (show parseError)
process (Right rows) = rows

Note that main is not a valid entry point yet since it's not type IO (). Also process is the wrong name right now. So I with rows <- main I have the comment data loaded as [[String]]. Now to retain only the data I am interested in:

  1. id
  2. spam rating (-100==ham .. 100==spam)
  3. name
  4. url
  5. e-mail
  6. comment

Plus I will add a spam field during output. I briefly consider additional fields, like author ID, which is used for members instead of name/url/e-mail. However, I guess that a NULL name/url/e-mail will be just as good an indication of non-spam, since the spammers don't get much of a payload unless they fill these fields. The easiest way to re-arrange the fields in Haskell is a giant match over a list. This is unsafe so I filter the list for only complete rows after I discover that the last row in the file is a one-field entry of zero length. This gives me the following code:

main =
  parseCSVFromFile "evo_comments2.csv" >>= fromCSV & map columnify & return
fromCSV (Left parseError) = error (show parseError)
fromCSV (Right rows) = filter (length & (==17)) rows
columnify [id,_,_,_,_,name,email,url,_,_,comment,_,_,_,_,spam,_] =
  [id,spam,name,email,url,comment]

I am running out of time for today, so I decide to skip the tokenisation and move on to step 3: write code to spit back out to CSV again. After I have a reduced CSV, I'll manually tag the spam for training purposes. I'll do the tokenisation just before the training stuff.

Step 3: Spit the data back to CSV for annotation

Writing the data back to CSV is simple: just use Text.CSV.printCSV combined with writeFile. I rename my former main to readCSV and create a real main that can be used as an entry point. My full code looks like this:

import Text.CSV (parseCSVFromFile, printCSV)
import qualified Data.Set as Set
import Util
main = readCSV >>= printCSV & writeFile "evo_training.csv"
readCSV =
  parseCSVFromFile "evo_comments2.csv" >>= fromCSV & map columnify & return
fromCSV (Left parseError) = error (show parseError)
fromCSV (Right rows) = filter (length & (==17)) rows
columnify [id,_,_,_,_,name,email,url,_,_,comment,_,_,_,_,spam,_] =
  [id,spam,name,email,url,comment]

Next time: step 4: tokenisation and step 5: training.

Tags: , ,

Programming

First impressions of Expert F# 2.0

by sandersn 11. September 2010 14:39
I've read one chapter of Expert F# 2.0 and skimmed four so far. For an 'expert' book, the first four chapters were pretty much the basics of the language.
  1. The best part so far has been the asides and advice. They are pithy and make interesting points that are not purely didactic, but they also explain how to use a feature instead of just what the feature is.
  2. The rest of the book looks good, too, I just haven't got to the parts where the main text is actually advanced.
  3. The type-setting looks like it was done by monkeys. Off of a proof written in Word! If this is what paper books look like these days, give me e-books for sure. The main font isn't so bad, it just looks vaguely stupid. But it's sized up and down subtly in places, and those places look super messy. The letter spacing is all off. Plus, the code examples are in Consolas, which is an awesome screen font and a dumb print font. And the code examples are randomly italicised--I think it's meant for FSI output, but the script went wrong.
  4. F# has a lot of duplication. It's got to be super-confusing for people who don't already know both Haskell/Caml AND C#/Java.
  5. The units-of-measure feature is basically newtype with special features for numbers. Since, with a module system, half of newtype's uses go away, that works out pretty well.
As mentioned above, I'd recommend the Kindle version or the ebook that Apress sells. You're not losing any beautiful typesetting on this one.

Tags: , , ,

Books | Programming

Code as data outside Lisp

by sandersn 19. June 2010 07:10

A recent Stack Overflow question is "Do you know of a language with Static Type checking where Code is Data?

In other words, is the untyped nature of Lisp integral to code as data?

The answer is no and no. First, there are typed Lisps with macros, as well as other typed languages with similar macro systems. But they tend to be inconvenient, because typing is a bolt-on to Lisp and it's hard to type homogenous lists of symbols, so typed languages tend to supply complicated parse trees instead.

Second, code as data does not *have* to look like a macro system based on quoting homogenous lists of symbols. In fact, outside Lisp, most languages uses closures to manipulate code as data instead. The methods are quite different, but the tasks are often similar.

Here <a href="http://lispm.dyndns.org/">Rainer Joswig</a> disagrees with me. He is the <a href="http://stackoverflow.com/badges/141/lisp">highest rated Common Lisper</a> on Stack Overflow, so it's fair to say he's an expert on Lisp. I think that the crux of our disagreement is that I say code-as-data need not involve Lisp-like quotation, and hence macros.

--------

`quote` is *also* for literal data, but the question is about code-as-data. That's a subset of `quote`'s uses in Lisp, but in other languages, practical code-as-data looks quite different, typically involving manipulation of closures. Imitating Lisp's macros is not practically useful because there's no (known) way to make them convenient for use outside an s-expression language.

The relation of macros to the question is getting complicated and tenuous, so I posted on my blog. I would appreciate further discussion there, though.

http://www.sandersn.com/blog/Code as data outside Lisp[]()

---------

I am thinking of this in a task-based way based on the question: if you want to do some transforms on code in a non-Lisp language, you need a way to capture the parse tree somehow. All the ways I listed allow you to do this, but only (as far as I know) F# quotations return the parse tree at runtime. In a Lisp, to capture the parse tree, use `quote`.

The reason macros are relevant at all to this discussion is that they provide a convenient way to manage this quotation, and in the place that you'd most want to manipulate parse trees: for extending the compiler. That's why, without additional support, F#'s expressions are a curiosity useful only on rare occasions, perhaps when you need to control the entire compilation pipeline.

Mainly, that's because higher-order programming replaces many uses of quotation, including most of the uses that macros are good for.

--

Note: Rainer's site is pretty cool. It says

Tags: , , , , , ,

Programming

F# Warts

by sandersn 17. June 2010 09:04

Now that I've been using F# for real code for almost a month, I've noticed some warts. That's pretty fast for a new language, but I am coming from Haskell, which is very similar. So some of these are warts on Haskell too, and others are complaints that F# is not Haskell.

Warts

  1. let for top-level functions instead of top-level pattern matching. Haskell's pattern matching is the right default, but F# is too much an ML to change it. (I bet that Don Syme thought seriously about it at some point, though.)
  2. F# is a huge language because of .NET/ML dual heritage. There is massive redundancy so that both .NET people and ML people can pick up the language easily. It evokes memories of C++. At least F# 3.0 is unlikely to bolt on a whole NOTHER programming paradigm. I hope.
  3. As a side effect of the dual heritage, type syntax is not iconic like Haskell's. Again, I'm sure the F# designers considered a change, but .NET programmers are already used to list<int> and, unlike Haskell, you also have the related types int[] and seq<int>.
  4. Another side effect of the dual heritage is that there are LOTS of keywords. There are a lot to memorise and usually two or three matching keywords to type every time you use them. while/do, for/in/do, match/with, etc.
  5. Infix functions and prefix operators are not as convenient as Haskell. This is really minor, but F# doesn't have operator sections, and you have to fake the infixing backquotes with a reverse/pipe operator pair. Given the other syntax borrowed from Haskell, I had hoped that at least operator sections might make it in.

Those are mostly syntactic warts that I noticed. Nothing major, and CERTAINLY much improved from O'Caml, whose designers have paid very little attention to syntax. The semantic warts are a little more serious.

  1. The top-to-bottom, left-to-right compilation and baroque type system* restrict the order of expression somewhat.
  2. The monomorphism restriction means a lot of lambda-lifts (or type annotations).
  3. Even more lambda lifts occur because methods and properties are not functions (and the type checker gets dramatically dumber around OO constructs).
  4. A baroque OO type system instead of type classes, plus a love of explicitness, means that monads are much less generic/polymorphic than in Haskell. Theoretically, this is a huge demerit, because it makes it hard to write, for example, a truly generic mapM or sequence. Practically, however, monads are used less in F# anyway, and F# prefers explicitness—witness how frequently modules are used for namespacing compared to Haskell. So this is a matter of taste.. Have you heard of the BASIC monad? Typeclass abuse in Haskell can be quite scary.**

Examples

  1. No top-level pattern matching:
    let rec merge firstArg secondArg =
      match firstArg,secondArg with
      | [],[] -> ...
  2. Dual .NET/ML heritage:
    // struct
    type Point3D =
      struct
        val x: float
        val y: float
        val z: float
      end
    // record
    type Point3D = { 
      x : float; 
      y: float; 
      z: float; 
    }
  3. Non-iconic types: list<int> instead of [int] and int * string instead of (int,string).
  4. Too many keywords: while/do, for/in/do, match/with, class, struct, do, do!, yield, yield!, let, let!, return, return!, fun, function, try/finally, try/with...it's a good thing that the indentation-based light syntax at least gets rid of the block-ending keywords.
  5. No infix/prefix flexibility:
    Seq.append (Seq.filter ((=) foo) x) seq2
  6. Strict order of compilation:
    let f l = l |> Seq.filter ((<>) foo)
    let main = printfn "%s" (f [1;2;3])
    // `main` has to appear after `f` since it references `f`. 
    // But that means the type of `l` can't be (completely) inferred 
    // `because `f` has no context for it.
    
  7. Monomorphism restriction:
    // both fail because generic functions as values are not allowed
    let f = g >> h >> i
    let csv = sepBy (pchar ',') (many anyChar)
  8. No 1st-class methods/properties:
    // both fail because neither properties nor methods 
    // can be used as functions
    let xvalues = Seq.map Point3D.x points
    let allsplits = Seq.map (string.Split [| ',' |]) filelines
  9. No generic monads:
    let sequence : jQuery15207476938851177692_1327176005662? = ???
    I'm pretty sure even if you could write generic code for sequence, the type is impossible to express in .NET: list<'m<'a>> -> 'm<list<'a>>.

Workarounds

  1. No top-level pattern matching:
    let rec map f = function
    | [] -> []
    | (x::xs) -> f x :: map xs
    // NOTE: Example only! Do NOT write List.map this way.
        
    The last argument is the most likely to be pattern-matched anyway, probably about 90% of the time.
  2. Dual .NET/ML heritage: Same as C++: Learn the whole language, then subset.
  3. Non-iconic types: Get used to it. It's not that bad. Or just use arrays everywhere; you don't usually write array<int>.
  4. Too many keywords: Subset! But this is not much of a workaround.
  5. No infix/prefix flexibility:
    Seq.filter ((=) foo) x |>Seq.append<| seq2
    But that only gives you infix functions, it doesn't make prefix operators any prettier.
  6. Strict order of compilation: Put everything in a class—I think? Even if this solves the problem, it's not a natural solution for some problems.
  7. Monomorphism restriction:
    // make a syntactic function or annotate type
    let f x = x |> g |> h |> i
    let csv : Parser<char,unit> = sepBy (pchar ',') (many anyChar)
  8. No 1st-class properties or methods:
    let xvalues = points |> Seq.map (fun p -> p.x)
    let allsplits = filelines |> Seq.map (fun s -> s.Split [| ',' |])
    Note that type inference doesn't work unless the sequence being mapped are to the left of the function wrapping the property or method. There has been a tiny bit of discussion about specific syntax for this, so that something like this would be possible:
    let xvalues = points |> Seq.map Point3D#x
    let allsplits = filelines |> Seq.map (#Split [| ',' |])
    
  9. No generic monads: I don't know that there is a workaround for this. I thought I had blogged earlier about a workaround, but I can't find it now. In any case, this may not be a practical problem, since F# is generally used for larger systems with more developers than Haskell. There, a extra explicitness/verbosity is a decent tradeoff.

So far I like F# quite a bit. It's really easy to pick up if you already know Haskell and/or O'Caml, and state is convenient at times, especially in an eager language. The semantics are solid. I'm really happy that Microsoft includes F# with Visual Studio now. They're pushing the idea that the mainstream language doesn't have to be a over-simplified lowest-common-denominator. *It's a perfectly good type system—for an OO type system. For a functional language (Hindley-Milner style), though, an OO type system adds a lot of complexity without giving as much as power as type classes (for example). **I can't find it now, but the basic idea is this: make a Num instance that wraps the Continuation monad, storing the "line number" in a map Int -> Continuation. Define a function gotothat takes an int and looks up that continuation, calling it. I think. That's conjecture, because I never saw the code, and I can't find anything online. I think it was a joke going on the one time I stepped into #haskell.

main = do
  100 (putStrLn "Hello World")
  200 (goto 100)

Tags: , ,

Programming

Developing Fing, part 4: command line interface

by sandersn 16. June 2010 08:33

The previous entry in this series finished (sort of) the parser. Two things have been hurting my motivation to continue blogging. First is that I needed to blog about the F# version of the parser, since the version I've already written about is Haskell. I don't really want to write about the same thing twice. Second, and more important, is that I was busy working on the actual code instead of blogging about it. There is a decent version online at http://github.com/sandersn/fing. It's still missing three major features (web app, subtyping and type aliases), but it works for everything in FSharp.Core. Now I'm at the point of bringing tests up-to-date with the code, so, you guessed it, I have more motivation to blog again. But I still don't have the gumption to talk about the parser. I'm sick of parsing for now, so maybe I'll write it up in six months, or when I have time to revamp the core token parser, which is super ugly. Instead I'm going to write about simple stuff: the command line interface. It's straightforward, but there is at least one interesting principle in there. I'll work up to the more complicated parts of the code later.

module Main

[<EntryPoint>]
let main args = 
  let args,argmap = Opt.parse args
  let getrefs s = Seq.choose id (Opt.mapGet s argmap Seq.empty)
  let references = getrefs "r" |>Seq.append<| getrefs "reference"
  Fing.addReferences references
  match args with
  | [| t |] -> Fing.textSearch t
  | _ -> printfn "Fing is F# API Search. (rest of usage message...)"

There isn't much surprising here: first I call Opt.parse to produce a list of arguments and keyword arguments. Then I define a function to arguments matching a string s, and use it to build a sequence of "-r" and "--reference" arguments. Well, wait a minute, that's kind of weird, a nested function. I could have done it this way:

  let rs = Seq.choose id (Opt.mapGet s argmap Seq.empty)
  let references = Seq.choose id (Opt.mapGet s argmap Seq.empty)
  Fing.addReferences (rs |>Seq.append<| references)

And that's not badstyle, it just isn't ideal. In the second way, I name the two values, repeating the code that generates them. In the first way, I name the code that generates the two values and repeat only the name I gave it. There's a little more repetition the second way, with not much benefit. (I also get a chance to show off the infix-function trick that I discovered. I'm still very proud of that.) This illustrates one of the subtler aspects of functional programming. The three aspects of functional programming that I think of first are:

  1. Immutability
  2. Functions as data
  3. Functions everywhere

Well, 'functions as data' looks like a subset of 'functions everywhere', but both have a specific meaning. 'functions as data' refers to uses of functions that doesn't occur at all in imperative or OO style: passing them around, saving them in data structures. 'functions everywhere' refers to the size and frequency of functions, which is far higher than in imperative or OO styles. Even though the ideal for all styles is many small functions (or methods), functional languages make it natural to achieve this ideal. It is definitely not natural in normal imperative, structured code, and all the "short methods are good for you" advice books indicate to me that it's not natural in OO either. In any case, if you compare functional languages and OO languages, I think you will find a much greater percentage of the former make short function definition natural. So, to finish up:

  match args with
  | [| t |] -> Fing.textSearch t
  | _ -> printfn "Fing is F# API Search. (rest of usage message...)"

There's not much to say here, except that I am glad pattern matching works on arrays and I wonder why it doesn't work on sequences. I can think of two reasons: first, seq (aka IEnumerable) is an interface, and pattern matching is a concrete operation. Second, it might be inefficient to pattern match the tail of a sequence. I don't grok the efficiency model of iterators, but I think that, because they're not defined recursively, a pattern-match, which treats them as recursive, must create a new enumerator to point to the tail of the matched sequence. Regardless, it's an inconvenience and likely one that can be fixed with an active pattern. That's another F# feature I still need to learn.


Post-script: I had to write Opt.parse myself. I couldn't find one built-in to .NET. Maybe I was just didn't search hard enough, but all I found was half-baked code online. I started with one example, intending to translate it to F# in order to understand it. Once I understood it, I saw that it was buggy and incomplete, and wrote my own from scratch. (I just want back, read the comments below the code, and found a link to a reflection/annotation-based library called Consolery. Nothing in the standard library though. Anyway, continuing with my earlier theme of "I'll blog about whatever I want to blog about", I'll probably talk about Opt.parse next. You might be able to help me iron out some of its ugly parts.

Tags: , ,

Language | Programming