Nathan Sanders : Journal*

First Entry: 18 May 2004 Anno Domini

View 2007 Ascending chronological order Random order
Sorted order Reverse sorted order Rot-13 Encrypted
Reverse order Vowels removed All of the above
RSS Feed View all comments View all

4 Sep : I have installed a Real Blog (b2evolution) over at http://www.sandersn.com/blog. I do not like it much so far, but it has lots of features. Among them: comments. I will see if I can turn the rest of the features off, and perhaps remake the comment system so that it agrees with my philosophy. As I can't stand edit boxes, I will continue to use Emacs, meaning that this file will continue to grow, although perhaps in a more haphazard way.

1 Sep : A quick note about naming conventions. In imperative languages, functions are always verbs (with some modifiers). This is because they DO something. A lot of them return something too, but most of them DO something and return nothing. This is different from functional languages where functions RETURN something and rarely DO anything. The line between code and data is in a different place than in an imperative language. So the naming conventions change accordingly. Functions are often nouns.

This is nowhere more obvious than 'get'. 'get' is ze prefixe de rigeur in Java convention because all data is encapsulated in some object and only methods can give you access to that data. In a functional language, however, data is named by a noun, whether or not you have to call a function to get it. And some languages, like Haskell, make it rather unclear whether a zero-argument function really is just data in the first place.

For example, I just wrote a function called find_unaligned_chars. But after a little reflection (and a little line-wrap) I changed it to unaligned_chars. Of course you're finding them—what would be interesting is if you deleted them or appended them to something. The 'find' verb is just redundant in this case because it's the default case for functional programming. By the way, here is where unaligned_chars is used.

def dep_repair(i,o):
    return [''.join(remove_ns(o, map(fst, cs)))
            for cs in powerset(unaligned_chars(i,o,INS))]

the last line is read "for cs in the power set of the unaligned chars between i and o (with respect to the INS operation)".

Comments

31 Aug : Here are the links to the papers I promised. The second one is the one you are probably interested in because it tells how to build a bounding tree. However, if you are new to Optimality Theory but understand math, the first paper provides nearly all the background you need and explains the relation of bounding sets to the rest of OT very well.

Optima — Vieri Samek-Lodovici and Alan Prince [ROA 363]

Fundamental Properties of Harmonic Bounding — Vieri Samek-Lodovici and Alan Prince [ROA 785]

In case you don't know, the Rutgers Optimality Archive (ROA) is the main place for people to upload their papers on Optimality Theory. It's like the most advanced online thing linguists have ever done. Well, at one time. The rest of the linguistic world is catching up. I found lingBuzz today and it looks interesting.

Comments

30 Aug : I finally needed a powerset implementation today and when I rewrote it this time it came out as one line:

def powerset(l):
    return ([x for i,x in enumerate(l) if 2**i & n] for n in xrange(2**len(l))) 

So that's pretty cool. The real reason I'm writing today is to record

How to Stay Sane on Mac OS

The new professor just got a new iMac and wanted to know some tips and tricks for use. These are the ones I use all the time. Your mileage may vary.

  1. Cocoa apps provide Emacs keys, the Ctrl-based half anyway. So does bash and Emacs. If you use these three pieces of software you will always have on hand a powerful set of key bindings. This requires the use of Safari or Camino, by the way. But Firefox is so ugly and buggy that you're better off skipping it anyway.
  2. There is an installable equivalent for vi keys if you want those.
  3. You can bind or rebind any menu item to a shortcut key. If you like to Zoom a lot, you can bind it for all apps. Or if you need to Revert a lot in Preview (to refresh the view from the filesystem), you can bind items from a single app.
  4. Control + scroll wheel zooms the screen. This is cooler than it sounds when it's bound to something so simple. Supposedly in 10.5 the zoom will be vector-based, but currently it gets fuzzy after a certain level.
  5. If you want virtual desktops without having to wait for the built-in ones in 10.5, there are (free?) packages out there that already do this. I don't use them so I don't know what they are. Sandra does, however.
  6. Don't use the Dock to view or start programs. Use Cmd-Tab to view running programs and their notifications, and use Spotlight (Cmd-Space) to search for apps. Press Cmd-Return to open the top hit.
  7. Alt-Cmd-D makes the dock auto-hide.
  8. Applications don't close when you close the last window. (Exceptions: cross-platform GUI libraries like Tk, Wx and Java) This is useful for single-window(-ish) apps like iTunes and Mail that have a background function but don't necessarily need a window visible at all times.
  9. Set the keyboard to require Fn for multimedia capabilities so that the F-keys are immediately accessible instead.
  10. Also, turn on keyboard access to all controls.
  11. Now Control-F2 gives access to the menu. To open a partiular menu, type a prefix of its word and press enter (one letter often suffices). Type a prefix of some menu item and press enter to select it.
  12. Because of GPL issues, GNU readline isn't installed on Mac OS. You have to install it yourself—I think fink is the easiest way to do that.
  13. Swapping Control and Caps Lock is highly recommended. Maybe even get rid of Caps Lock entirely.
  14. Relative to Windows and Linux, Cmd takes over most of the work of Ctrl and Alt, Ctrl is used for emacs keys (see above) and Alt is used for special characters. If you download Ukelele from the SIL, you can create your own custom keyboard that provide whatever special characters you want, which is very useful for linguistics. (Note: Because of a bug in Aquamacs, I had to drop in my custom keyboard in place of one of the existing ones. Aquamacs would switch back to an 'official' layout whenever focus shifted to the mini-buffer so that I could type in a command.)

Comments

28 Aug : I had to use Word all morning and it remains, even in its Mac version, a flimsy and easily breakable tool. I'm not entirely sure why it's so popular—it must be designed for tasks quite different from what I am normally trying to do. I had to edit a paper that was written to the APA stylesheet. This was not too difficult: the APA draft standard is still quite complicated, but none of the styles themselves are that complicated. But it is a royal pain to edit the figures. I had to add a new one in the middle, thereby re-ordering all the following ones since Word doesn't know that the list of figures and the figures themselves are connected. What's more, since each figure is on a separate page, the cursor jumps wildly from page to page and more often than not you're staring white space with a corner of a table. Collapsable white space was implemented in Word 2003, but it didn't migrate over to Word 2004 on the Mac.

Then I offered to help Stuart add in some lines to his paper on moras and syllable structure. This was important because he had made all of his diagrams using nothing but \|/. Yup, that's right, ASCII art in 2007. Yet he rightly pointed out that the publisher may or may not support the lines drawn with Word. So I started drawing lines in the longer places where slashes wouldn't reach. Of course this has lots of problems, the principal one being that all the lines are subtly different. I tried to avoid this by copying lines that I judged to look good, but this was limited by the fact that the ASCII art was never the same—remember the underlying base is variable width. Even when I could re-use some lines, the copy/paste interface is idiotic: when you copy, then page several pages down, and click to move the cursor there, pasting STILL pastes right beside the originally copied line. The only way to get the line several pages down is to drag it there, a laborious process that messes up from time to time.

And now for something completely different.

You may know that "ironic t-shirts" are not actually 'ironic' in the older sense of the word—they are actually surprising in an insulting way or something. I don't know a good word for that, so 'ironic' seems like a good choice. Especially since irony isn't used much round these parts. But you may not know that I commonly wear shirts that are ironic in the older sense. Yes, it's true! Here's a short list:

  1. A shirt promoting Linux. Ironically, I do not like Linux and only used it in passing for about six months. Also, there is a slogan in French on the back, which makes it double as a protest shirt (assuming that you are protesting something to do with France, to those who don't read French).
  2. A shirt promoting a resort in Cancun. Not only have I never been to Cancun, but (ironically) my mother bought it for me at a yard sale. Doubling the irony is that many IU students DO frequently go to Cancun, making me temporarility indistinguishable from them.
  3. A shirt promoting SIRSI's online interface, iBistro. Ironically, I think SIRSI is one of the worst programs ever written and that iBistro is the worst part of it. Even more ironically, the i prefix implies one of two things. Either I enjoy stupid iPod knockoffs (or am selling one), or I worked at a startup during the dot-com boom and all I got was the t-shirt. Neither are true.

27 Aug : If you are still following the puzzle posted below (that is, you are Josh Rose), here is the link to Arto Anttila's page on t-orders. I suggest the second and fourth handouts as easiest to extract the examples from. Especially the fourth has a gigantic example that should be a good stress test.

22 Aug : Well, it turns out that making that tree diagram is stupendously boring, so I'm going to post the giant GIF (warning, it's like 3000 pixels square, about 8 print pages) (also available as PDF if you're not an Adobe user) with the first two major branches of the tree filled in. I'm sure that you can extrapolate the other three if you want. So here is the setup for the puzzle. A full explanation will follow in a few days, along with my current code, which may or may not work by then. The basic setup is this: the output, the "answer", is the graph in the top-left of the picture. The input is in the root node of the tree (labelled '0'). It's a mapping from segmented words to 'violations'. The violations are actually generated from the words themselves, but that's a whole nother can of worms. For our purposes, just trust me that the lists of numbers are the important part and that the words are just a nice way to give readable output.

In fact, it is possible to solve this problem very inefficiently by trying all permutations of columns and evaluating the results according to an odd algorithm that is not quite specific to Optimality Theory, but this is precisely what I am trying to improve upon, so I have hypothesized that the answer can be extracted from this tree instead. So I should tell you how to build the tree:

How to generate the tree: At each node you may stop if there is either one column or one row left (I will sometimes call the columns 'constraints' and the rows 'candidates' since that is what they represent in Optimality Theory). Otherwise, create as many child nodes as there are columns. For each child node i, look at the ith column. The winners are those candidates (rows) with the minimum for the column i. This problem set is easier because it's all 1s and 0s, so you just choose the 0s. The losers are the ones with 1s (non-minimum value). Ordinarily these are not that important, but I am recording them in the diagram because I think they probably are in this case. You have been hinted! Anyway, eliminate column i and repeat the generation algorithm for this child i.

Continue doing this until all nodes have only one winning row or one column. At each step, to generate the "bounding profile" for that node, take the minimum+1 of each column, unless all values are the same, in which case it's just the minimum. The bounding profile is the second box in each node in the diagram, but I don't think it is important for this problem, although it might be. To make things a little easier to track on the diagram, I labelled each node with the rows that have been removed (counting from 1, not 0, sorry). HOWEVER, I have only completed child 1 and 2 of the entire tree (though I included a couple of nodes from 4's branch). That should be enough to get an algorithm that works, especially for the left sub-graph of the answer.

Now, if you look at the top-right corner, you'll see my current answer. The green and red links are the ones in the answer; my algorithm currently misses the red one. This is because "cost][me" and "cost" have the exact same violation pattern—I think this is as good as you can get without considering the inputs /cost me/ and /cost/ separately. Which I may have to do pretty soon, in which case this giant diagram might get split up. A couple of notes:

  1. Note that the link from "cost][us" to "cost" is actually in the real answer, but all you see is the transitive closure of the real answer: if you can get from one node to another indirectly, then the direct link is deleted.
  2. All nodes with the same *set* of columns removed are identical and shouldn't provide any additional information. This would make a cool optimisation (Reallly Simple Memoisation), but I haven't got around to it yet. For example, node 1-2-3-5 is the same as node 1-3-2-5 and 1-3-5-2. Err, at least, I think. Now that I check I'm not completely sure. Maybe it's only true at the leaves? Well, that's why it's an optimisation!
  3. I will post the background reading for this as soon as I find the PDF on my computer. I have a printout I made, but I can't find the original file! I may not have downloaded it to this computer. I hope to have it by tomorrow or Saturday.

Side note: Apparently it is bad juju to let your computer fall to sleep with a USB flash drive attached. Who would have thought? The Finder AND bash told me everything was still connected but hung while trying to actually open anything. I had to disconnect the drive and suffer the placebo 'data damage' warning to get things going again. Bah Humbug!

21 Aug : I have an interesting problem, a puzzle, for you that I am going to post in a few days. I hoped to do it today, but I really want to finish drawing the tree diagram that makes the solution much easier. Actually, I *think* I've got the solution, but I haven't written and run it yet to make sure. So it may turn out to be the moral equivalent of posting my homework online and waiting for someone to do it. Well, except nobody has ever solved this homework and I intend to publish the answer when I get it. So, uh, if you happen to come up with a better solution than me I'll use it and put your name in the acknowledgements or something.

Comments

18 Aug : There are about three ways I use a REPL, some of which are less obvious than others. (By the way, REPL means "Interactive Mode" and stands for Read-Eval-Print Loop.)

How to use a REPL

  1. As a way to explore the language, to see what is legal and how the basics work, as well as to explore APIs. The best kind of REPLs for this have excellent Intellisense in addition to good error reporting. REPL demos highlight this use of the REPL as they build a GUI at the command line and alter it at runtime.
  2. As a unit-testing tool. Evaluate a single function and make sure it works using various inputs. Or evaluate your tests and investigate the failures without re-running the tests. This use stresses the REPL's ability to load snippets of code.
  3. As a way to run the application. This is not appropriate for most people because it limits your users to programmers, and only those who are willing to load your source code. This use stresses the REPL's ability to properly load whole systems.

The first use is the easiest and most obvious, as well as the first you're likely to encounter simply because you usually learn a language at the same time you discover that it has a REPL. One of the nicest features for exploratory use, besides obvious error messages, is Intellisense command completion. Emacs doesn't have very good support for this, so it's been a while since I've used one that did. I think wxPython's PyCrust is good, and F# Interactive might also have Intellisense. Haskell doesn't support definition of types at the command line, which cripples its exploratory use.

The second use is a bit stickier in terms of usability. If you define a function from a file, you need to the ability to redefine it easily when you change your mind. Ideally, the REPL will support both definition from snippets (in effect, an automatic cut-and-paste) as well as loading from files. Many REPLs also support two ways of loading from a file as well: one way is to paste the whole file without regard for namespacing niceties, and the other is to import the file in the way a compiler would.

The third I have only found myself using for research where I am the only user, and only in Emacs-supported Scheme, Common Lisp or Python REPLs, all of which support whole-program loading with some degree of facility. (Scheme the least of the three). This is hard to get right: Python's system is simple but has problems completely reloading sometimes, while Common Lisp's is confusing but of course works nicely once you've understood everything. This use is so hard to implement and useful for so few people that I can see why it's so rare. But I really like it because instead of an HTML output file, I get a live data structure that I can play with. I do have problems remembering to record all the changes I've made sometimes, though.

Note: My experience has been shaped primarily by Scheme, Common Lisp, Python, OCaml and to a lesser extent Ruby, Haskell and F#. If you've used REPLs for BeanShell, Javascript, Scala, Groovy (shudder), your experiences may differ. Please comment after I get the commenting problems sorted out.

Comments

13 Aug : Once again Caml shows that it wasn't designed but just sort of stuck together. It has two working ways to annotate separate compilation units with types. The first is to annotate the types inline, which is horrendously ugly due to several mistakes in the 'design' of the syntax:

let add_first (number : int) (i::is : int list) : int = i + number

All those parentheses are probably why SML coders prefer to curry their functions so that they can use commas to separate arguments like most languages do. Even so, I don't like all those types making my lines super-long. I'd prefer the type annotation to be on a separate line, which also makes it easier to kill if you change a bunch of stuff and want the compiler to infer the type for you.

The other possibility is to move the type annotations to a separate 'signature' file. Does this remind you of something? Could it be C++? Why yes I believe it is. I hate having to open a separate file to maintain and view types. But there is a third solution...it just happens not to work. You can declare a module inside your code file and stick the signature, which contains the type annotations for all your code, at the top of the module.

The problem is that this is a NESTED module, so you end up saying stupid stuff like Constraints.Constraints.id_voice. Does this remind you of something? Could it be Java? Why yes I believe it is. Fortunately, you can say use Constraints as a stupid workaround to shave off the first Constraints. The result ends up looking very much like Python, except that it is probably quite annoying to those proficient in Caml. I know I'm not completely happy.

If you're curious, here's the code for a module containing the code in the first example:

module Adder :
sig
  val add_first int -> int list -> int
end = struct
  let add_first number (i::is) = i + number
  let private_function i = i + 1
end
(* In another file... *)
open Adder
Adder.add_first 12 [1;2;3];;

Also, I know that comments are not working all the time right now. My hosting company is having trouble providing me with the Python library to interface with MySQL. I might have it fixed soon.

Comments

7 Aug : I am about two-thirds moved in at the Wampler's upstairs apartment. The third that is missing mostly cleaning supplies and cooking implements, and I only really miss the latter (as yet). Unfortunately this week seems to be the hottest of the year—I'm dreading the cleaning at the end of the week.

I braved the lion's den today, the Parking Permit Office, and came away with Second Prize, a C Permit. I have yet to determine where this will let me park; I hope to upgrade it later to an A Permit, but it seemed like to much to attempt in one go. Especially since the bureaucrat-at-arms assured me that only teaching assistants get A Permits and so long as I remain a research assistant I Will. Not. Get one. I didn't want to argue, but my contracts for the last two years have had a clause stating that I am eligible for an A. Maybe that's changed. I got off a lot better than the guy in front of me, a new grad student in the English department. He wasn't on The List, and didn't have an Employee ID, both of which are Required. So no Permit for him! Anyway now I have to find out where C permit holders are allowed to park because I have no idea.

Meanwhile, I am running a six letter Gen right now on Jones, now that the air conditioner is fixed and I'm not worrying about frying it*. I expect it to run in about 22 hours and cons about 768 MB. Jones usually runs about 1.5 GB free, so as I worriedly sat there watching the memory usage tick upwards, I realised that I don't actually need to keep all those potential winners just to run Recursive Constraint Demotion. All you need for that is the violation profiles of the possible winners and then the form that was actually heard. Unfortunately, to get the correct violation profiles, you still need to look at every possible candidate, and I have no way to know when to improve that as yet. But I have some ideas, at least for straight Max/Dep constraints—I may end up hard-coding something to look for them especially.

Of course, I suspect the real problem is that most linguists drastically underspecify the constraint set and just assume that the example forms that they give are representative. And they usually are, even if there are jillions of nearly nonsense words that have the same violation profile for the three to five constraints they mention. This practise only breaks down if one of those nonsense words happens to be better than the actual winner, because it means that the proposed constraint ranking is wrong. Kartunnen pointed out that this actually happens, and I suspect it's a lot more often than the single example he gave.

Well, the new iMacs are out. (Finally!) As soon as I get the reimbursement from my flight to the ACL in Prague, I will be buying one. I know, they're a bad value for the money, Macs are only really compelling as laptops, blah blah. But they jive well with my philosophy so I am going to buy one anyway. Anyway I can still run Windows—VMWare released their consumer Mac virtualisation program a couple of days ago in preparation for this announcement, and Parallels has been around for a while (although reviews give better marks to VMWare). I also plan to try Linux and at least play around with it.

Also, Apple finally updated their keyboards and yes, its USB ports are FINALLY 2.0 like every other decent keyboard on the planet. It even looks better than the Dell green-light special** I have...I actually like the laptop form factor better for keyboards because it means the mouse is closer to your hand. (I would just move the mouse to the left side, but my left hand has enough trouble with wrist pain already.) The main advantage a desktop keyboard has is that the arrow keys are bigger and there are 19 (!) F keys. Plus it looks like the wireless keyboard doesn't have USB ports. So I'll probably get the wired one and buy the wireless one later if I miss the keyboard layout of my trusty Powerbook.

*The air conditioner in the comp ling room went out on Thursday/Friday and it's been hot enough to worry about the safety of the servers without air conditioning. We turned off the non-web serving one (smith) and left jones on hoping that it wouldn't meltdown.
**I have no idea why it's 40 bucks on that page. I only paid 20, which is about right. Disregard the stupid plastic "wrist rest".

Comments

2 Aug : Free For All. Chapter 5 is my favourite so far. I had to stop reading at that point.

Also, I'm making a fair attempt at learning to use viper-mode in emacs. For after-the-fact editing tasks, the vi keyset is nicer because it has finer control over greater distances. For creation of new text, it's not so great because the inconvenience of switching modes loses for quick local edits. Of course, anecdotal accounts suggest that vi keys are easier on your hands in the long run, too. Perhaps it's because the cursor keys are on the right hand (h,j,k,l) which takes pressure off my over-used left hand. At any rate, viper-mode allows me to use BOTH emacs and vi keys in the appropriate mode, which is extremely powerful and difficult to keep straight. I'm not sure I'll keep it up. If nothing else it's good practise to keep my mind flexible I guess...

Comments

30 July : Here's another quandary I ran into about strong type systems. I very commonly use a tree representation that is (node, [children]) which stems originally from learning to represent trees as (node . children) in Lisp. A leaf just has an empty list since it has no children: (leaf, []). All well and good, this is not irregular as are some tree representations in dynamically typed languages. So I was thinking to myself, "if Haskell syntax is basically the same* as Python's except nicer in places, why am I not using Haskell?" So I wrote a toy function:

tw [row] = (row, [])
tw (n:ns) = (n, map (\ i -> tw ns) [1..3]
<interactive>:1:31:
    Occurs check: cannot construct the infinite type: b = (a, [b])
      Expected type: b
      Inferred type: (a, [b])
    In the application `tw ns'
    In a lambda abstraction: \ i -> tw ns

At least the error message is better than Caml's (which I tried first). Of course this tree representation *works*, you just have to wrap it in type, give it a name:

data Tree = T (Int, [Tree])
tw [row] = T (row, [])
tw (n:ns) = T (n, map (\ i -> tw ns) [1..3])

Of course Haskell then becomes MUCH more inconvenient even than Caml because you can't declare new data types in interactive mode, and new data types are not by default printable, so you can't see the results of tw without implementing the Show interface. These are not central to the problem of course, but more than enough to remind me why I don't use statically typed languages for prototyping yet. Probably someday soon they'll catch up; Caml is already almost as good (although extremely ugly) and Haskell is pretty if a little rough around the edges.

Oh, yeah, that tree question came up because I got a super-fast implementation of Gen (from OT) working based on tree search instead a search of all subsets. Well, super-fast compared to O(n!) of all subsets. I think it's O(n lg n) but I might be wrong—that's a pretty big improvement. Well, and it's not done yet, but the part that's done works.

Comments

*Only similar for functional style. Imperative Haskell looks similar to C rather than Java like imperative Python does. (Neither are very like those other languages.)

29 July : Bonus content free day! This is why I don't do classes.

class Col():
    def __init__(self, constraints=None, vlns=None):
        self.cons = constraints or []
        self.vlns = vlns or [[]]
    def win_fav(ns):
        for n in ns:
            if n > 0:
                return True
        else:
            return False
    def tied(ns):
        for n in ns:
            if n!=0:
                return False
        else:
            return True
    def lose_disfav(ns):
        for n in ns:
            if n < 0:
                return False
        else:
            return True
    def markcol(self, i):
        return marked(self.cols[i])
    def fav_active(self, i):
        return win_fav(self.vlns[i])
### versus ! ###
# maybe someday I should create a Col class....nahhhh
positive = cur(lt, 0)
negative = cur(gt, 0)
win_fav = cur(some, positive)
tied = cur(every, lambda x:x==0)
lose_disfav = negate(cur(some,negative))
markcol = compose(marked, fst)
fav_active = compose(win_fav, snd)

Admittedly not all OO code is that bad at re-using abstractions, but most is. I especially hate the "encapsulated list" pattern visible in markcol and fav_active.

Comments

28 July : Guess what this code does without looking below to find out:

from util import *
MAX_NGRAM = 7
keys = {'1':'_,@', '2':'abc', '3':'def', '4':'ghi', '5':'jkl', '6':'mno',
        '7':'pqrs', '8':'tuv', '9':'wxyz', "*":' ', '0':"+"}
corpus = dct.count(word[:n]
                   for line in open('/Users/zackman/Documents/ebooks/Democracy in America--Vol 1.txt')
                   for word in line.split()
                   for n in range(1,MAX_NGRAM)
                   if len(line) > n)
def predict(ks):
    return sorted(map(''.join, lst.cross(*map(keys.get, ks))),
                  key=corpus.get, reverse=True)
print predict('263')[:5]

If you guessed "predictive texter", you're right. Like most good ideas, a fake prototype can be written in a dozen lines. There are many weak points that would be easy to improve. The corpus I used is over 150 years old and written by a French author. The predicter stops predicting for words over 6 letters long. All words are stored together. There is no facility for telling the predicter that part of a word is right and going from there. The predictor doesn't try to complete the whole word when you've only typed part of it. And naive counting has lots of problems that I don't even want to get into but you can read for yourself (warning: extremely boring book).

The idea I had for predicting words longer than 6 is to only pull up words in a dictionary, because frequency stops being such a good indicator for words that long. There is probably only one correct word anyway. But I don't have enough memory free on my laptop to play with that while I continue working on my REAL work.

For example, the example above (predict('263')[:5]) returns ['and', 'bod', 'ame', 'cod', 'coe'], which is good in that 'and' is the best word, but bad in that the subject matter shows through in result 3, 'ame', the prefix of 'America'. The corpus is Volume 1 of Alexis de Tocqueville's Democracy in America.

I got the idea for this from (1) getting a cell phone (2) talking about an internship/job with Tegic, who is the major provider of predictive texting. Obviously a real system is more complicated and cleverly optimised (and likely written in C++) but a 12-line prototype just makes you feel good all over.

In other news, Boy Scouts have taken over campus! Film at 11!

Comments

27 July : A couple of topics today. First I want to ridicule the 'Black Google' idea, where this guy made an inverted-colour Google home page to save energy. Except that only CRTs save energy from displaying black; most LCDs use *more* energy to display black because the backlight is on. What's more, the CRTs that are still in use are probably not generating anywhere near their fair share of Google queries because they are likely hooked up to old computers with limited (if any) Internet access. Light text on a dark background is very likely a usability problem to boot; I have an unfortunate addiction to this colour scheme when coding and it actually *reduces* battery life because you have to crank up brightness more to get the equivalent level of visibility. So to actually save energy on Google, just turn your LCD brightness down. Or switch from using a dryer to a clothes line.

Also, in my 'successful implementation of Gen' below, I mentioned that lst.cross would make you run out of memory for large values. It's pretty obvious; if your alphabet size is 45 (the average number of phonemes in English*), a word of length four has 445 possibilities. You *don't* want to store all the possibilities for a ten letter word. Unfortunately, my original recursive lst.cross does just that, generating the list incrementally:

def cross(*ls):
    """[[]] -> [[]] Return all permutations of lists in n.

    Examples:
    >>> cross([1],[2,3],[3])
    [[1, 2, 3], [1, 3, 3]]
    >>> cross([1,2,3],[2],[3])
    [[1, 2, 3], [2, 2, 3], [3, 2, 3]]
    """
    if not ls:
        return [[]]
    else:
        rests = cross(*cdr(ls))
        acc = []
        for x in car(ls):
            acc.extend([[x]+rest for rest in rests])
        return acc
alfa = 'abc'
print cross(*([alfa] * 4)) # for a four letter word

Like all recursive solutions to recursive problems the code is quite short, but in this case it is not trivial to transform the recursion to a while loop because there is a quite a bit going on after the recursive call in the non-terminal branch. I played around with a continuation-passing transformation and got nowhere. Then I Googled for solutions and found a smart numeric solution to the power set problem.

A power set is not quite what I'm looking for: it's all subsets without replacement. I'm looking for all subsets with replacement. But the idea is similar. A power set of {a,b,c} is {{}, {a}, {b}, {c}, {ab}, {ac}, {bc}, {abc}} while my problem specifies number of times to choose from the alphabet: {a,b,c} * 2 is {{}, {a}, {b}, {c}, {aa}, {ab}, {ac}, {ba}, {bb}, {bc}, {ca}, {cb}, {cc}}. The numeric trick for power set is to notice that the elements chosen from the source set follow the same pattern as bits in a binary number. So you can loop up to 2n and generate all elements of the power set.

At first I thought this trick was useless for choosing all elements with replacement. But look! Reduce the alphabet to size two: {a,b} gives {{}, {a}, {b}, {aa}, {ab}, {ba}, {bb}} while the power set is {{}, {a}, {b}, {ba}, {bb}}. It's even more obvious if you have the alphabet {0,1,2,3,4,5,6,7,8,9}; for size 2 you get {{}, {0} ... {9}, {00}, {01} ... {42} ... {99}}. That's just counting, plus some zero fills. So here's what I ended up with.

def gen_strings(start=0, stop=100):
    "100 is a hoky limit but you will still wait a LONG time to get them all"
    alfa = '0123456789'
    for n in xrange(len(alfa)**stop):
        acc = []
        i = 1
        while i <= n:
            acc.append(alfa[(n // i) % len(alfa)])
            i *= len(alfa)
        yield ''.join(acc)
        while i < len(alfa)**stop:
            acc.append(alfa[0])
            i *= len(alfa)
            yield ''.join(acc)

The first while loop just yields the normal counting numbers over whatever alphabet you're using. The second one provides all zero-prefixed variants. In other words, you don't normally count "01,02,03..." but "1,2,3...". The second while loop fills in those zeros. The down side is that you don't get the numbers in counting order any longer. But of the giant upside is that now you have an iterator that provides only as many answers as you want in constant space.

Comments

*The number of phonemes in English varies according to your dialect. For both 'cot' and 'caught', I have a single sound intermediate between Standard English's phonemes /ɑ/ and /ɔ/. So my phonemic inventory is one smaller (at least) than Standard English. Compare this to some Atlantic dialects that have MORE vowels down there.

25 July : I just finished writing (and running) a successful implementation of Gen from OT, and it was so exciting that I have to sit down and write about it right away. For those of you who don't understand OT's Gen function, but do know what Python generators are, you can probably guessed how I implemented it. Optimality Theory (OT) says that when we pronounce a word, we take our knowledge of its representation (sort of like the spelling) and generate an infinite number of possible pronunciations, choosing the one that best satisfies the constraints of your language.

That's all very nice for a theory, but the idea that we generate an infinite number of words for every word we say is ridiculous. It's even ridiculous for a computer implementation, which is what I'm making. Nevertheless, here is a naive, infinite implementation:

def gen_strings(start=0, stop=100):
    "100 is a hoky but realistic limit (because of use of lst.cross)"
    alfa = 'abdefghijklmnoprstuvwz'
    for i in itertools.count(start):
        if i > stop: return
        for s in itertools.imap(''.join, lst.cross(*([alfa]*i))):
            yield s

Well, not completely naive. I do allow the user to pass a stop value because ALL possible combinations of 100 characters with replacement is already. So it's already not infinite, just really big. Since I use lst.cross to generate the combinations, you'll run out of memory trying to generate them all anyway. Anyway, you can say for word in gen_strings(): print word and wait for a very long time before the last word prints. However, what we really want to do is look at words that could possibly appear—a lot of words are so unlike the input that they will always be beaten by some other word or set of words. Here's the resulting code:

def gen(i, o, cons):
    winners = []
    for cand in gen_strings(stop=len(o)):
        if not bounding(table(i, cand, winners, cons)):
            print cand, ":", [rcd.ot_eval(c,cand,i) for c in cons], '(', len(winners), ')'
            winners.append(cand)
    return winners

bounding returns the bounding set, the set of competing candidates that will *always* beat the current candidate word. If there are none, then the word is a possible winner and is added to the set so that it can bound. Of course this is a little lossy because there are some candidates early on that get added to winners which might be bounded by candidates added later, but this isn't a big deal compared to all the candidates we are *actually* getting rid of. On a toy problem, this eliminated all but 59 of the 506 candidates of length 2 or less and all but 59 of the 11,154 candidates of length 3 or less (which is the expected result; you can't pronounce a short word as super-long and get away with it).

Finally, this is only interesting if you follow OT, but the code is based on ROA number 363, and the definitions of bounding and table are given below.

def table(input, w, losers, constraints):
    return [[rcd.ot_eval(c,l,input) - rcd.ot_eval(c,w,input) for l in losers]
            for c in constraints]
def bounding(tab):                                       # n^2 * k-1 = O(n^3)
    while True:                                            # O(k - 1) (average case is better)
        promote, demote = lst.bifilter(rcd.lose_disfav, tab) # n^2
        if not demote or not promote: return demote          # 1
        tab = filter_rows(promote, demote)                   # n^2 (ideally)
def filter_rows(promote, demote):
    return lst.transpose(rcd.filterby(rcd.tied,
                                      lst.transpose(promote),
                                      lst.transpose(demote)))

The calls to lst.transpose are slow and only necessary to make the code prettier, so I probably ought to get rid of them.

Comments

23 July : Don't tell my dad, but I use a thesaurus a lot these days. At least it's an electronic one with good response time. I've been writing a lot lately and I have trouble thinking of sensible words. I keep trying to write "concomitant" instead of "associated" in one of my papers and I'm pretty sure even in an academic paper that's ridiculous. Then I wanted to avoid "pooh-poohed" in an e-mail because it sounds stupid and could only come up with "denigrated". Thesaurus to the rescue! Thank you, OS X Tiger! (Buy today, only $129)

Comments

22 July : Maybe you noticed the data structure at the top of my toy compiler yesterday:

(def place car)
(def code cdr)
(def make-exp cons)
(def nil '())

This is a very common way to prototype code in Lisp, but it really ISN'T a data type declaration. In fact, the equivalent Javascript is just

function place(exp) { return exp[0] }
function code(exp) { return exp[1] }
function makeExp(place, code) { return [place, code] }
var nil = null // just because I like writing nil, that's all

Now it's obvious that I'm just wrapping a two-element list in a few functions and pretending it's a data structure. Why didn't I at least use the perfectly good struct capabilities built in to MzScheme?

It turns out that in Lisp, prototyping with lists as the only data structure has some advantages that make it one of the best prototyping languages. Unfortunately, those advantages are weaknesses when the time comes to create production code. Base Scheme doesn't have many ways to switch over, although other Lisps vary.

  1. No encapsulation. This means that any function can poke around in the class even if it's not the Right Thing at that point (but is necessary to handle something you hadn't thought of). It means that the default printing mechanism does a good job of displaying your data structure without writing specialised printing functions. It means that match and let macros already know how to destructure your data structure without being taught. In fact, this is the second advantage...
  2. Practically every built-in function in the language works on lists, meaning you can put them to use in clever and hacky ways if need be. Although you normally pretend that the two-element list is an opaque data structure, if you need to do cdr . (group 2) to get pairs of all but the first two properties for some reason, that's OK. You can do that. There's no casting to list, no manual extraction—this means that's it's easy to insert quick shims to get something to work while you're blazing ahead on something new. You can fix them out later when you're not seized by inspiration. This is very different from otherwise similar languages like Caml*, where you must satisfy the compiler at every step or No Program For You.
  3. No mandated structure. If you decide that some lists need to have a third item to hold a new property, that's fine as long as your code handles it. No declarations need to change and if you're lucky very little else does.

These properties allow you to be flexible, changing your mind many times without having to alter things to make the compiler happy. However, they have a downside as well. Without data type declarations, there is no machine-validated type documentation. With the ability to throw in and handle special cases at any point, lengthy code becomes fragile. And when you use an existing list function in an odd way, you violate expectations of its normal use. And of course things like encapsulation and mandated structure are good in their place.

So what you really need is the discipline to throw away your prototype once it's done its job. Fortunately, this is easy because linked lists have some undesirable performance properties; once you start running your prototype on real data, you will probably find that it needs a performance boost anyway. Even if you stick with Scheme, you'll find that industrial strength Scheme looks quite a bit different from the rapid-prototyping form that is its normal mode.

Comments

*Caml and Scheme are very similar semantically. The primary difference is the importance of data types, and the fact that everything is checked at compile time.

21 July : I hadn't realised it had been so long since I had written something here until a friend pointed it out. Well, I've been here and gone again and recently I've been doing a lot of writing on my qualifying paper about measuring British dialects with a computer. It's pretty interesting and makes me wish I knew more about British dialects. Not KNOWING what your computer is crunching is one of the downsides of computational linguistics (and sometimes, any linguistics).

Today I've been writing a toy compiler. I discovered my first edition Dragon Book while packing books in preparation for moving. I bought it off of ebay my last year of undergraduate to learn about parsing (which was not taught in any of the classes of course)**. It's a great book, although outdated by now since it was first printed in 1977*. Just reading through it makes you want to try the ideas, but at my first reading I was so focussed on parsing that I forgot that the book has more than 250 pages. In fact the last half of the book is all about code generation and optimisation and is now interesting to me that my interest in parsing has cooled (although I may take a seminar on it next semester).

Right now the code has basically three parts: The code generator (compiler), a simple virtual machine and some tests. The only thing the compiler does right now is put everything in tail form—new variables are created when they are assigned to. This is unrealistic for a real VM, so at this point I have created a compiler that emits BASICA rather than ASM. And I must say that the other part, the runtime, it very simple to write. BASIC evaluators are a snap because they are so simple.

Tail form is just the term I picked up from the Scheme community, by the way. All it really means is that there are no expressions and the only emitted statements are at most x = y op z. For example, ka = (x + y) - z is flattened out to:

t1 = x + y
t2 = t1 - z
ka = t2

Notice that the last line is unnecessary. I don't know how to fix that right now, but I expect I will find out as I read more of the Dragon Book. The concept of tail form is extremely useful because if you recognise it you will be able to notice isomorphisms like this one:

var a = [1, 2, 3] // Javascript
var a = new List<Integer>(); // C# (3)
a.Add(1);
a.Add(2);
a.Add(3);

And if you are compiling Javascript to C# you will be able to compile the first form to the second.

Unfortunately at this point I have a bug. I have to pass the current instruction number around so that instructions that generate gotos will know where to jump and somehow the number is coming out too big. I know for sure that some numbers are off because of using a simple but incorrect macro I wrote, but that error should make the numbers smaller not larger. I think.

Here is the code. Feel free to point and laugh but remember this is toy code that is still in flux. If you fix some bugs please send me a patch.

Comments

*You might be interested in buying the newest edition. It looks just as good as the old one, complete with cover art that is 15 years oodate at time of printing.
**Ironically, the computer science page at C of O has a picture of me giving a poster on parsing.

3 July : The problem with iterators in Python is that they fundamentally unfriendly to functional programming: they operate by destructive changes. Here's an example:

def main():
    ls = map(read_unicode, FILES)
    return [max_dist([dist(l1, l2) for l2 in after(l1, ls)]) for l1 in ls[:-1]]
def max_dist(distances):
    "[(num,num)] -> (num,[num])"
    return avg(map(fst, distances)) * 2 * 107, map(snd, distances)
def dist(l1,l2):
    d = lev.totalavgdistance(map(unify, l1), map(unify, l2))
    return d, sum(lev.flevenshtein(unify(s1), unify(s2), d)[-1][-1]
                  for s1,s2 in zip(l1,l2))

Look at the second line of main. See how those are both list comprehensions and not generator comprehensions? The outer one is that way merely for convenience, since I'm running from Interactive Mode and Python doesn't force iterators for display, just in case they're infinite. The second comprehension would be nicer as a generator comprehension though: it's just a bunch of temporaries returned from dist waiting to be averaged. The problem is that the temporaries are tuples, so dist has to unpack them, with map. At first I thought this was ideal, since map converts any iterator to a list, which is required for avg. But the problem is that I have to call map *twice* in order to get the second half of the tuples. This doesn't work because iterators can only be iterated once.

So you see the real problem; a functional programmer wants to think of iterators in terms of streams (lazy lists). These involve destructive changes, but they are invisible to the user: if you have a lazy list, asking for the first item will always return the same thing, you just don't know if the computer had to evaluate it or already had it stored. Functional programmers can only use generator expressions in cases where the subsequent code only needs the iterator interface (first, next, null?) and not the list interface (length, etc.) AND in the case that they know the iterator will only be evaluated once. The first condition obtains a lot, but the second is much harder to track—you have to remember that you are only allowed to use the iterator one time, in one place.

Slightly later: Here's a fix to max_dist that makes it mention the passed iterator only once. I'm still not happy, though. I'm having to do odd things in order to make the computer happy. It should be the other way round. (You might think this is more efficient since it only makes one pass, but I doubt it, since transpose calls zip, which conses a lot too.)

def max_dist(distances):
    "iter<(num,num)> -> (num,[num])"
    ds, sums = transpose(distances)
    return avg(ds) * 2 * 107, sums

Comments

2 July : Talking with Kaleb after getting back to the US made me wonder how far back my saved files go, exactly. I poked around my old hard drive and found a back up from around 2000. This contained an even older backup that had a bunch of Visual Basic and QBasic programs I had written in the Windows 3.1 era. Last year I used the .NET upgrade wizard and noted that it does a good job given the difficulty of its task. So I thought I would throw at it not a program encrusted with the cruft of age, but still compiling on current systems, but ancient programs from the glory days of VB. (That was VB 3, in case you're curious.)

To start, I imported a VB 5 project, which came in with no problems. The only quibbles were the mapping of String * 1 (the idiomatic way to simulate chars in classic VB) onto a String with weird annotations. Or sometimes a wrapper class of String, because the translator had to satisfy the type system AND the object system, neither of which are particularly flexible. The human fix (translate the type to Char) is trivial, but it's things like this that are a deal-breaker for Real World projects. The other problem is that I never finished the program when I wrote it in 2000. I wrote a bunch of code that was supposed to build Huffman trees after learning about them in Discrete Math class, but never got it working. Oops.

Then I tried a VB 3 project. This turned out to be a problem, because devstudio looks for .vbp (VB project file) and there was none. However, I came back after a bit and realised that the contents of the .mak file from the VB 3 project was nearly identical to the .vbp file the from VB 5 project. So I changed the extension. In another 20 seconds I was using a program from my second week of Visual Basic class in 1995. The only thing missing was the updating between scroll bars and labels—the scroll bar event model changed in mysterious ways in .NET, although the wonky standalone focus indication* hasn't changed since Windows 3.1. This should tell you something, I'm not sure what.

In summary, VB.NET can still import code from twelve years ago if it is simple and you're willing to put some work into some trivial fixes. This is great for 40-line programs from high-school VB class, but it just doesn't cut it for the masses of spaghetti-code VB6 programs that would take a month just to cajole into compiling.

*Scroll bars that are not part of the window dressing are rendered using an entirely different process at great expense. So DON'T USE THEM!

Comments

*I'm not necessarily against the word 'blog', I just don't use it. Besides, look at the filename. The filename and real name should be similar.