| View 2007 | Ascending chronological order | Random order |
| Sorted order | Reverse sorted order | Rot-13 Encrypted |
| Reverse order | Vowels removed | All of the above |
| RSS Feed | View all comments | View all |
4 Sep : I have installed a Real Blog (b2evolution) over at http://www.sandersn.com/blog. I do not like it much so far, but it has lots of features. Among them: comments. I will see if I can turn the rest of the features off, and perhaps remake the comment system so that it agrees with my philosophy. As I can't stand edit boxes, I will continue to use Emacs, meaning that this file will continue to grow, although perhaps in a more haphazard way.
1 Sep : A quick note about naming conventions. In imperative languages, functions are always verbs (with some modifiers). This is because they DO something. A lot of them return something too, but most of them DO something and return nothing. This is different from functional languages where functions RETURN something and rarely DO anything. The line between code and data is in a different place than in an imperative language. So the naming conventions change accordingly. Functions are often nouns.
This is nowhere more obvious than 'get'. 'get' is ze prefixe de rigeur in Java convention because all data is encapsulated in some object and only methods can give you access to that data. In a functional language, however, data is named by a noun, whether or not you have to call a function to get it. And some languages, like Haskell, make it rather unclear whether a zero-argument function really is just data in the first place.
For example, I just wrote a function called find_unaligned_chars. But after a little reflection (and a little line-wrap) I changed it to unaligned_chars. Of course you're finding them—what would be interesting is if you deleted them or appended them to something. The 'find' verb is just redundant in this case because it's the default case for functional programming. By the way, here is where unaligned_chars is used.
def dep_repair(i,o):
return [''.join(remove_ns(o, map(fst, cs)))
for cs in powerset(unaligned_chars(i,o,INS))]
the last line is read "for cs in the power set of the unaligned chars between i and o (with respect to the INS operation)".
31 Aug : Here are the links to the papers I promised. The second one is the one you are probably interested in because it tells how to build a bounding tree. However, if you are new to Optimality Theory but understand math, the first paper provides nearly all the background you need and explains the relation of bounding sets to the rest of OT very well.
Optima — Vieri Samek-Lodovici and Alan Prince [ROA 363]
Fundamental Properties of Harmonic Bounding — Vieri Samek-Lodovici and Alan Prince [ROA 785]
In case you don't know, the Rutgers Optimality Archive (ROA) is the main place for people to upload their papers on Optimality Theory. It's like the most advanced online thing linguists have ever done. Well, at one time. The rest of the linguistic world is catching up. I found lingBuzz today and it looks interesting.
30 Aug : I finally needed a powerset implementation today and when I rewrote it this time it came out as one line:
def powerset(l):
return ([x for i,x in enumerate(l) if 2**i & n] for n in xrange(2**len(l)))
So that's pretty cool. The real reason I'm writing today is to record
The new professor just got a new iMac and wanted to know some tips and tricks for use. These are the ones I use all the time. Your mileage may vary.
28 Aug : I had to use Word all morning and it remains, even in its Mac version, a flimsy and easily breakable tool. I'm not entirely sure why it's so popular—it must be designed for tasks quite different from what I am normally trying to do. I had to edit a paper that was written to the APA stylesheet. This was not too difficult: the APA draft standard is still quite complicated, but none of the styles themselves are that complicated. But it is a royal pain to edit the figures. I had to add a new one in the middle, thereby re-ordering all the following ones since Word doesn't know that the list of figures and the figures themselves are connected. What's more, since each figure is on a separate page, the cursor jumps wildly from page to page and more often than not you're staring white space with a corner of a table. Collapsable white space was implemented in Word 2003, but it didn't migrate over to Word 2004 on the Mac.
Then I offered to help Stuart add in some lines to his paper on moras and syllable structure. This was important because he had made all of his diagrams using nothing but \|/. Yup, that's right, ASCII art in 2007. Yet he rightly pointed out that the publisher may or may not support the lines drawn with Word. So I started drawing lines in the longer places where slashes wouldn't reach. Of course this has lots of problems, the principal one being that all the lines are subtly different. I tried to avoid this by copying lines that I judged to look good, but this was limited by the fact that the ASCII art was never the same—remember the underlying base is variable width. Even when I could re-use some lines, the copy/paste interface is idiotic: when you copy, then page several pages down, and click to move the cursor there, pasting STILL pastes right beside the originally copied line. The only way to get the line several pages down is to drag it there, a laborious process that messes up from time to time.
And now for something completely different.
You may know that "ironic t-shirts" are not actually 'ironic' in the older sense of the word—they are actually surprising in an insulting way or something. I don't know a good word for that, so 'ironic' seems like a good choice. Especially since irony isn't used much round these parts. But you may not know that I commonly wear shirts that are ironic in the older sense. Yes, it's true! Here's a short list:
27 Aug : If you are still following the puzzle posted below (that is, you are Josh Rose), here is the link to Arto Anttila's page on t-orders. I suggest the second and fourth handouts as easiest to extract the examples from. Especially the fourth has a gigantic example that should be a good stress test.
22 Aug : Well, it turns out that making that tree diagram is stupendously boring, so I'm going to post the giant GIF (warning, it's like 3000 pixels square, about 8 print pages) (also available as PDF if you're not an Adobe user) with the first two major branches of the tree filled in. I'm sure that you can extrapolate the other three if you want. So here is the setup for the puzzle. A full explanation will follow in a few days, along with my current code, which may or may not work by then. The basic setup is this: the output, the "answer", is the graph in the top-left of the picture. The input is in the root node of the tree (labelled '0'). It's a mapping from segmented words to 'violations'. The violations are actually generated from the words themselves, but that's a whole nother can of worms. For our purposes, just trust me that the lists of numbers are the important part and that the words are just a nice way to give readable output.
In fact, it is possible to solve this problem very inefficiently by trying all permutations of columns and evaluating the results according to an odd algorithm that is not quite specific to Optimality Theory, but this is precisely what I am trying to improve upon, so I have hypothesized that the answer can be extracted from this tree instead. So I should tell you how to build the tree:
How to generate the tree: At each node you may stop if there is either one column or one row left (I will sometimes call the columns 'constraints' and the rows 'candidates' since that is what they represent in Optimality Theory). Otherwise, create as many child nodes as there are columns. For each child node i, look at the ith column. The winners are those candidates (rows) with the minimum for the column i. This problem set is easier because it's all 1s and 0s, so you just choose the 0s. The losers are the ones with 1s (non-minimum value). Ordinarily these are not that important, but I am recording them in the diagram because I think they probably are in this case. You have been hinted! Anyway, eliminate column i and repeat the generation algorithm for this child i.
Continue doing this until all nodes have only one winning row or one column. At each step, to generate the "bounding profile" for that node, take the minimum+1 of each column, unless all values are the same, in which case it's just the minimum. The bounding profile is the second box in each node in the diagram, but I don't think it is important for this problem, although it might be. To make things a little easier to track on the diagram, I labelled each node with the rows that have been removed (counting from 1, not 0, sorry). HOWEVER, I have only completed child 1 and 2 of the entire tree (though I included a couple of nodes from 4's branch). That should be enough to get an algorithm that works, especially for the left sub-graph of the answer.
Now, if you look at the top-right corner, you'll see my current answer. The green and red links are the ones in the answer; my algorithm currently misses the red one. This is because "cost][me" and "cost" have the exact same violation pattern—I think this is as good as you can get without considering the inputs /cost me/ and /cost/ separately. Which I may have to do pretty soon, in which case this giant diagram might get split up. A couple of notes:
Side note: Apparently it is bad juju to let your computer fall to sleep with a USB flash drive attached. Who would have thought? The Finder AND bash told me everything was still connected but hung while trying to actually open anything. I had to disconnect the drive and suffer the placebo 'data damage' warning to get things going again. Bah Humbug!
21 Aug : I have an interesting problem, a puzzle, for you that I am going to post in a few days. I hoped to do it today, but I really want to finish drawing the tree diagram that makes the solution much easier. Actually, I *think* I've got the solution, but I haven't written and run it yet to make sure. So it may turn out to be the moral equivalent of posting my homework online and waiting for someone to do it. Well, except nobody has ever solved this homework and I intend to publish the answer when I get it. So, uh, if you happen to come up with a better solution than me I'll use it and put your name in the acknowledgements or something.
18 Aug : There are about three ways I use a REPL, some of which are less obvious than others. (By the way, REPL means "Interactive Mode" and stands for Read-Eval-Print Loop.)
The first use is the easiest and most obvious, as well as the first you're likely to encounter simply because you usually learn a language at the same time you discover that it has a REPL. One of the nicest features for exploratory use, besides obvious error messages, is Intellisense command completion. Emacs doesn't have very good support for this, so it's been a while since I've used one that did. I think wxPython's PyCrust is good, and F# Interactive might also have Intellisense. Haskell doesn't support definition of types at the command line, which cripples its exploratory use.
The second use is a bit stickier in terms of usability. If you define a function from a file, you need to the ability to redefine it easily when you change your mind. Ideally, the REPL will support both definition from snippets (in effect, an automatic cut-and-paste) as well as loading from files. Many REPLs also support two ways of loading from a file as well: one way is to paste the whole file without regard for namespacing niceties, and the other is to import the file in the way a compiler would.
The third I have only found myself using for research where I am the only user, and only in Emacs-supported Scheme, Common Lisp or Python REPLs, all of which support whole-program loading with some degree of facility. (Scheme the least of the three). This is hard to get right: Python's system is simple but has problems completely reloading sometimes, while Common Lisp's is confusing but of course works nicely once you've understood everything. This use is so hard to implement and useful for so few people that I can see why it's so rare. But I really like it because instead of an HTML output file, I get a live data structure that I can play with. I do have problems remembering to record all the changes I've made sometimes, though.
Note: My experience has been shaped primarily by Scheme, Common Lisp, Python, OCaml and to a lesser extent Ruby, Haskell and F#. If you've used REPLs for BeanShell, Javascript, Scala, Groovy (shudder), your experiences may differ. Please comment after I get the commenting problems sorted out.
13 Aug : Once again Caml shows that it wasn't designed but just sort of stuck together. It has two working ways to annotate separate compilation units with types. The first is to annotate the types inline, which is horrendously ugly due to several mistakes in the 'design' of the syntax:
let add_first (number : int) (i::is : int list) : int = i + number
All those parentheses are probably why SML coders prefer to curry their functions so that they can use commas to separate arguments like most languages do. Even so, I don't like all those types making my lines super-long. I'd prefer the type annotation to be on a separate line, which also makes it easier to kill if you change a bunch of stuff and want the compiler to infer the type for you.
The other possibility is to move the type annotations to a separate 'signature' file. Does this remind you of something? Could it be C++? Why yes I believe it is. I hate having to open a separate file to maintain and view types. But there is a third solution...it just happens not to work. You can declare a module inside your code file and stick the signature, which contains the type annotations for all your code, at the top of the module.
The problem is that this is a NESTED module, so you end up saying stupid stuff like Constraints.Constraints.id_voice. Does this remind you of something? Could it be Java? Why yes I believe it is. Fortunately, you can say use Constraints as a stupid workaround to shave off the first Constraints. The result ends up looking very much like Python, except that it is probably quite annoying to those proficient in Caml. I know I'm not completely happy.
If you're curious, here's the code for a module containing the code in the first example:
module Adder : sig val add_first int -> int list -> int end = struct let add_first number (i::is) = i + number let private_function i = i + 1 end (* In another file... *) open Adder Adder.add_first 12 [1;2;3];;
Also, I know that comments are not working all the time right now. My hosting company is having trouble providing me with the Python library to interface with MySQL. I might have it fixed soon.
7 Aug : I am about two-thirds moved in at the Wampler's upstairs apartment. The third that is missing mostly cleaning supplies and cooking implements, and I only really miss the latter (as yet). Unfortunately this week seems to be the hottest of the year—I'm dreading the cleaning at the end of the week.
I braved the lion's den today, the Parking Permit Office, and came away with Second Prize, a C Permit. I have yet to determine where this will let me park; I hope to upgrade it later to an A Permit, but it seemed like to much to attempt in one go. Especially since the bureaucrat-at-arms assured me that only teaching assistants get A Permits and so long as I remain a research assistant I Will. Not. Get one. I didn't want to argue, but my contracts for the last two years have had a clause stating that I am eligible for an A. Maybe that's changed. I got off a lot better than the guy in front of me, a new grad student in the English department. He wasn't on The List, and didn't have an Employee ID, both of which are Required. So no Permit for him! Anyway now I have to find out where C permit holders are allowed to park because I have no idea.
Meanwhile, I am running a six letter Gen right now on Jones, now that the air conditioner is fixed and I'm not worrying about frying it*. I expect it to run in about 22 hours and cons about 768 MB. Jones usually runs about 1.5 GB free, so as I worriedly sat there watching the memory usage tick upwards, I realised that I don't actually need to keep all those potential winners just to run Recursive Constraint Demotion. All you need for that is the violation profiles of the possible winners and then the form that was actually heard. Unfortunately, to get the correct violation profiles, you still need to look at every possible candidate, and I have no way to know when to improve that as yet. But I have some ideas, at least for straight Max/Dep constraints—I may end up hard-coding something to look for them especially.
Of course, I suspect the real problem is that most linguists drastically underspecify the constraint set and just assume that the example forms that they give are representative. And they usually are, even if there are jillions of nearly nonsense words that have the same violation profile for the three to five constraints they mention. This practise only breaks down if one of those nonsense words happens to be better than the actual winner, because it means that the proposed constraint ranking is wrong. Kartunnen pointed out that this actually happens, and I suspect it's a lot more often than the single example he gave.
Well, the new iMacs are out. (Finally!) As soon as I get the reimbursement from my flight to the ACL in Prague, I will be buying one. I know, they're a bad value for the money, Macs are only really compelling as laptops, blah blah. But they jive well with my philosophy so I am going to buy one anyway. Anyway I can still run Windows—VMWare released their consumer Mac virtualisation program a couple of days ago in preparation for this announcement, and Parallels has been around for a while (although reviews give better marks to VMWare). I also plan to try Linux and at least play around with it.
Also, Apple finally updated their keyboards and yes, its USB ports are FINALLY 2.0 like every other decent keyboard on the planet. It even looks better than the Dell green-light special** I have...I actually like the laptop form factor better for keyboards because it means the mouse is closer to your hand. (I would just move the mouse to the left side, but my left hand has enough trouble with wrist pain already.) The main advantage a desktop keyboard has is that the arrow keys are bigger and there are 19 (!) F keys. Plus it looks like the wireless keyboard doesn't have USB ports. So I'll probably get the wired one and buy the wireless one later if I miss the keyboard layout of my trusty Powerbook.
*The air conditioner in the comp ling room went out on
Thursday/Friday and it's been hot enough to worry about the
safety of the servers without air conditioning. We turned off
the non-web serving one (smith) and left jones on hoping that
it wouldn't meltdown.
**I have no idea why it's 40 bucks on that page. I only paid 20,
which is about right. Disregard the stupid plastic "wrist rest".
2 Aug : Free For All. Chapter 5 is my favourite so far. I had to stop reading at that point.
Also, I'm making a fair attempt at learning to use viper-mode in emacs. For after-the-fact editing tasks, the vi keyset is nicer because it has finer control over greater distances. For creation of new text, it's not so great because the inconvenience of switching modes loses for quick local edits. Of course, anecdotal accounts suggest that vi keys are easier on your hands in the long run, too. Perhaps it's because the cursor keys are on the right hand (h,j,k,l) which takes pressure off my over-used left hand. At any rate, viper-mode allows me to use BOTH emacs and vi keys in the appropriate mode, which is extremely powerful and difficult to keep straight. I'm not sure I'll keep it up. If nothing else it's good practise to keep my mind flexible I guess...
30 July : Here's another quandary I ran into about strong type systems. I very commonly use a tree representation that is (node, [children]) which stems originally from learning to represent trees as (node . children) in Lisp. A leaf just has an empty list since it has no children: (leaf, []). All well and good, this is not irregular as are some tree representations in dynamically typed languages. So I was thinking to myself, "if Haskell syntax is basically the same* as Python's except nicer in places, why am I not using Haskell?" So I wrote a toy function:
tw [row] = (row, [])
tw (n:ns) = (n, map (\ i -> tw ns) [1..3]
<interactive>:1:31:
Occurs check: cannot construct the infinite type: b = (a, [b])
Expected type: b
Inferred type: (a, [b])
In the application `tw ns'
In a lambda abstraction: \ i -> tw ns
At least the error message is better than Caml's (which I tried first). Of course this tree representation *works*, you just have to wrap it in type, give it a name:
data Tree = T (Int, [Tree]) tw [row] = T (row, []) tw (n:ns) = T (n, map (\ i -> tw ns) [1..3])
Of course Haskell then becomes MUCH more inconvenient even than Caml because you can't declare new data types in interactive mode, and new data types are not by default printable, so you can't see the results of tw without implementing the Show interface. These are not central to the problem of course, but more than enough to remind me why I don't use statically typed languages for prototyping yet. Probably someday soon they'll catch up; Caml is already almost as good (although extremely ugly) and Haskell is pretty if a little rough around the edges.
Oh, yeah, that tree question came up because I got a super-fast implementation of Gen (from OT) working based on tree search instead a search of all subsets. Well, super-fast compared to O(n!) of all subsets. I think it's O(n lg n) but I might be wrong—that's a pretty big improvement. Well, and it's not done yet, but the part that's done works.
*Only similar for functional style. Imperative Haskell looks similar to C rather than Java like imperative Python does. (Neither are very like those other languages.)
29 July : Bonus content free day! This is why I don't do classes.
class Col():
def __init__(self, constraints=None, vlns=None):
self.cons = constraints or []
self.vlns = vlns or [[]]
def win_fav(ns):
for n in ns:
if n > 0:
return True
else:
return False
def tied(ns):
for n in ns:
if n!=0:
return False
else:
return True
def lose_disfav(ns):
for n in ns:
if n < 0:
return False
else:
return True
def markcol(self, i):
return marked(self.cols[i])
def fav_active(self, i):
return win_fav(self.vlns[i])
### versus ! ###
# maybe someday I should create a Col class....nahhhh
positive = cur(lt, 0)
negative = cur(gt, 0)
win_fav = cur(some, positive)
tied = cur(every, lambda x:x==0)
lose_disfav = negate(cur(some,negative))
markcol = compose(marked, fst)
fav_active = compose(win_fav, snd)
Admittedly not all OO code is that bad at re-using abstractions, but most is. I especially hate the "encapsulated list" pattern visible in markcol and fav_active.
28 July : Guess what this code does without looking below to find out:
from util import *
MAX_NGRAM = 7
keys = {'1':'_,@', '2':'abc', '3':'def', '4':'ghi', '5':'jkl', '6':'mno',
'7':'pqrs', '8':'tuv', '9':'wxyz', "*":' ', '0':"+"}
corpus = dct.count(word[:n]
for line in open('/Users/zackman/Documents/ebooks/Democracy in America--Vol 1.txt')
for word in line.split()
for n in range(1,MAX_NGRAM)
if len(line) > n)
def predict(ks):
return sorted(map(''.join, lst.cross(*map(keys.get, ks))),
key=corpus.get, reverse=True)
print predict('263')[:5]
If you guessed "predictive texter", you're right. Like most good ideas, a fake prototype can be written in a dozen lines. There are many weak points that would be easy to improve. The corpus I used is over 150 years old and written by a French author. The predicter stops predicting for words over 6 letters long. All words are stored together. There is no facility for telling the predicter that part of a word is right and going from there. The predictor doesn't try to complete the whole word when you've only typed part of it. And naive counting has lots of problems that I don't even want to get into but you can read for yourself (warning: extremely boring book).
The idea I had for predicting words longer than 6 is to only pull up words in a dictionary, because frequency stops being such a good indicator for words that long. There is probably only one correct word anyway. But I don't have enough memory free on my laptop to play with that while I continue working on my REAL work.
For example, the example above (predict('263')[:5]) returns ['and', 'bod', 'ame', 'cod', 'coe'], which is good in that 'and' is the best word, but bad in that the subject matter shows through in result 3, 'ame', the prefix of 'America'. The corpus is Volume 1 of Alexis de Tocqueville's Democracy in America.
I got the idea for this from (1) getting a cell phone (2) talking about an internship/job with Tegic, who is the major provider of predictive texting. Obviously a real system is more complicated and cleverly optimised (and likely written in C++) but a 12-line prototype just makes you feel good all over.
In other news, Boy Scouts have taken over campus! Film at 11!
27 July : A couple of topics today. First I want to ridicule the 'Black Google' idea, where this guy made an inverted-colour Google home page to save energy. Except that only CRTs save energy from displaying black; most LCDs use *more* energy to display black because the backlight is on. What's more, the CRTs that are still in use are probably not generating anywhere near their fair share of Google queries because they are likely hooked up to old computers with limited (if any) Internet access. Light text on a dark background is very likely a usability problem to boot; I have an unfortunate addiction to this colour scheme when coding and it actually *reduces* battery life because you have to crank up brightness more to get the equivalent level of visibility. So to actually save energy on Google, just turn your LCD brightness down. Or switch from using a dryer to a clothes line.
Also, in my 'successful implementation of Gen' below, I mentioned that lst.cross would make you run out of memory for large values. It's pretty obvious; if your alphabet size is 45 (the average number of phonemes in English*), a word of length four has 445 possibilities. You *don't* want to store all the possibilities for a ten letter word. Unfortunately, my original recursive lst.cross does just that, generating the list incrementally:
def cross(*ls):
"""[[]] -> [[]] Return all permutations of lists in n.
Examples:
>>> cross([1],[2,3],[3])
[[1, 2, 3], [1, 3, 3]]
>>> cross([1,2,3],[2],[3])
[[1, 2, 3], [2, 2, 3], [3, 2, 3]]
"""
if not ls:
return [[]]
else:
rests = cross(*cdr(ls))
acc = []
for x in car(ls):
acc.extend([[x]+rest for rest in rests])
return acc
alfa = 'abc'
print cross(*([alfa] * 4)) # for a four letter word
Like all recursive solutions to recursive problems the code is quite short, but in this case it is not trivial to transform the recursion to a while loop because there is a quite a bit going on after the recursive call in the non-terminal branch. I played around with a continuation-passing transformation and got nowhere. Then I Googled for solutions and found a smart numeric solution to the power set problem.
A power set is not quite what I'm looking for: it's all subsets without replacement. I'm looking for all subsets with replacement. But the idea is similar. A power set of {a,b,c} is {{}, {a}, {b}, {c}, {ab}, {ac}, {bc}, {abc}} while my problem specifies number of times to choose from the alphabet: {a,b,c} * 2 is {{}, {a}, {b}, {c}, {aa}, {ab}, {ac}, {ba}, {bb}, {bc}, {ca}, {cb}, {cc}}. The numeric trick for power set is to notice that the elements chosen from the source set follow the same pattern as bits in a binary number. So you can loop up to 2n and generate all elements of the power set.
At first I thought this trick was useless for choosing all elements with replacement. But look! Reduce the alphabet to size two: {a,b} gives {{}, {a}, {b}, {aa}, {ab}, {ba}, {bb}} while the power set is {{}, {a}, {b}, {ba}, {bb}}. It's even more obvious if you have the alphabet {0,1,2,3,4,5,6,7,8,9}; for size 2 you get {{}, {0} ... {9}, {00}, {01} ... {42} ... {99}}. That's just counting, plus some zero fills. So here's what I ended up with.
def gen_strings(start=0, stop=100):
"100 is a hoky limit but you will still wait a LONG time to get them all"
alfa = '0123456789'
for n in xrange(len(alfa)**stop):
acc = []
i = 1
while i <= n:
acc.append(alfa[(n // i) % len(alfa)])
i *= len(alfa)
yield ''.join(acc)
while i < len(alfa)**stop:
acc.append(alfa[0])
i *= len(alfa)
yield ''.join(acc)
The first while loop just yields the normal counting numbers over whatever alphabet you're using. The second one provides all zero-prefixed variants. In other words, you don't normally count "01,02,03..." but "1,2,3...". The second while loop fills in those zeros. The down side is that you don't get the numbers in counting order any longer. But of the giant upside is that now you have an iterator that provides only as many answers as you want in constant space.
*The number of phonemes in English varies according to your dialect. For both 'cot' and 'caught', I have a single sound intermediate between Standard English's phonemes /ɑ/ and /ɔ/. So my phonemic inventory is one smaller (at least) than Standard English. Compare this to some Atlantic dialects that have MORE vowels down there.
25 July : I just finished writing (and running) a successful implementation of Gen from OT, and it was so exciting that I have to sit down and write about it right away. For those of you who don't understand OT's Gen function, but do know what Python generators are, you can probably guessed how I implemented it. Optimality Theory (OT) says that when we pronounce a word, we take our knowledge of its representation (sort of like the spelling) and generate an infinite number of possible pronunciations, choosing the one that best satisfies the constraints of your language.
That's all very nice for a theory, but the idea that we generate an infinite number of words for every word we say is ridiculous. It's even ridiculous for a computer implementation, which is what I'm making. Nevertheless, here is a naive, infinite implementation:
def gen_strings(start=0, stop=100):
"100 is a hoky but realistic limit (because of use of lst.cross)"
alfa = 'abdefghijklmnoprstuvwz'
for i in itertools.count(start):
if i > stop: return
for s in itertools.imap(''.join, lst.cross(*([alfa]*i))):
yield s
Well, not completely naive. I do allow the user to pass a stop value because ALL possible combinations of 100 characters with replacement is already. So it's already not infinite, just really big. Since I use lst.cross to generate the combinations, you'll run out of memory trying to generate them all anyway. Anyway, you can say for word in gen_strings(): print word and wait for a very long time before the last word prints. However, what we really want to do is look at words that could possibly appear—a lot of words are so unlike the input that they will always be beaten by some other word or set of words. Here's the resulting code:
def gen(i, o, cons):
winners = []
for cand in gen_strings(stop=len(o)):
if not bounding(table(i, cand, winners, cons)):
print cand, ":", [rcd.ot_eval(c,cand,i) for c in cons], '(', len(winners), ')'
winners.append(cand)
return winners
bounding returns the bounding set, the set of competing candidates that will *always* beat the current candidate word. If there are none, then the word is a possible winner and is added to the set so that it can bound. Of course this is a little lossy because there are some candidates early on that get added to winners which might be bounded by candidates added later, but this isn't a big deal compared to all the candidates we are *actually* getting rid of. On a toy problem, this eliminated all but 59 of the 506 candidates of length 2 or less and all but 59 of the 11,154 candidates of length 3 or less (which is the expected result; you can't pronounce a short word as super-long and get away with it).
Finally, this is only interesting if you follow OT, but the code is based on ROA number 363, and the definitions of bounding and table are given below.
def table(input, w, losers, constraints):
return [[rcd.ot_eval(c,l,input) - rcd.ot_eval(c,w,input) for l in losers]
for c in constraints]
def bounding(tab): # n^2 * k-1 = O(n^3)
while True: # O(k - 1) (average case is better)
promote, demote = lst.bifilter(rcd.lose_disfav, tab) # n^2
if not demote or not promote: return demote # 1
tab = filter_rows(promote, demote) # n^2 (ideally)
def filter_rows(promote, demote):
return lst.transpose(rcd.filterby(rcd.tied,
lst.transpose(promote),
lst.transpose(demote)))
The calls to lst.transpose are slow and only necessary to make the code prettier, so I probably ought to get rid of them.
23 July : Don't tell my dad, but I use a thesaurus a lot these days. At least it's an electronic one with good response time. I've been writing a lot lately and I have trouble thinking of sensible words. I keep trying to write "concomitant" instead of "associated" in one of my papers and I'm pretty sure even in an academic paper that's ridiculous. Then I wanted to avoid "pooh-poohed" in an e-mail because it sounds stupid and could only come up with "denigrated". Thesaurus to the rescue! Thank you, OS X Tiger! (Buy today, only $129)
22 July : Maybe you noticed the data structure at the top of my toy compiler yesterday:
(def place car) (def code cdr) (def make-exp cons) (def nil '())
This is a very common way to prototype code in Lisp, but it really ISN'T a data type declaration. In fact, the equivalent Javascript is just
function place(exp) { return exp[0] }
function code(exp) { return exp[1] }
function makeExp(place, code) { return [place, code] }
var nil = null // just because I like writing nil, that's all
Now it's obvious that I'm just wrapping a two-element list in a few functions and pretending it's a data structure. Why didn't I at least use the perfectly good struct capabilities built in to MzScheme?
It turns out that in Lisp, prototyping with lists as the only data structure has some advantages that make it one of the best prototyping languages. Unfortunately, those advantages are weaknesses when the time comes to create production code. Base Scheme doesn't have many ways to switch over, although other Lisps vary.
These properties allow you to be flexible, changing your mind many times without having to alter things to make the compiler happy. However, they have a downside as well. Without data type declarations, there is no machine-validated type documentation. With the ability to throw in and handle special cases at any point, lengthy code becomes fragile. And when you use an existing list function in an odd way, you violate expectations of its normal use. And of course things like encapsulation and mandated structure are good in their place.
So what you really need is the discipline to throw away your prototype once it's done its job. Fortunately, this is easy because linked lists have some undesirable performance properties; once you start running your prototype on real data, you will probably find that it needs a performance boost anyway. Even if you stick with Scheme, you'll find that industrial strength Scheme looks quite a bit different from the rapid-prototyping form that is its normal mode.
*Caml and Scheme are very similar semantically. The primary difference is the importance of data types, and the fact that everything is checked at compile time.
21 July : I hadn't realised it had been so long since I had written something here until a friend pointed it out. Well, I've been here and gone again and recently I've been doing a lot of writing on my qualifying paper about measuring British dialects with a computer. It's pretty interesting and makes me wish I knew more about British dialects. Not KNOWING what your computer is crunching is one of the downsides of computational linguistics (and sometimes, any linguistics).
Today I've been writing a toy compiler. I discovered my first edition Dragon Book while packing books in preparation for moving. I bought it off of ebay my last year of undergraduate to learn about parsing (which was not taught in any of the classes of course)**. It's a great book, although outdated by now since it was first printed in 1977*. Just reading through it makes you want to try the ideas, but at my first reading I was so focussed on parsing that I forgot that the book has more than 250 pages. In fact the last half of the book is all about code generation and optimisation and is now interesting to me that my interest in parsing has cooled (although I may take a seminar on it next semester).
Right now the code has basically three parts: The code generator (compiler), a simple virtual machine and some tests. The only thing the compiler does right now is put everything in tail form—new variables are created when they are assigned to. This is unrealistic for a real VM, so at this point I have created a compiler that emits BASICA rather than ASM. And I must say that the other part, the runtime, it very simple to write. BASIC evaluators are a snap because they are so simple.
Tail form is just the term I picked up from the Scheme community, by the way. All it really means is that there are no expressions and the only emitted statements are at most x = y op z. For example, ka = (x + y) - z is flattened out to:
t1 = x + y t2 = t1 - z ka = t2
Notice that the last line is unnecessary. I don't know how to fix that right now, but I expect I will find out as I read more of the Dragon Book. The concept of tail form is extremely useful because if you recognise it you will be able to notice isomorphisms like this one:
var a = [1, 2, 3] // Javascript
var a = new List<Integer>(); // C# (3) a.Add(1); a.Add(2); a.Add(3);
And if you are compiling Javascript to C# you will be able to compile the first form to the second.
Unfortunately at this point I have a bug. I have to pass the current instruction number around so that instructions that generate gotos will know where to jump and somehow the number is coming out too big. I know for sure that some numbers are off because of using a simple but incorrect macro I wrote, but that error should make the numbers smaller not larger. I think.
Here is the code. Feel free to point and laugh but remember this is toy code that is still in flux. If you fix some bugs please send me a patch.
*You might be interested in buying the newest edition. It looks just as good as the old
one, complete with cover art that is 15 years oodate at time of
printing.
**Ironically, the computer science page at C of O has a picture of
me giving a poster on parsing.
3 July : The problem with iterators in Python is that they fundamentally unfriendly to functional programming: they operate by destructive changes. Here's an example:
def main():
ls = map(read_unicode, FILES)
return [max_dist([dist(l1, l2) for l2 in after(l1, ls)]) for l1 in ls[:-1]]
def max_dist(distances):
"[(num,num)] -> (num,[num])"
return avg(map(fst, distances)) * 2 * 107, map(snd, distances)
def dist(l1,l2):
d = lev.totalavgdistance(map(unify, l1), map(unify, l2))
return d, sum(lev.flevenshtein(unify(s1), unify(s2), d)[-1][-1]
for s1,s2 in zip(l1,l2))
Look at the second line of main. See how those are both list comprehensions and not generator comprehensions? The outer one is that way merely for convenience, since I'm running from Interactive Mode and Python doesn't force iterators for display, just in case they're infinite. The second comprehension would be nicer as a generator comprehension though: it's just a bunch of temporaries returned from dist waiting to be averaged. The problem is that the temporaries are tuples, so dist has to unpack them, with map. At first I thought this was ideal, since map converts any iterator to a list, which is required for avg. But the problem is that I have to call map *twice* in order to get the second half of the tuples. This doesn't work because iterators can only be iterated once.
So you see the real problem; a functional programmer wants to think of iterators in terms of streams (lazy lists). These involve destructive changes, but they are invisible to the user: if you have a lazy list, asking for the first item will always return the same thing, you just don't know if the computer had to evaluate it or already had it stored. Functional programmers can only use generator expressions in cases where the subsequent code only needs the iterator interface (first, next, null?) and not the list interface (length, etc.) AND in the case that they know the iterator will only be evaluated once. The first condition obtains a lot, but the second is much harder to track—you have to remember that you are only allowed to use the iterator one time, in one place.
Slightly later: Here's a fix to max_dist that makes it mention the passed iterator only once. I'm still not happy, though. I'm having to do odd things in order to make the computer happy. It should be the other way round. (You might think this is more efficient since it only makes one pass, but I doubt it, since transpose calls zip, which conses a lot too.)
def max_dist(distances):
"iter<(num,num)> -> (num,[num])"
ds, sums = transpose(distances)
return avg(ds) * 2 * 107, sums
2 July : Talking with Kaleb after getting back to the US made me wonder how far back my saved files go, exactly. I poked around my old hard drive and found a back up from around 2000. This contained an even older backup that had a bunch of Visual Basic and QBasic programs I had written in the Windows 3.1 era. Last year I used the .NET upgrade wizard and noted that it does a good job given the difficulty of its task. So I thought I would throw at it not a program encrusted with the cruft of age, but still compiling on current systems, but ancient programs from the glory days of VB. (That was VB 3, in case you're curious.)
To start, I imported a VB 5 project, which came in with no problems. The only quibbles were the mapping of String * 1 (the idiomatic way to simulate chars in classic VB) onto a String with weird annotations. Or sometimes a wrapper class of String, because the translator had to satisfy the type system AND the object system, neither of which are particularly flexible. The human fix (translate the type to Char) is trivial, but it's things like this that are a deal-breaker for Real World projects. The other problem is that I never finished the program when I wrote it in 2000. I wrote a bunch of code that was supposed to build Huffman trees after learning about them in Discrete Math class, but never got it working. Oops.
Then I tried a VB 3 project. This turned out to be a problem, because devstudio looks for .vbp (VB project file) and there was none. However, I came back after a bit and realised that the contents of the .mak file from the VB 3 project was nearly identical to the .vbp file the from VB 5 project. So I changed the extension. In another 20 seconds I was using a program from my second week of Visual Basic class in 1995. The only thing missing was the updating between scroll bars and labels—the scroll bar event model changed in mysterious ways in .NET, although the wonky standalone focus indication* hasn't changed since Windows 3.1. This should tell you something, I'm not sure what.
In summary, VB.NET can still import code from twelve years ago if it is simple and you're willing to put some work into some trivial fixes. This is great for 40-line programs from high-school VB class, but it just doesn't cut it for the masses of spaghetti-code VB6 programs that would take a month just to cajole into compiling.
*Scroll bars that are not part of the window dressing are rendered using an entirely different process at great expense. So DON'T USE THEM!
*I'm not necessarily against the word 'blog', I just don't use it. Besides, look at the filename. The filename and real name should be similar.