However, even Prof. Jurafsky seems to be backing off on the honor code somewhat. From this thread:
In any case, absolutely, it's fine to share code fragments as you help each other debug.
Posted by Dan Jurafsky (Lecturer)
on Wed 25 Apr 2012 12:05:35 PM PDT
---
I'm trying to use nltk to work through the assignment. Chapter 8 of the nltk book gives an example at the end:
>>> viterbi_parser = nltk.ViterbiParser(grammar)
>>> print viterbi_parser.parse(['Jack', 'saw', 'telescopes'])
(S (NP Jack) (VP (TV saw) (NP telescopes))) (p=0.064)
The problem is, when I try to parse the example sentence "Cats scratch walls with claws" from the nlp-class homework example, I get:
ValueError: Grammar does not cover some of the input words: "'Cats', 'scratch', 'claws'".
So the Penn Treebank corpus sample that I induced the grammar from didn't contain the words "cats", "scratch", or "claws". And the parser chokes.
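To be concrete about what I mean by "induced the grammar": roughly, it's something like the sketch below, using nltk's induce_pcfg over the 10% Penn Treebank sample that ships with nltk (treating 'S' as the start symbol is my assumption):

import nltk
from nltk.corpus import treebank
from nltk.grammar import Nonterminal, induce_pcfg

# Requires the sample corpus: nltk.download('treebank')
# Collect every production from the parsed sentences in the 10% sample.
productions = []
for tree in treebank.parsed_sents():
    productions += tree.productions()

# induce_pcfg counts the productions and normalizes the counts into
# rule probabilities, giving a PCFG rooted at S.
grammar = induce_pcfg(Nonterminal('S'), productions)
viterbi_parser = nltk.ViterbiParser(grammar)

Any word that never appears as a terminal in those productions, like "Cats", simply has no lexical rule, hence the ValueError above.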
If I want to add that sentence to the corpus, I have to find where it's located on disk (I think it might be in Users/trane/AppData/Roaming?) and add it in Penn Treebank format. It's busy work: editing non-code files by hand and then rerunning the training algorithm.
I would rather interact with the program directly at runtime to add words (or rules) it doesn't know, teach it new heuristics, and correct it when it's wrong. (A problem with adding "cats" to the grammar productions is that the NNS productions then have to be re-normalized. It seems nltk's CKY implementation has no smoothing to deal with out-of-vocabulary words...)
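One workaround sketch for the re-normalization point, assuming I guess the tags for the new words (NNS/VBP are my guesses, not anything the homework specifies): append extra lexical productions and let induce_pcfg redo the counting, so the probabilities for each left-hand side sum to 1 again without touching the treebank files on disk. This is a manual hack, not a built-in smoothing mechanism.

import nltk
from nltk.corpus import treebank
from nltk.grammar import Production, Nonterminal, induce_pcfg

# Same productions as before, from the 10% treebank sample.
productions = [p for tree in treebank.parsed_sents() for p in tree.productions()]

# Hypothetical lexical rules for the missing words (tag choices are my guess).
extra = [Production(Nonterminal('NNS'), ['Cats']),
         Production(Nonterminal('VBP'), ['scratch']),
         Production(Nonterminal('NNS'), ['claws'])]

# induce_pcfg re-counts everything, so the NNS and VBP rules come out
# normalized again with the new words included.
grammar = induce_pcfg(Nonterminal('S'), productions + extra)
viterbi_parser = nltk.ViterbiParser(grammar)

Still, this rebuilds the whole grammar from scratch each time, which is exactly the train-once workflow I'm complaining about.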
I think the approach to AI taken in nlp-class is missing a key feature of learning: interactivity. They (the professors) seem more interested in training once, testing once (or at least pretending that they only ran against the test set once!), and reporting a score in a journal article than in using these tools directly in their personal lives.
Their role is to train and evaluate; they leave it to others to make the tool useful. But what if they emphasized building interesting programs that we can easily customize to suit our whims, from the start?
---
Further thought: perhaps what I need to do is induce the grammar from the homework's data instead of from the Penn Treebank corpus. So how easy is it to get nltk to read in that new data...?
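If the homework data is in bracketed, Penn-Treebank-style format, nltk's BracketParseCorpusReader should be able to read it; something like this sketch (the directory and filename are placeholders for wherever the homework files actually live):

import nltk
from nltk.corpus.reader import BracketParseCorpusReader
from nltk.grammar import Nonterminal, induce_pcfg

# Placeholder root and fileid -- point these at the homework's parsed data.
reader = BracketParseCorpusReader('path/to/homework/data', ['train.trees'])

# Induce the PCFG from the homework trees instead of the treebank sample.
productions = [p for tree in reader.parsed_sents() for p in tree.productions()]
grammar = induce_pcfg(Nonterminal('S'), productions)
viterbi_parser = nltk.ViterbiParser(grammar)

That at least keeps the vocabulary in sync with the sentences the homework actually asks about.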