More Problems with Statistical NLP

By trane in trane's Diary
Sat May 05, 2012 at 11:00:11 AM EST
Tags: nlp, ai, coursera, stanford, nlp-class, cky, nltk, reinventing the wheel, conformity, honor code (all tags)

Programming Assignment 6 for Stanford's nlp-class is to implement a CKY parser.

As I work on it, and read the discussion forums on the class site, I feel that this has been done so many times before; the thoughts I have as I go through the assignment are predictable, and I'm following the same trail countless others have already trodden. What is the point of solving a problem that has been done to death? I want to try something different.

And those who started early and have already finished are so smug on the forums! They beat around the bush trying to give hints while bending over backwards to make sure they aren't violating the infernal honor code; they waste their creativity on finding devious ways of saying what they might otherwise say clearly and explicitly. The honor code makes the desire to share dishonorable. It encourages "the closed fist of a teacher who holds some knowledge back" (the opposite of Buddha's approach).


However, even Prof. Jurafsky seems to be backing off on the honor code somewhat. From this thread:


In any case, absolutely, it's fine to share code fragments as you help each other debug.

Posted by Dan Jurafsky (Lecturer)
on Wed 25 Apr 2012 12:05:35 PM PDT

---

I'm trying to use nltk to work through the assignment. Chapter 8 of the nltk book gives an example at the end:


>>> viterbi_parser = nltk.ViterbiParser(grammar)
>>> print viterbi_parser.parse(['Jack', 'saw', 'telescopes'])
(S (NP Jack) (VP (TV saw) (NP telescopes))) (p=0.064)

The problem is, when I try to parse the example sentence "Cats scratch walls with claws" from the nlp-class homework example, I get:


ValueError: Grammar does not cover some of the input words: "'Cats', 'scratch', 'claws'".

So the Penn Treebank corpus sample that I induced the grammar from didn't contain the words "cats", "scratch", or "claws". And the parser chokes.
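
That ValueError actually comes from the grammar's own coverage check, so you can at least test coverage up front instead of letting the parser blow up mid-parse. A minimal sketch, assuming the NLTK 2.x grammar API (check_coverage raises ValueError listing the missing words):

sentence = 'Cats scratch walls with claws'.split()

# check_coverage raises the same ValueError the parser would,
# naming exactly which words the grammar has no productions for.
try:
    grammar.check_coverage(sentence)
except ValueError as e:
    print 'Cannot parse; OOV words:', e
else:
    print viterbi_parser.parse(sentence)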

If I want to add that sentence to the corpus, I have to find where the corpus is stored (I think it might be in Users/trane/AppData/Roaming?), add the sentence in Penn Treebank format, and so on. It's busywork: changing non-code files, then rerunning the training algorithm.

I would rather interact with the program directly at runtime: add words (or rules) it doesn't know, teach it new heuristics, correct it when it's wrong. (A problem with adding "cats" to the grammar productions is that the NNS productions then have to be re-normalized. It seems nltk's CKY implementation has no smoothing to deal with out-of-vocabulary (OOV) words...)
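
One cheap workaround, short of hand-editing the treebank files: before inducing the grammar, rewrite rare training words as an '<UNK>' token, then map OOV input words to '<UNK>' at parse time. Since induce_pcfg re-normalizes the production counts itself, this also takes care of the NNS re-normalization problem above. A rough sketch (the '<UNK>' convention and the hapax threshold are my choices, not anything nlp-class or nltk prescribes, and Viterbi parsing with the full induced grammar will be slow):

from collections import Counter

import nltk
from nltk.corpus import treebank

# Words seen only once in training become '<UNK>', so the induced
# grammar ends up with re-normalized productions covering unknowns.
counts = Counter(treebank.words())
productions = []
for tree in treebank.parsed_sents():
    for pos in tree.treepositions('leaves'):
        if counts[tree[pos]] == 1:
            tree[pos] = '<UNK>'
    productions += tree.productions()

grammar = nltk.induce_pcfg(nltk.Nonterminal('S'), productions)
parser = nltk.ViterbiParser(grammar)

# At parse time, anything the grammar's lexicon doesn't know maps to '<UNK>'.
lexicon = set(p.rhs()[0] for p in grammar.productions() if p.is_lexical())
tokens = ['Cats', 'scratch', 'walls', 'with', 'claws']
print parser.parse([w if w in lexicon else '<UNK>' for w in tokens])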

I think the approach to AI taken in nlp-class is missing a key feature of learning: interactivity. The professors seem more interested in training once, testing once (or at least pretending that they only ran against the test set once!), and reporting a score in a journal article than in using these tools directly, in their personal lives.

Their role is to train and evaluate; they leave it to others to make the tool useful. But what if they emphasized building interesting programs that we can easily customize to suit our whims, from the start?

---

Further thought: perhaps what I need to do is induce the grammar from the homework's data instead of from the Penn Treebank corpus. So how easy is it to get nltk to read in that new data...?
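
If the homework trees are in the usual bracketed (.mrg-style) format, it looks easy enough: nltk's BracketParseCorpusReader reads Penn-Treebank-style files from an arbitrary directory. The path and file pattern below are hypothetical placeholders for wherever the assignment data actually lives:

import nltk
from nltk.corpus.reader import BracketParseCorpusReader

# Hypothetical location of the homework's bracketed training trees.
reader = BracketParseCorpusReader('pa6/data', r'.*\.mrg')

productions = []
for tree in reader.parsed_sents():
    productions += tree.productions()

# Then the same induce-and-parse recipe as before applies.
grammar = nltk.induce_pcfg(nltk.Nonterminal('S'), productions)
parser = nltk.ViterbiParser(grammar)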

More Problems with Statistical NLP | 2 comments (2 topical, 0 editorial, 0 hidden)
what is the point of redoing (none / 0) (#1)
by mumble on Sat May 05, 2012 at 12:17:25 PM EST

"solved problems"?

It is more effective to learn things if you nut it out yourself than if you are just given the answer. An education is not just about knowing stuff; it is about being taught how to work things out for yourself, from the stuff you already know.

E.g., being given weekly assignment problems even though they count for nothing toward your final grade. You don't do them for the grade; you do them to learn how to solve problems by yourself.

-----
stats for a better tomorrow
mumble lang on github
mumble lang blog
collected blog posts

You mean major in Performing Arts ? (none / 0) (#2)
by sye on Mon May 07, 2012 at 06:26:17 AM EST

Their role is to train and evaluate; they leave it to others to make the tool useful. But what if they emphasized building interesting programs that we can easily customize to suit our whims, from the start?

~~~~~~~~~~~~~~~~~~~~~~~
commentary - For a better sye@K5
~~~~~~~~~~~~~~~~~~~~~~~
ripple me ~~> ~allthingsgo: gateway to Garden of Perfect Brightness in crypto-cash
rubbing u ~~> ~procrasti: getaway to HE'LL
Hey! at least he was in a stable relationship. - procrasti
Enter K5 via my lair

