Wow, this is cool...
All our N-gram are belong to you
Basically, Google has released (this was in Aug 2006, btw) a massive store of data designed as a "training corpus". That is, it captures how words are used in context by collecting word sequences (n-grams) of up to 5 words, along with how often each one shows up. It's useful for lots of reasons, but one quick example (and the reason I found it) is spell checking (via mefi, as usual):
Imagine you've typed a word, and you run a spell checker. How do you know you haven't typed a real word that just happens to be the wrong one? That is, say you were going to say "you were going to say" but you typed "you where going to say" (like the recursion?)
Well, "where" is a correctly spelled word, so it will pass an ordinary spell check. But you can use this data to check the context and figure out that "where" is rarely used in that position... and then suggest an alternative based on other similarly spelled options (and statistical analysis).
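To make that a bit more concrete, here's a rough sketch of the idea in Python. To be clear, this is my own toy illustration, not how Google or any real spell checker does it: the counts in `NGRAM_COUNTS` are invented, the `CONFUSION_SETS` list is hypothetical, and I'm using 3-word windows instead of 5 to keep it short. The point is just that you swap in each similarly spelled candidate and keep the one whose surrounding n-grams are most common.

```python
from collections import Counter

# Toy stand-in for the real n-gram data. These counts are made up
# for illustration; the real corpus would supply actual frequencies.
NGRAM_COUNTS = Counter({
    ("you", "were", "going"): 1200,
    ("were", "going", "to"): 900,
    ("you", "where", "going"): 3,
    ("where", "going", "to"): 5,
    ("going", "to", "say"): 2000,
})

# Hypothetical groups of easily confused, correctly spelled words.
CONFUSION_SETS = [{"were", "where"}]


def candidates(word):
    """Return the confusion set containing word, or just {word}."""
    for group in CONFUSION_SETS:
        if word in group:
            return group
    return {word}


def context_score(tokens, i, candidate, n=3):
    """Sum the counts of every n-word window that covers position i,
    with candidate substituted for the word at that position."""
    swapped = tokens[:i] + [candidate] + tokens[i + 1:]
    total = 0
    first = max(0, i - n + 1)
    last = min(i, len(tokens) - n)
    for start in range(first, last + 1):
        total += NGRAM_COUNTS[tuple(swapped[start:start + n])]
    return total


def suggest(tokens):
    """Replace a word only when a confusable alternative fits the
    surrounding context strictly better than the original does."""
    fixed = []
    for i, word in enumerate(tokens):
        best, best_score = word, context_score(tokens, i, word)
        for cand in sorted(candidates(word) - {word}):
            score = context_score(tokens, i, cand)
            if score > best_score:
                best, best_score = cand, score
        fixed.append(best)
    return fixed


print(" ".join(suggest("you where going to say".split())))
```

A real version would also need some smoothing for n-grams that never appear in the data, and a smarter way of generating the "similarly spelled" candidates (edit distance, say), but the context lookup is the part this data set makes possible.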
Apparently, it's a 6-DVD collection of word chunks that can be readily analyzed.
Sadly, it costs $150 to obtain. I don't know how I feel about that. I'm not opposed to charging for data, especially when it comes to time and material costs. But that seems a bit exorbitant to me. And I know it's not really meant for Joe Blow... and most Joe Blows, if they really cared, would fork out the cash. It's just a slight harsh on my buzz, though....
Still, it's pretty cool that it's being released. Can you imagine MS doing such a thing? I can't.