{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# word2vec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is equivalent to `demo-word.sh`, `demo-analogy.sh`, `demo-phrases.sh` and `demo-classes.sh` from Google.\n", "\n", "Related papers by Tomas Mikolov et al. from Google:\n", "\n", "- [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)\n", "- [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download some training data, for example [http://mattmahoney.net/dc/text8.zip](http://mattmahoney.net/dc/text8.zip), and unzip it." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import word2vec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run `word2phrase` to join words that frequently occur together into a single token, for example \"Los Angeles\" becomes \"Los_Angeles\"." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting training using file text8\n", "Words processed: 17000K Vocab size: 4399K Vocab size: 3104K \n", "Vocab size (unigrams + bigrams): 2419827\n", "Words in train file: 17005206\n", "Words written: 17000K\r" ] } ], "source": [ "word2vec.word2phrase('text8', 'text8-phrases', verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This created a `text8-phrases` file that we can use as a better input for `word2vec`.\n", "Note that you can skip the previous step and use the raw text data as input for `word2vec` directly.\n", "\n", "Now train the word2vec model." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting training using file text8-phrases\n", "Vocab size: 98331\n", "Words in train file: 15857306\n", "Alpha: 0.000002 Progress: 100.03% Words/thread/sec: 243.11k " ] } ], "source": [ "word2vec.word2vec('text8-phrases', 'text8.bin', size=100, verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That created a `text8.bin` file containing the word vectors in a binary format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we generate clusters of the word vectors. As the training log in the next cell shows, `word2clusters` trains on the `text8` file directly." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting training using file /Users/drodriguez/Downloads/text8\n", "Vocab size: 71291\n", "Words in train file: 16718843\n", "Alpha: 0.000002 Progress: 100.04% Words/thread/sec: 317.72k " ] } ], "source": [ "word2vec.word2clusters('text8', 'text8-clusters.txt', 100, verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That created a `text8-clusters.txt` file with the cluster for every word in the vocabulary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predictions" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import word2vec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the `word2vec` binary file created above." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "model = word2vec.load('text8.bin')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can take a look at the vocabulary as a NumPy array." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { 
"data": { "text/plain": [ "array(['', 'the', 'of', ..., 'dakotas', 'nias', 'burlesques'],\n", " dtype='', 'the', 'of', ..., 'bredon', 'skirting', 'santamaria'],\n", " dtype='