{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# word2vec "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is equivalent to `demo-word.sh`, `demo-analogy.sh`, `demo-phrases.sh` and `demo-classes.sh` from Google.\n",
    "\n",
    "Related Papers by Tomas Mikolov et al from Google:\n",
    "\n",
    "- [Distributed Representations of Words and Phrases and their Compositionality]( https://arxiv.org/abs/1310.4546) \n",
    "- [Efficient Estimation of Word Representations in Vector Space\n",
    "](https://arxiv.org/abs/1301.3781)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download some data, for example: [http://mattmahoney.net/dc/text8.zip](http://mattmahoney.net/dc/text8.zip) and unzip it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import word2vec"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run `word2phrase` to group up similar words \"Los Angeles\" to \"Los_Angeles\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Starting training using file text8\n",
      "Words processed: 17000K     Vocab size: 4399K   Vocab size: 3104K  \n",
      "Vocab size (unigrams + bigrams): 2419827\n",
      "Words in train file: 17005206\n",
      "Words written: 17000K\r"
     ]
    }
   ],
   "source": [
    "word2vec.word2phrase('text8', 'text8-phrases', verbose=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This created a `text8-phrases` file that we can use as a better input for `word2vec`.\n",
    "Note that you could easily skip this previous step and use the text data as input for `word2vec` directly.\n",
    "\n",
    "Now actually train the word2vec model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Starting training using file text8-phrases\n",
      "Vocab size: 98331\n",
      "Words in train file: 15857306\n",
      "Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 243.11k  "
     ]
    }
   ],
   "source": [
    "word2vec.word2vec('text8-phrases', 'text8.bin', size=100, verbose=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That created a `text8.bin` file containing the word vectors in a binary format."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we generate the clusters of the vectors based on the trained model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Starting training using file /Users/drodriguez/Downloads/text8\n",
      "Vocab size: 71291\n",
      "Words in train file: 16718843\n",
      "Alpha: 0.000002  Progress: 100.04%  Words/thread/sec: 317.72k  "
     ]
    }
   ],
   "source": [
    "word2vec.word2clusters('text8', 'text8-clusters.txt', 100, verbose=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That created a `text8-clusters.txt` with the cluster for every word in the vocabulary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import word2vec"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import the `word2vec` binary file created above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = word2vec.load('text8.bin')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can take a look at the vocabulary as a numpy array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['</s>', 'the', 'of', ..., 'dakotas', 'nias', 'burlesques'],\n",
       "      dtype='<U78')"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.vocab"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or take a look at the whole matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(98331, 100)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.vectors.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,\n",
       "         0.10955409,  0.00693387],\n",
       "       [ 0.07306823,  0.1179086 ,  0.10995189, ...,  0.09345266,\n",
       "        -0.1312812 , -0.00915683],\n",
       "       [ 0.26229969,  0.02270839,  0.05854911, ...,  0.03924898,\n",
       "        -0.03867628,  0.21437503],\n",
       "       ...,\n",
       "       [-0.1427108 ,  0.10650002,  0.07283197, ...,  0.14563465,\n",
       "        -0.06967127,  0.037186  ],\n",
       "       [ 0.06538665, -0.04184594,  0.13385373, ...,  0.08183857,\n",
       "        -0.07006828, -0.09386028],\n",
       "       [-0.00991228, -0.12096601,  0.10771658, ...,  0.01684521,\n",
       "        -0.143217  , -0.10602982]])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.vectors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can retreive the vector of individual words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(100,)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model['dog'].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.06666815,  0.12450022,  0.02513653,  0.12673911,  0.13396765,\n",
       "       -0.00938436,  0.06476378,  0.15387769,  0.05472341, -0.08388881])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model['dog'][:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can calculate the distance between two or more (all combinations) words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('dog', 'cat', 0.8693732680572173),\n",
       " ('dog', 'fish', 0.5900484800297155),\n",
       " ('cat', 'fish', 0.6269017149314428)]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.distance(\"dog\", \"cat\", \"fish\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Similarity\n",
    "\n",
    "We can do simple queries to retreive words similar to \"socks\" based on cosine similarity:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([ 2437,  5478,  7593, 10230,  3964,  9963,  2428, 10309,  4812,\n",
       "         2391]),\n",
       " array([0.86937327, 0.83396105, 0.77854628, 0.7692265 , 0.76743628,\n",
       "        0.7612772 , 0.7600788 , 0.75935677, 0.75693881, 0.75438956]))"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "indexes, metrics = model.similar(\"dog\")\n",
    "indexes, metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This returned a tuple with 2 items:\n",
    "1. numpy array with the indexes of the similar words in the vocabulary\n",
    "2. numpy array with cosine similarity to each word"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get the words for those indexes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['cat', 'cow', 'goat', 'pig', 'dogs', 'rabbit', 'bear', 'rat',\n",
       "       'wolf', 'girl'], dtype='<U78')"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.vocab[indexes]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a helper function to create a combined response as a numpy [record array](http://docs.scipy.org/doc/numpy/user/basics.rec.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "rec.array([('cat', 0.86937327), ('cow', 0.83396105), ('goat', 0.77854628),\n",
       "           ('pig', 0.7692265 ), ('dogs', 0.76743628),\n",
       "           ('rabbit', 0.7612772 ), ('bear', 0.7600788 ),\n",
       "           ('rat', 0.75935677), ('wolf', 0.75693881),\n",
       "           ('girl', 0.75438956)],\n",
       "          dtype=[('word', '<U78'), ('metric', '<f8')])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.generate_response(indexes, metrics)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Is easy to make that numpy array a pure python response:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('cat', 0.8693732680572173),\n",
       " ('cow', 0.8339610529888226),\n",
       " ('goat', 0.7785462766666428),\n",
       " ('pig', 0.7692265048531302),\n",
       " ('dogs', 0.7674362783482181),\n",
       " ('rabbit', 0.7612771996422674),\n",
       " ('bear', 0.7600788045286304),\n",
       " ('rat', 0.7593567655129181),\n",
       " ('wolf', 0.7569388070301634),\n",
       " ('girl', 0.754389556345068)]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.generate_response(indexes, metrics).tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Phrases"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we trained the model with the output of `word2phrase` we can ask for similarity of \"phrases\", basically compained words such as \"Los Angeles\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('san_francisco', 0.8876351265573288),\n",
       " ('san_diego', 0.8652920422732189),\n",
       " ('seattle', 0.8387625165949533),\n",
       " ('las_vegas', 0.8325965377422355),\n",
       " ('california', 0.8252775393303263),\n",
       " ('miami', 0.8167069457881345),\n",
       " ('detroit', 0.8164911899252103),\n",
       " ('chicago', 0.813283620659967),\n",
       " ('cincinnati', 0.8116379669114295),\n",
       " ('cleveland', 0.810708205429068)]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "indexes, metrics = model.similar('los_angeles')\n",
    "model.generate_response(indexes, metrics).tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analogies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Its possible to do more complex queries like analogies such as: `king - man + woman = queen` \n",
    "This method returns the same as `cosine` the indexes of the words in the vocab and the metric"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([1087, 6768, 1145, 7523, 1335, 8419, 3141, 1827,  344, 4980]),\n",
       " array([0.28823424, 0.26614362, 0.26265608, 0.26111525, 0.26091172,\n",
       "        0.25844542, 0.25781944, 0.25678284, 0.25424551, 0.2529607 ]))"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'])\n",
    "indexes, metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('queen', 0.28823424120681784),\n",
       " ('regent', 0.26614361576778933),\n",
       " ('prince', 0.2626560787162791),\n",
       " ('empress', 0.2611152451318436),\n",
       " ('wife', 0.26091172315990346),\n",
       " ('aragon', 0.25844541581050506),\n",
       " ('monarch', 0.25781944140528035),\n",
       " ('throne', 0.256782835877586),\n",
       " ('son', 0.25424550637754495),\n",
       " ('heir', 0.25296070456687614)]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.generate_response(indexes, metrics).tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clusters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "clusters = word2vec.load_clusters('/Users/drodriguez/Downloads/text8-clusters.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see get the cluster number for individual words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['</s>', 'the', 'of', ..., 'bredon', 'skirting', 'santamaria'],\n",
       "      dtype='<U29')"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clusters.vocab"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see get all the words grouped on an specific cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(206,)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clusters.get_words_on_cluster(90).shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['along', 'associated', 'relations', 'relationship', 'deal',\n",
       "       'combined', 'contact', 'connection', 'respect', 'mixed'],\n",
       "      dtype='<U29')"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clusters.get_words_on_cluster(90)[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can add the clusters to the word2vec model and generate a response that includes the clusters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.clusters = clusters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "indexes, metrics = model.analogy(pos=[\"paris\", \"germany\"], neg=[\"france\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('berlin', 0.3187078682472152, 15),\n",
       " ('vienna', 0.28562803640143397, 12),\n",
       " ('munich', 0.28527806428082675, 21),\n",
       " ('moscow', 0.27085681100243797, 74),\n",
       " ('leipzig', 0.2697639527846636, 8),\n",
       " ('st_petersburg', 0.25841328545046965, 61),\n",
       " ('prague', 0.2571333430942206, 72),\n",
       " ('bonn', 0.2546126113385251, 8),\n",
       " ('dresden', 0.2471285069069249, 71),\n",
       " ('warsaw', 0.2450778083401204, 74)]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.generate_response(indexes, metrics).tolist()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3rc1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}