{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# TF-IDF\n", "\n", "In information retrieval, **tf–idf** or **TFIDF**, short for **term frequency–inverse document frequency**, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.\n", "\n", "It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. \n", "\n", "Tf-idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf-idf.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Term frequency\n", "\n", "Suppose we have a set of text documents and wish to rank which document is most relevant to the query, \"the brown cow\". \n", "\n", "- A simple way to start out is by eliminating documents that do not contain all three words \"the\", \"brown\", and \"cow\", but this still leaves many documents. \n", "\n", "- To further distinguish them, we might count the number of times each term occurs in each document; the number of times a term occurs in a document is called its **term frequency**. However, in the case where the length of documents varies greatly, adjustments are often made (see definition below). The first form of term weighting is due to Hans Peter Luhn (1957) which may be summarized as:\n", "\n", " The weight of a term that occurs in a document is simply proportional to the term frequency." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In the case of the term frequency $tf(t,d)$, the simplest choice is to use the raw count of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw count by ft,d, then the simplest tf scheme is $tf(t,d) = f_{t,d}$. Other possibilities include[5]:128\n", "\n", "- **Boolean \"frequencies\"**: $tf(t,d) = 1$ if t occurs in d and 0 otherwise;\n", "\n", "- **term frequency adjusted for document length** : $f_{t,d} ÷$ (number of words in d)\n", "\n", "- **logarithmically scaled frequency**: $tf(t,d) = log (1 + f_{t,d})$\n", "\n", "- **augmented frequency**, to prevent a bias towards longer documents, e.g. raw frequency divided by the raw frequency of the most occurring term in the document:\n", "$$\n", "{\\displaystyle \\mathrm {tf} (t,d)=0.5+0.5\\cdot {\\frac {f_{t,d}}{\\max\\{f_{t',d}:t'\\in d\\}}}} \n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Inverse document frequency\n", "\n", "Because the term \"the\" is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word \"the\" more frequently, without giving enough weight to the more meaningful terms \"brown\" and \"cow\". The term \"the\" is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less-common words \"brown\" and \"cow\". Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.\n", "\n", "Karen Spärck Jones (1972) conceived a statistical interpretation of term specificity called **Inverse Document Frequency (IDF)**, which became a cornerstone of term weighting:\n", "\n", " The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.\n", "$$\n", "{\\displaystyle \\mathrm {idf} (t,D)=\\log {\\frac {N}{|\\{d\\in D:t\\in d\\}|}}}$$\n", "with\n", "\n", "- ${\\displaystyle N}$: total number of documents in the corpus ${\\displaystyle N={|D|}}$\n", "\n", "- ${\\displaystyle |\\{d\\in D:t\\in d\\}|}$ : number of documents where the term ${\\displaystyle t}$ appears (i.e., ${\\displaystyle \\mathrm {tf} (t,d)\\neq 0}$). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator to ${\\displaystyle 1+|\\{d\\in D:t\\in d\\}|}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Term frequency–Inverse document frequency\n", "\n", "Then tf–idf is calculated as\n", "\n", "$${\\displaystyle \\mathrm {tfidf} (t,d,D)=\\mathrm {tf} (t,d)\\cdot \\mathrm {idf} (t,D)}$$\n", "\n", "- A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; \n", "\n", "- The weights hence tend to filter out common terms.\n", "\n", "- Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. \n", "\n", "- As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Example\n", "\n", "Let find interesting insights into _How I met your mother_ from its transcripts and one technique that kept coming up is TF/IDF.\n", "\n", "Let's first build a corpus. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "209" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import defaultdict\n", "import csv\n", "\n", "episodes = defaultdict(list)\n", "with open(\"sentences.csv\", \"r\") as sentences_file:\n", " reader = csv.reader(sentences_file, delimiter=',')\n", " for row in reader:\n", " episodes[row[1]].append(row[4])\n", "\n", "for episode_id, text in episodes.items():\n", " episodes[episode_id] = \"\".join(text)\n", "\n", "corpus = []\n", "for id, episode in sorted(episodes.items(), key=lambda t: t[0]):\n", " corpus.append(episode)\n", " \n", "len(corpus)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The corpus contains 209 entries (1 per episode), each of which is a string containing the transcript of that episode. \n", "\n", "Next it's time to train our TF/IDF model via [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) in `scikit-learn` module which is only a few lines of code. The most interesting parameter here is `ngram_range` - we're telling it to generate 2 and 3 word phrases along with the single words from the corpus.\n", "\n", "e.g. if we had the sentence \"Python is cool\" we'd end up with 6 phrases - 'Python', 'is', 'cool', 'Python is', 'Python is cool' and 'is cool'.\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "498229\n" ] }, { "data": { "text/plain": [ "['00 does sound',\n", " '00 don',\n", " '00 don buy',\n", " '00 dressed',\n", " '00 dressed blond',\n", " '00 drunkenly',\n", " '00 drunkenly slurred',\n", " '00 fair',\n", " '00 fair tonight',\n", " '00 fall',\n", " '00 fall foliage',\n", " '00 far',\n", " '00 far impossible',\n", " '00 fart',\n", " '00 fart sure',\n", " '00 friends',\n", " '00 friends singing',\n", " '00 getting',\n", " '00 getting guys',\n", " '00 god']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf_matrix = tf.fit_transform(corpus)\n", "feature_names = tf.get_feature_names() \n", "print(len(feature_names))\n", "feature_names[50:70]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "So we're got nearly 500,000 phrases and if we look at tfidf_matrix we'd expect it to be a $209 \\times 498229$ matrix - one row per episode, one column per phrase:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "<209x498229 sparse matrix of type ''\n", "\twith 740366 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf_matrix" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This is what we've got although under the covers it's using a sparse representation to save space. Let's convert the matrix to dense format to explore further and find out why:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "498229" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dense = tfidf_matrix.todense()\n", "len(dense[0].tolist()[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The first value in each tuple is the phrase's position in our initial vector and also corresponds to the phrase's position in `feature_names` which allows us to map the scores back to phrases. Let's automate that lookup:\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ted 0.26332418150842635\n", "olives 0.1955650800833185\n", "marshall 0.1559959550807334\n", "yasmine 0.1521528557736922\n", "robin 0.13081226492371933\n", "barney 0.12479676406458672\n", "lily 0.12330259292854885\n", "signal 0.10372853594797252\n", "goanna 0.09807284390611527\n", "scene 0.09535105429648319\n", "cut 0.09180767406946594\n", "narrator 0.08652535148952796\n", "flashback 0.07836273694979438\n", "flashback date 0.07022439497247332\n", "ranjit 0.06937564856742862\n", "flashback date robin 0.05852032914372776\n", "ted yasmine 0.05852032914372776\n", "carl 0.05819191692744544\n", "eye patch 0.05432363335647736\n", "lebanese 0.05432363335647736\n" ] } ], "source": [ "episode = dense[0].tolist()[0]\n", "phrase_scores = [pair for pair in zip(range(0, len(episode)), episode) if pair[1] > 0]\n", "sorted_phrase_scores = sorted(phrase_scores, key=lambda t: t[1] * -1)\n", "for phrase, score in [(feature_names[word_id], score) \n", " for (word_id, score) in sorted_phrase_scores][:20]:\n", " print('{0: <20} {1}'.format(phrase, score))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's find some interesting phrase in this episode. The (!) sign indicates this is not python command. It envokes a shell command to do this quickly." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32m\u001b[K22\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[K21,1,1,1,\"Barney: (on the phone) hey, so you know how I've always had a thing for half-Asian girls? Well, now I've got a new favorite: \u001b[01;31m\u001b[KLebanese\u001b[m\u001b[K girls! \u001b[01;31m\u001b[KLebanese\u001b[m\u001b[K girls are the new half-Asians.\"\r", "\r\n", "\u001b[32m\u001b[K53\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[K52,1,1,1,\"Yasmine: Thanks, It's \u001b[01;31m\u001b[KLebanese\u001b[m\u001b[K.\"\r", "\r\n", "\u001b[32m\u001b[K97\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[K96,1,1,1,Barney: (ignoring) how does Carl land a \u001b[01;31m\u001b[KLebanese\u001b[m\u001b[K girl?\r", "\r\n", "\u001b[32m\u001b[K269\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[K268,1,1,1,\"Barney: So, (looks to the cabdriver) Ranjit... you must've done it with a \u001b[01;31m\u001b[KLebanese\u001b[m\u001b[K girl.\"\r", "\r\n", "\u001b[32m\u001b[K56381\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[K56380,178,8,18,\u001b[01;31m\u001b[KLebanese\u001b[m\u001b[K girls sprint to third base and then stay there?\r", "\r\n" ] } ], "source": [ "!grep -h -rni --color \"lebanese\" sentences.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For now we'll extract phrases for all episodes and write to CSV so we can explore more easily:\n" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing Document: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, " ] } ], "source": [ "with open(\"tfidf_scikit.csv\", \"w\") as file:\n", " writer = csv.writer(file, delimiter=\",\")\n", " writer.writerow([\"EpisodeId\", \"Phrase\", \"Score\"])\n", "\n", " doc_id = 0\n", " print(\"Writing Document:\", end = \" \")\n", " for doc in tfidf_matrix.todense():\n", " print(doc_id, end = \", \")\n", " word_id = 0\n", " for score in doc.tolist()[0]:\n", " if score > 0:\n", " word = feature_names[word_id]\n", " writer.writerow([doc_id+1, word, score])\n", " word_id +=1\n", " doc_id +=1" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "And finally a quick look at the contents of the CSV:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EpisodeId,Phrase,Score\r", "\r\n", "1,b'2005',0.007399486527942556\r", "\r\n", "1,b'2005 seven',0.010864726671295472\r", "\r\n", "1,b'2005 seven just',0.011704065828745553\r", "\r\n", "1,b'2030',0.010004059867402614\r", "\r\n", "1,b'2030 kids',0.004959593990284747\r", "\r\n", "1,b'2030 kids intently',0.011704065828745553\r", "\r\n", "1,b'2030 narrator',0.008372424623677193\r", "\r\n", "1,b'2030 narrator kids',0.009429866904361137\r", "\r\n", "1,b'2030 son',0.010864726671295472\r", "\r\n" ] } ], "source": [ "!head tfidf_scikit.csv" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5rc1" } }, "nbformat": 4, "nbformat_minor": 2 }