{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Distributed Database\n",
    "\n",
    "\n",
    "\n",
    "## Feng Li\n",
    "\n",
    "### Central University of Finance and Economics\n",
    "\n",
    "### [feng.li@cufe.edu.cn](feng.li@cufe.edu.cn)\n",
    "### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Why Hive?\n",
    "\n",
    "- Many of those low-level details are actually quite repetitive from one job to the next, from low-level chores like wiring together Mappers and Reducers to certain data manipulation constructs, like filtering for just the data you want and performing SQL-like joins on data sets.\n",
    "\n",
    "- Hive not only provides a familiar programming model for people who know SQL, it also eliminates lots of boilerplate and sometimes-tricky coding you would have to do in Hadoop."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![Hive-Modules](./figures/hive-modules.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## How does Hive work?\n",
    "\n",
    "\n",
    "- When MapReduce jobs are required, Hive doesn't generate Java MapReduce programs.    \n",
    "\n",
    "- Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the **job plan**\n",
    "\n",
    "- In other words, these generic modules function like mini language interpreters and the **language** to drive the computation is encoded in XML."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Hive Batch (old API)\n",
    "\n",
    "\n",
    "- Run hive commands from the termial\n",
    "\n",
    "    $ hive -e \"dfs -ls /;\"\n",
    "    \n",
    "- Run Hive scripts from the termimal\n",
    "\n",
    "    $ hive -f /path/to/file/withqueries.hql"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Hive Interactive (old API)\n",
    "\n",
    "- Start Hive from a Terminal\n",
    "\n",
    "    $ hive\n",
    "\n",
    "- Execute command within Hive \n",
    "\n",
    "    hive> dfs -ls /;\n",
    "    \n",
    "- Exit Hive\n",
    "\n",
    "    hive> exit;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Beeline client with HiveServer2 (new interactive API)\n",
    "\n",
    "\n",
    "- Beeline is a Hive client for running Hive query and it also works with HiveQL file.\n",
    "\n",
    "\n",
    "- Beeline uses JDBC to connect to HiveServer2.\n",
    "\n",
    "\n",
    "      beeline -u jdbc:hive2://$HIVESERVER2_HOST:$HIVESERVER2_PORT\n",
    "  \n",
    "- For a recent hive server, the address and port could be\n",
    "\n",
    "      beeline -u jdbc:hive2://master:10000\n",
    "     "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Beeline client (batch mode)\n",
    "\n",
    "\n",
    "- Beeline executes an query\n",
    "\n",
    "        beeline -u jdbc:hive2://master:10000 -e 'dfs -ls /;'\n",
    "\n",
    "- Beeline executes a script.\n",
    "\n",
    "        beeline -u jdbc:hive2://master:10000 -f test.hql\n",
    "        \n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>DFS Output</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Found 7 items</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>drwxr-x--x   - hadoop    hadoop          0 201...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>drwxrwxrwx   - flowagent hadoop          0 201...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>drwxr-x--x   - student   hadoop          0 201...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>drwxr-x--x   - hadoop    hadoop          0 201...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>drwxrwxrwx   - root      hadoop          0 201...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>drwxr-x--t   - hadoop    hadoop          0 201...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>-rw-r-----   2 student   hadoop   19904316 201...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ]
     },
     "execution_count": 2,
     "metadata": {
      "image/png": {
       "height": 480,
       "width": 640
      }
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfs -ls /;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Hive with Database"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Show Databases"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>database_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>default</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>mydb</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ]
     },
     "execution_count": 3,
     "metadata": {
      "image/png": {
       "height": 480,
       "width": 640
      }
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "SHOW DATABASES;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>database_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>default</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ]
     },
     "execution_count": 4,
     "metadata": {
      "image/png": {
       "height": 480,
       "width": 640
      }
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "SHOW DATABASES Like 'd*';"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "- Create a Database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Table created!"
     ]
    }
   ],
   "source": [
    "CREATE DATABASE IF NOT EXISTS mydb;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Table created!"
     ]
    }
   ],
   "source": [
    "CREATE DATABASE IF NOT EXISTS financials LOCATION '/user/lifeng/hive';"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>database_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>default</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>financials</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>mydb</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ]
     },
     "execution_count": 7,
     "metadata": {
      "image/png": {
       "height": 480,
       "width": 640
      }
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "SHOW DATABASES;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "- Drop a database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "DROP DATABASE IF EXISTS financials;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>database_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>default</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>mydb</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ]
     },
     "execution_count": 9,
     "metadata": {
      "image/png": {
       "height": 480,
       "width": 640
      }
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "SHOW DATABASES;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "- Use some database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "USE mydb;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "- Create a table within the database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Table created!"
     ]
    }
   ],
   "source": [
    "CREATE TABLE IF NOT EXISTS mydb.employees (\n",
    "    name\n",
    "        STRING COMMENT 'Employee name',\n",
    "    salary\n",
    "        FLOAT COMMENT 'Employee salary',\n",
    "    subordinates ARRAY<STRING> COMMENT 'Names of subordinates',\n",
    "    deductions MAP<STRING, FLOAT>\n",
    "        COMMENT 'Keys are deductions names, values are percentages',\n",
    "    address\n",
    "        STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>\n",
    "        COMMENT 'Home address')\n",
    "COMMENT 'Description of the table'\n",
    "TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tab_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>employees</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ]
     },
     "execution_count": 14,
     "metadata": {
      "image/png": {
       "height": 480,
       "width": 640
      }
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "SHOW TABLES;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "- Create an external table\n",
    "\n",
    "Assume we have a data file `stocks.txt` located in HDFS at `/user/lifeng/data`, we could connect it with Hive as an external table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Table created!"
     ]
    }
   ],
   "source": [
    "create external table if not exists stocks (\n",
    "    symbol string, \n",
    "    ymd string, \n",
    "    price_open float, \n",
    "    price_high float, \n",
    "    price_low float, \n",
    "    price_close float, \n",
    "    volume int, \n",
    "    price_adj_close float )\n",
    "row format delimited fields terminated by ',' \n",
    "location '/user/lifeng/data'; "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "If you have an external file with a header/footer, you could exclude it with \n",
    "\n",
    "\n",
    "    TBLPROPERTIES('skip.header.line.count'='1', 'skip.footer.line.count'='2');\n",
    "    \n",
    "when you create the table."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "- Basic Statistics with Hive"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>_c0</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>69.363</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ]
     },
     "execution_count": 22,
     "metadata": {
      "image/png": {
       "height": 480,
       "width": 640
      }
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "SELECT avg(price_close) FROM stocks WHERE symbol = 'AAPL';"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Alternatives to Hive"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Pig\n",
    "\n",
    "- Suppose you have one or more sources of input data and you need to perform a complex set of transformations to generate one or more collections of output data.\n",
    "\n",
    "- Pig is described as a data flow language, rather than a query language. In Pig, you write a series of declarative statements that define relations from other relations, where each new relation performs some new data transformation. Pig looks at these declarations and then **builds up a sequence of MapReduce jobs** to perform the transformations until the final results are computed the way that you want.\n",
    "\n",
    "- A drawback of Pig is that it uses a custom language not based on SQL. \n",
    "\n",
    "- See th Pig home page https://pig.apache.org/ for more information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## HBase\n",
    "\n",
    "- HBase is inspired by Google’s Big Table.\n",
    "\n",
    "- It provides distributed and scalable data store that supports row-level updates, rapid queries, and row-level transactions (but not multirow transactions)\n",
    "\n",
    "- HBase uses HDFS for durable file storage of data.\n",
    "\n",
    "- HBase also uses in-memory caching of data.\n",
    "\n",
    "- HBase doesn’t provide a query language like SQL, but Hive is now integrated with HBase.\n",
    "\n",
    "- See the HBase homepage https://hbase.apache.org/ for more information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Lab\n",
    "\n",
    "- Create an external table with Hive for the data `airdelay_small.csv` or `used_cars_data_small.csv`\n",
    "\n",
    "- Use the [Hive internal functions](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) to do basic statistic as we had with Hadoop."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## External Reading\n",
    "\n",
    "- [Hive User Documentation with Beeline](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients)\n",
    "    \n",
    "- Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive: Data warehouse and query language for Hadoop. ” O’Reilly Media, Inc.”, 2012."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.17"
  },
  "rise": {
   "auto_select": "first",
   "autolaunch": false,
   "chalkboard": {
    "chalkEffect": 1,
    "chalkWidth": 4,
    "theme": "whiteboard",
    "transition": 800
   },
   "enable_chalkboard": true,
   "reveal_shortcuts": {
    "chalkboard": {
     "clear": "ctrl-k"
    }
   },
   "start_slideshow_at": "selected",
   "theme": "white"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}