{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Distributed Database\n", "\n", "\n", "\n", "## Feng Li\n", "\n", "### Central University of Finance and Economics\n", "\n", "### [feng.li@cufe.edu.cn](feng.li@cufe.edu.cn)\n", "### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Why Hive?\n", "\n", "- Many of those low-level details are actually quite repetitive from one job to the next, from low-level chores like wiring together Mappers and Reducers to certain data manipulation constructs, like filtering for just the data you want and performing SQL-like joins on data sets.\n", "\n", "- Hive not only provides a familiar programming model for people who know SQL, it also eliminates lots of boilerplate and sometimes-tricky coding you would have to do in Hadoop." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![Hive-Modules](./figures/hive-modules.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## How does Hive work?\n", "\n", "\n", "- When MapReduce jobs are required, Hive doesn't generate Java MapReduce programs. \n", "\n", "- Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the **job plan**\n", "\n", "- In other words, these generic modules function like mini language interpreters and the **language** to drive the computation is encoded in XML." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Hive Batch (old API)\n", "\n", "\n", "- Run hive commands from the termial\n", "\n", " $ hive -e \"dfs -ls /;\"\n", " \n", "- Run Hive scripts from the termimal\n", "\n", " $ hive -f /path/to/file/withqueries.hql" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Hive Interactive (old API)\n", "\n", "- Start Hive from a Terminal\n", "\n", " $ hive\n", "\n", "- Execute command within Hive \n", "\n", " hive> dfs -ls /;\n", " \n", "- Exit Hive\n", "\n", " hive> exit;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Beeline client with HiveServer2 (new interactive API)\n", "\n", "\n", "- Beeline is a Hive client for running Hive query and it also works with HiveQL file.\n", "\n", "\n", "- Beeline uses JDBC to connect to HiveServer2.\n", "\n", "\n", " beeline -u jdbc:hive2://$HIVESERVER2_HOST:$HIVESERVER2_PORT\n", " \n", "- For a recent hive server, the address and port could be\n", "\n", " beeline -u jdbc:hive2://master:10000\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Beeline client (batch mode)\n", "\n", "\n", "- Beeline executes an query\n", "\n", " beeline -u jdbc:hive2://master:10000 -e 'dfs -ls /;'\n", "\n", "- Beeline executes a script.\n", "\n", " beeline -u jdbc:hive2://master:10000 -f test.hql\n", " \n", " " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DFS Output
0Found 7 items
1drwxr-x--x - hadoop hadoop 0 201...
2drwxrwxrwx - flowagent hadoop 0 201...
3drwxr-x--x - student hadoop 0 201...
4drwxr-x--x - hadoop hadoop 0 201...
5drwxrwxrwx - root hadoop 0 201...
6drwxr-x--t - hadoop hadoop 0 201...
7-rw-r----- 2 student hadoop 19904316 201...
" ] }, "execution_count": 2, "metadata": { "image/png": { "height": 480, "width": 640 } }, "output_type": "execute_result" } ], "source": [ "dfs -ls /;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Hive with Database" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Show Databases" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
database_name
0default
1mydb
" ] }, "execution_count": 3, "metadata": { "image/png": { "height": 480, "width": 640 } }, "output_type": "execute_result" } ], "source": [ "SHOW DATABASES;" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
database_name
0default
" ] }, "execution_count": 4, "metadata": { "image/png": { "height": 480, "width": 640 } }, "output_type": "execute_result" } ], "source": [ "SHOW DATABASES Like 'd*';" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Create a Database" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Table created!" ] } ], "source": [ "CREATE DATABASE IF NOT EXISTS mydb;" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Table created!" ] } ], "source": [ "CREATE DATABASE IF NOT EXISTS financials LOCATION '/user/lifeng/hive';" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
database_name
0default
1financials
2mydb
" ] }, "execution_count": 7, "metadata": { "image/png": { "height": 480, "width": 640 } }, "output_type": "execute_result" } ], "source": [ "SHOW DATABASES;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Drop a database" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "DROP DATABASE IF EXISTS financials;" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
database_name
0default
1mydb
" ] }, "execution_count": 9, "metadata": { "image/png": { "height": 480, "width": 640 } }, "output_type": "execute_result" } ], "source": [ "SHOW DATABASES;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Use some database" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "USE mydb;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Create a table within the database" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Table created!" ] } ], "source": [ "CREATE TABLE IF NOT EXISTS mydb.employees (\n", " name\n", " STRING COMMENT 'Employee name',\n", " salary\n", " FLOAT COMMENT 'Employee salary',\n", " subordinates ARRAY COMMENT 'Names of subordinates',\n", " deductions MAP\n", " COMMENT 'Keys are deductions names, values are percentages',\n", " address\n", " STRUCT\n", " COMMENT 'Home address')\n", "COMMENT 'Description of the table'\n", "TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00');" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tab_name
0employees
" ] }, "execution_count": 14, "metadata": { "image/png": { "height": 480, "width": 640 } }, "output_type": "execute_result" } ], "source": [ "SHOW TABLES;" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Create an external table\n", "\n", "Assume we have a data file `stocks.txt` located in HDFS at `/user/lifeng/data`, we could connect it with Hive as an external table." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Table created!" ] } ], "source": [ "create external table if not exists stocks (\n", " symbol string, \n", " ymd string, \n", " price_open float, \n", " price_high float, \n", " price_low float, \n", " price_close float, \n", " volume int, \n", " price_adj_close float )\n", "row format delimited fields terminated by ',' \n", "location '/user/lifeng/data'; " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "If you have an external file with a header/footer, you could exclude it with \n", "\n", "\n", " TBLPROPERTIES('skip.header.line.count'='1', 'skip.footer.line.count'='2');\n", " \n", "when you create the table." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Basic Statistics with Hive" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_c0
069.363
" ] }, "execution_count": 22, "metadata": { "image/png": { "height": 480, "width": 640 } }, "output_type": "execute_result" } ], "source": [ "SELECT avg(price_close) FROM stocks WHERE symbol = 'AAPL';" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Alternatives to Hive" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pig\n", "\n", "- Suppose you have one or more sources of input data and you need to perform a complex set of transformations to generate one or more collections of output data.\n", "\n", "- Pig is described as a data flow language, rather than a query language. In Pig, you write a series of declarative statements that define relations from other relations, where each new relation performs some new data transformation. Pig looks at these declarations and then **builds up a sequence of MapReduce jobs** to perform the transformations until the final results are computed the way that you want.\n", "\n", "- A drawback of Pig is that it uses a custom language not based on SQL. \n", "\n", "- See th Pig home page https://pig.apache.org/ for more information." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## HBase\n", "\n", "- HBase is inspired by Google’s Big Table.\n", "\n", "- It provides distributed and scalable data store that supports row-level updates, rapid queries, and row-level transactions (but not multirow transactions)\n", "\n", "- HBase uses HDFS for durable file storage of data.\n", "\n", "- HBase also uses in-memory caching of data.\n", "\n", "- HBase doesn’t provide a query language like SQL, but Hive is now integrated with HBase.\n", "\n", "- See the HBase homepage https://hbase.apache.org/ for more information." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Lab\n", "\n", "- Create an external table with Hive for the data `airdelay_small.csv` or `used_cars_data_small.csv`\n", "\n", "- Use the [Hive internal functions](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) to do basic statistic as we had with Hadoop." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## External Reading\n", "\n", "- [Hive User Documentation with Beeline](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients)\n", " \n", "- Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive: Data warehouse and query language for Hadoop. ” O’Reilly Media, Inc.”, 2012." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.17" }, "rise": { "auto_select": "first", "autolaunch": false, "chalkboard": { "chalkEffect": 1, "chalkWidth": 4, "theme": "whiteboard", "transition": 800 }, "enable_chalkboard": true, "reveal_shortcuts": { "chalkboard": { "clear": "ctrl-k" } }, "start_slideshow_at": "selected", "theme": "white" } }, "nbformat": 4, "nbformat_minor": 2 }