{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Distributed Database\n",
"\n",
"\n",
"\n",
"## Feng Li\n",
"\n",
"### Central University of Finance and Economics\n",
"\n",
    "### [feng.li@cufe.edu.cn](mailto:feng.li@cufe.edu.cn)\n",
"### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Why Hive?\n",
"\n",
    "- Many low-level details of writing MapReduce programs are repetitive from one job to the next, from chores like wiring together Mappers and Reducers to common data manipulation constructs, such as filtering for just the data you want and performing SQL-like joins on data sets.\n",
"\n",
"- Hive not only provides a familiar programming model for people who know SQL, it also eliminates lots of boilerplate and sometimes-tricky coding you would have to do in Hadoop."
]
},
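    {
    "cell_type": "markdown",
    "metadata": {
    "slideshow": {
    "slide_type": "fragment"
    }
    },
    "source": [
    "As an illustrative sketch (the table and column names here are hypothetical), a filter-plus-join that would take dozens of lines of Mapper and Reducer code in plain Hadoop is a single HiveQL statement:\n",
    "\n",
    "    SELECT o.order_id, c.name\n",
    "    FROM orders o JOIN customers c ON o.customer_id = c.id\n",
    "    WHERE o.amount > 100;"
    ]
    },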
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"![Hive-Modules](./figures/hive-modules.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## How does Hive work?\n",
"\n",
"\n",
"- When MapReduce jobs are required, Hive doesn't generate Java MapReduce programs. \n",
"\n",
    "- Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the **job plan**.\n",
"\n",
"- In other words, these generic modules function like mini language interpreters and the **language** to drive the computation is encoded in XML."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Hive Batch (old API)\n",
"\n",
"\n",
    "- Run Hive commands from the terminal\n",
"\n",
" $ hive -e \"dfs -ls /;\"\n",
" \n",
    "- Run Hive scripts from the terminal\n",
"\n",
" $ hive -f /path/to/file/withqueries.hql"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Hive Interactive (old API)\n",
"\n",
"- Start Hive from a Terminal\n",
"\n",
" $ hive\n",
"\n",
"- Execute command within Hive \n",
"\n",
" hive> dfs -ls /;\n",
" \n",
"- Exit Hive\n",
"\n",
" hive> exit;"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Beeline client with HiveServer2 (new interactive API)\n",
"\n",
"\n",
    "- Beeline is a Hive client for running Hive queries; it also works with HiveQL script files.\n",
"\n",
"\n",
"- Beeline uses JDBC to connect to HiveServer2.\n",
"\n",
"\n",
" beeline -u jdbc:hive2://$HIVESERVER2_HOST:$HIVESERVER2_PORT\n",
" \n",
    "- For a typical HiveServer2 deployment, the host and port could be\n",
"\n",
" beeline -u jdbc:hive2://master:10000\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Beeline client (batch mode)\n",
"\n",
"\n",
    "- Beeline executes a query\n",
"\n",
" beeline -u jdbc:hive2://master:10000 -e 'dfs -ls /;'\n",
"\n",
"- Beeline executes a script.\n",
"\n",
" beeline -u jdbc:hive2://master:10000 -f test.hql\n",
" \n",
" "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
    "text/plain": [
    "                                      DFS Output\n",
    "0                                  Found 7 items\n",
    "1  drwxr-x--x   - hadoop     hadoop         0 201...\n",
    "2  drwxrwxrwx   - flowagent  hadoop         0 201...\n",
    "3  drwxr-x--x   - student    hadoop         0 201...\n",
    "4  drwxr-x--x   - hadoop     hadoop         0 201...\n",
    "5  drwxrwxrwx   - root       hadoop         0 201...\n",
    "6  drwxr-x--t   - hadoop     hadoop         0 201...\n",
    "7  -rw-r-----   2 student    hadoop  19904316 201...\n"
]
},
"execution_count": 2,
"metadata": {
"image/png": {
"height": 480,
"width": 640
}
},
"output_type": "execute_result"
}
],
"source": [
"dfs -ls /;"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Hive with Database"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- Show Databases"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
    "text/plain": [
    "  database_name\n",
    "0       default\n",
    "1          mydb\n"
]
},
"execution_count": 3,
"metadata": {
"image/png": {
"height": 480,
"width": 640
}
},
"output_type": "execute_result"
}
],
"source": [
"SHOW DATABASES;"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
    "text/plain": [
    "  database_name\n",
    "0       default\n"
]
},
"execution_count": 4,
"metadata": {
"image/png": {
"height": 480,
"width": 640
}
},
"output_type": "execute_result"
}
],
"source": [
    "SHOW DATABASES LIKE 'd*';"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Create a Database"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Table created!"
]
}
],
"source": [
"CREATE DATABASE IF NOT EXISTS mydb;"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Table created!"
]
}
],
"source": [
"CREATE DATABASE IF NOT EXISTS financials LOCATION '/user/lifeng/hive';"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
    "text/plain": [
    "  database_name\n",
    "0       default\n",
    "1    financials\n",
    "2          mydb\n"
]
},
"execution_count": 7,
"metadata": {
"image/png": {
"height": 480,
"width": 640
}
},
"output_type": "execute_result"
}
],
"source": [
"SHOW DATABASES;"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Drop a database"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"DROP DATABASE IF EXISTS financials;"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
    "text/plain": [
    "  database_name\n",
    "0       default\n",
    "1          mydb\n"
]
},
"execution_count": 9,
"metadata": {
"image/png": {
"height": 480,
"width": 640
}
},
"output_type": "execute_result"
}
],
"source": [
"SHOW DATABASES;"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
    "- Switch to a database"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"USE mydb;"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Create a table within the database"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Table created!"
]
}
],
"source": [
    "CREATE TABLE IF NOT EXISTS mydb.employees (\n",
    "  name         STRING COMMENT 'Employee name',\n",
    "  salary       FLOAT COMMENT 'Employee salary',\n",
    "  subordinates ARRAY<STRING> COMMENT 'Names of subordinates',\n",
    "  deductions   MAP<STRING, FLOAT>\n",
    "               COMMENT 'Keys are deductions names, values are percentages',\n",
    "  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>\n",
    "               COMMENT 'Home address')\n",
    "COMMENT 'Description of the table'\n",
    "TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00');"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
    "text/plain": [
    "    tab_name\n",
    "0  employees\n"
]
},
"execution_count": 14,
"metadata": {
"image/png": {
"height": 480,
"width": 640
}
},
"output_type": "execute_result"
}
],
"source": [
"SHOW TABLES;"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Create an external table\n",
"\n",
    "Assuming we have a data file `stocks.txt` located in HDFS at `/user/lifeng/data`, we can connect it to Hive as an external table."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Table created!"
]
}
],
"source": [
    "CREATE EXTERNAL TABLE IF NOT EXISTS stocks (\n",
    "    symbol STRING,\n",
    "    ymd STRING,\n",
    "    price_open FLOAT,\n",
    "    price_high FLOAT,\n",
    "    price_low FLOAT,\n",
    "    price_close FLOAT,\n",
    "    volume INT,\n",
    "    price_adj_close FLOAT)\n",
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n",
    "LOCATION '/user/lifeng/data';"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
    "If the external file has header or footer lines, you can exclude them by adding\n",
    "\n",
    "\n",
    "    TBLPROPERTIES('skip.header.line.count'='1', 'skip.footer.line.count'='2');\n",
    " \n",
    "when you create the table."
]
},
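    {
    "cell_type": "markdown",
    "metadata": {
    "slideshow": {
    "slide_type": "fragment"
    }
    },
    "source": [
    "For example, a CSV file with a one-line header could be declared like this (a sketch; the table name is hypothetical, and the skip count should match your file):\n",
    "\n",
    "    CREATE EXTERNAL TABLE IF NOT EXISTS stocks_csv (\n",
    "        symbol STRING,\n",
    "        ymd STRING,\n",
    "        price_close FLOAT)\n",
    "    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n",
    "    LOCATION '/user/lifeng/data'\n",
    "    TBLPROPERTIES ('skip.header.line.count'='1');"
    ]
    },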
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Basic Statistics with Hive"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
    "text/plain": [
    "      _c0\n",
    "0  69.363\n"
]
},
"execution_count": 22,
"metadata": {
"image/png": {
"height": 480,
"width": 640
}
},
"output_type": "execute_result"
}
],
"source": [
"SELECT avg(price_close) FROM stocks WHERE symbol = 'AAPL';"
]
},
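    {
    "cell_type": "markdown",
    "metadata": {
    "slideshow": {
    "slide_type": "fragment"
    }
    },
    "source": [
    "Other built-in aggregate functions work the same way. For instance, a per-symbol summary over the `stocks` table defined above (a sketch using standard HiveQL aggregates):\n",
    "\n",
    "    SELECT symbol,\n",
    "           count(*)         AS n_days,\n",
    "           min(price_close) AS lowest,\n",
    "           max(price_close) AS highest\n",
    "    FROM stocks\n",
    "    GROUP BY symbol;"
    ]
    },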
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Alternatives to Hive"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Pig\n",
"\n",
"- Suppose you have one or more sources of input data and you need to perform a complex set of transformations to generate one or more collections of output data.\n",
"\n",
"- Pig is described as a data flow language, rather than a query language. In Pig, you write a series of declarative statements that define relations from other relations, where each new relation performs some new data transformation. Pig looks at these declarations and then **builds up a sequence of MapReduce jobs** to perform the transformations until the final results are computed the way that you want.\n",
"\n",
"- A drawback of Pig is that it uses a custom language not based on SQL. \n",
"\n",
    "- See the Pig home page https://pig.apache.org/ for more information."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## HBase\n",
"\n",
"- HBase is inspired by Google’s Big Table.\n",
"\n",
    "- It provides a distributed and scalable data store that supports row-level updates, rapid queries, and row-level transactions (but not multirow transactions).\n",
"\n",
"- HBase uses HDFS for durable file storage of data.\n",
"\n",
"- HBase also uses in-memory caching of data.\n",
"\n",
"- HBase doesn’t provide a query language like SQL, but Hive is now integrated with HBase.\n",
"\n",
"- See the HBase homepage https://hbase.apache.org/ for more information."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Lab\n",
"\n",
"- Create an external table with Hive for the data `airdelay_small.csv` or `used_cars_data_small.csv`\n",
"\n",
    "- Use the [Hive internal functions](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) to compute basic statistics as we did with Hadoop."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## External Reading\n",
"\n",
"- [Hive User Documentation with Beeline](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients)\n",
" \n",
    "- Capriolo, Edward, Dean Wampler, and Jason Rutherglen. *Programming Hive: Data Warehouse and Query Language for Hadoop*. O'Reilly Media, Inc., 2012."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
},
"rise": {
"auto_select": "first",
"autolaunch": false,
"chalkboard": {
"chalkEffect": 1,
"chalkWidth": 4,
"theme": "whiteboard",
"transition": 800
},
"enable_chalkboard": true,
"reveal_shortcuts": {
"chalkboard": {
"clear": "ctrl-k"
}
},
"start_slideshow_at": "selected",
"theme": "white"
}
},
"nbformat": 4,
"nbformat_minor": 2
}