{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Introduction to Hadoop\n",
    "## Feng Li\n",
    "\n",
    "### Central University of Finance and Economics\n",
    "\n",
    "### [feng.li@cufe.edu.cn](feng.li@cufe.edu.cn)\n",
    "### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## What is Hadoop?\n",
    "\n",
    "- Hadoop is a platform that provides both distributed storage and computational capabilities.\n",
    "\n",
    "- Hadoop is a distributed **master-worker** architecture consists of the **Hadoop Distributed File System (HDFS)** for storage and **MapReduce** for computational capabilities."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## A Brief History of Hadoop\n",
    "\n",
    "- Hadoop was created by Doug Cutting.\n",
    "\n",
    "- At the time Google had published papers that described its novel distributed filesystem, the Google File System ( GFS ), and MapReduce, a computational framework for parallel processing.\n",
    "\n",
    "- The successful implementation of these papers’ concepts resulted in the Hadoop project.\n",
    "\n",
    "- Who use Hadoop?\n",
    "\n",
    "    - Facebook uses Hadoop, Hive, and HB ase for data warehousing and real-time application serving.\n",
    "    - Twitter uses Hadoop, Pig, and HB ase for data analysis, visualization, social graph analysis, and machine learning.\n",
    "    - Yahoo! uses Hadoop for data analytics, machine learning, search ranking, email antispam, ad optimization...\n",
    "    - eBay, Samsung, Rackspace, J.P. Morgan, Groupon, LinkedIn, AOL , Last.fm..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "**But we (statisticians, financial analysts) are not yet there!**\n",
    "\n",
    "--- and we should!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![hadoop-architecture](./figures/hadoop-architecture.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Core Hadoop components: HDFS \n",
    "\n",
    "- HDFS is the storage component of Hadoop\n",
    "\n",
    "- It’s a distributed file system.\n",
    "- Logical representation of the components in HDFS : the **NameNode** and the **DataNode**.\n",
    "\n",
    "- HDFS replicates files for a configured number of times, is tolerant of both software and hardware failure, and automatically re-replicates data blocks on nodes that have failed.\n",
    "\n",
    "- HDFS isn’t designed to work well with random reads over small files due to its optimization for sustained throughput."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![mapreduce](./figures/hdfs.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Core Hadoop components: MapReduce\n",
    "\n",
    "\n",
    "- MapReduce is a batch-based, distributed computing framework.\n",
    "- It allows you to parallelize work over a large amount of raw data.\n",
    "- This type of work, which could take days or longer using conventional serial programming techniques, can be reduced down to minutes using MapReduce on a Hadoop cluster.\n",
    "- MapReduce allows the programmer to focus on addressing business needs, rather than getting tangled up in distributed system complications.\n",
    "- MapReduce doesn’t lend itself to use cases that need real-time data access.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![mapreduce](./figures/mapreduce-architecture.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## The building blocks of Hadoop\n",
    "\n",
    "- NameNode\n",
    "- DataNode\n",
    "- Secondary NameNode\n",
    "- JobTracker\n",
    "- TaskTracker"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## NameNode\n",
    "\n",
    "- The NameNode is the master of HDFS that directs the worker DataNode daemons to perform the low-level I/O tasks.\n",
    "\n",
    "- The NameNode keeps track of how your fi les are broken down into fi le blocks, which nodes store those blocks, and the overall health of the distributed file system.\n",
    "\n",
    "- The NameNode is a single point of failure of your Hadoop cluster"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Secondary NameNode\n",
    "\n",
    "- An assistant daemon for monitoring the state of the cluster HDFS.\n",
    "\n",
    "- Each cluster has one Secondary NameNode.\n",
    "\n",
    "- The secondary NameNode snapshots help minimize the downtime and loss of data due to the failure of NameNode"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## DataNode\n",
    "\n",
    "- Each worker machine in your cluster will host a DataNode daemon to perform the grunt work of the distributed file system -- reading and writing HDFS blocks to actual files on the local file system.\n",
    "\n",
    "- DataNodes are constantly reporting to the NameNode.\n",
    "\n",
    "- Each of the DataNodes informs the NameNode of the blocks it’s currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes as well as receive instructions to create, move, or delete blocks from the local disk."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## JobTracker\n",
    "\n",
    "- There is only one JobTracker daemon per Hadoop cluster. It’s typically run on a server as a master node of the cluster.\n",
    "\n",
    "- The JobTracker determines the execution plan by determining which fi les to process, assigns nodes to different tasks, and monitors all tasks as they’re running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefi ned limit of retries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## TaskTracker\n",
    "\n",
    "\n",
    "- Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns.\n",
    "\n",
    "- If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Setting up Hadoop \n",
    "\n",
    "- Basic requirements\n",
    "\n",
    "    - Linux machines (master 1, workers >= two)\n",
    "    - SSH (with passless login)\n",
    "    - Setup environment variables: `JAVA_HOME` , `HADOOP_HOME`\n",
    "\n",
    "- Cluster setup: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "- Core Hadoop configuration files\n",
    "\n",
    "    - `core-site.xml`\n",
    "    - `mapred-site.xml`\n",
    "    - `hdfs-site.xml`\n",
    "    \n",
    "- Demo Hadoop config at  https://github.com/feng-li/hadoop-spark-conf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Work with Hadoop File System"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]\n",
      "  CLASSNAME            run the class named CLASSNAME\n",
      " or\n",
      "  where COMMAND is one of:\n",
      "  fs                   run a generic filesystem user client\n",
      "  version              print the version\n",
      "  jar <jar>            run a jar file\n",
      "                       note: please use \"yarn jar\" to launch\n",
      "                             YARN applications, not this command.\n",
      "  checknative [-a|-h]  check native hadoop and compression libraries availability\n",
      "  distcp <srcurl> <desturl> copy file or directories recursively\n",
      "  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive\n",
      "  classpath            prints the class path needed to get the\n",
      "  credential           interact with credential providers\n",
      "                       Hadoop jar and the required libraries\n",
      "  daemonlog            get/set the log level for each daemon\n",
      "  trace                view and modify Hadoop tracing settings\n",
      "\n",
      "Most commands print help when invoked w/o parameters.\n"
     ]
    }
   ],
   "source": [
    "hadoop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hadoop 2.7.2\n",
      "Subversion git@gitlab.alibaba-inc.com:soe/emr-hadoop.git -r 01979868624477e85d8958501eb27a460ce81e4c\n",
      "Compiled by root on 2018-08-31T09:14Z\n",
      "Compiled with protoc 2.5.0\n",
      "From source with checksum 4447ed9f24dcd981df7daaadd5bafc0\n",
      "This command was run using /opt/apps/ecm/service/hadoop/2.7.2-1.3.1/package/hadoop-2.7.2-1.3.1/share/hadoop/common/hadoop-common-2.7.2.jar\n"
     ]
    }
   ],
   "source": [
    "hadoop version "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/usr/lib/jvm/java-1.8.0\n"
     ]
    }
   ],
   "source": [
    "echo $JAVA_HOME"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/usr/lib/hadoop-current\n"
     ]
    }
   ],
   "source": [
    "echo $HADOOP_HOME"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/usr/lib/hadoop-current/lib/*:/usr/lib/tez-current/*:/usr/lib/tez-current/lib/*:/etc/ecm/tez-conf:/usr/lib/hadoop-current/lib/*:/usr/lib/tez-current/*:/usr/lib/tez-current/lib/*:/etc/ecm/tez-conf:/opt/apps/extra-jars/*:/usr/lib/spark-current/yarn/spark-2.4.4-yarn-shuffle.jar:/opt/apps/extra-jars/*:/usr/lib/spark-current/yarn/spark-2.4.4-yarn-shuffle.jar\n"
     ]
    }
   ],
   "source": [
    "echo $HADOOP_CLASSPATH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/etc/ecm/hadoop-conf\n"
     ]
    }
   ],
   "source": [
    "echo $HADOOP_CONF_DIR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Usage: hadoop fs [generic options]\n",
      "\t[-appendToFile <localsrc> ... <dst>]\n",
      "\t[-cat [-ignoreCrc] <src> ...]\n",
      "\t[-checksum <src> ...]\n",
      "\t[-chgrp [-R] GROUP PATH...]\n",
      "\t[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]\n",
      "\t[-chown [-R] [OWNER][:[GROUP]] PATH...]\n",
      "\t[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]\n",
      "\t[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]\n",
      "\t[-count [-q] [-h] <path> ...]\n",
      "\t[-cp [-f] [-p | -p[topax]] <src> ... <dst>]\n",
      "\t[-createSnapshot <snapshotDir> [<snapshotName>]]\n",
      "\t[-deleteSnapshot <snapshotDir> <snapshotName>]\n",
      "\t[-df [-h] [<path> ...]]\n",
      "\t[-du [-s] [-h] <path> ...]\n",
      "\t[-expunge]\n",
      "\t[-find <path> ... <expression> ...]\n",
      "\t[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]\n",
      "\t[-getfacl [-R] <path>]\n",
      "\t[-getfattr [-R] {-n name | -d} [-e en] <path>]\n",
      "\t[-getmerge [-nl] <src> <localdst>]\n",
      "\t[-help [cmd ...]]\n",
      "\t[-ls [-d] [-h] [-R] [<path> ...]]\n",
      "\t[-mkdir [-p] <path> ...]\n",
      "\t[-moveFromLocal <localsrc> ... <dst>]\n",
      "\t[-moveToLocal <src> <localdst>]\n",
      "\t[-mv <src> ... <dst>]\n",
      "\t[-put [-f] [-p] [-l] <localsrc> ... <dst>]\n",
      "\t[-renameSnapshot <snapshotDir> <oldName> <newName>]\n",
      "\t[-rm [-f] [-r|-R] [-skipTrash] <src> ...]\n",
      "\t[-rmdir [--ignore-fail-on-non-empty] <dir> ...]\n",
      "\t[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]\n",
      "\t[-setfattr {-n name [-v value] | -x name} <path>]\n",
      "\t[-setrep [-R] [-w] <rep> <path> ...]\n",
      "\t[-stat [format] <path> ...]\n",
      "\t[-tail [-f] <file>]\n",
      "\t[-test -[defsz] <path>]\n",
      "\t[-text [-ignoreCrc] <src> ...]\n",
      "\t[-touchz <path> ...]\n",
      "\t[-truncate [-w] <length> <path> ...]\n",
      "\t[-usage [cmd ...]]\n",
      "\n",
      "Generic options supported are\n",
      "-conf <configuration file>     specify an application configuration file\n",
      "-D <property=value>            use value for given property\n",
      "-fs <local|namenode:port>      specify a namenode\n",
      "-jt <local|resourcemanager:port>    specify a ResourceManager\n",
      "-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster\n",
      "-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.\n",
      "-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.\n",
      "\n",
      "The general command line syntax is\n",
      "bin/hadoop command [genericOptions] [commandOptions]\n",
      "\n"
     ]
    },
    {
     "ename": "",
     "evalue": "255",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "hadoop fs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 6 items\n",
      "drwxr-x--x   - hadoop    hadoop          0 2020-01-06 13:27 /apps\n",
      "drwxr-x--x   - lifeng    hadoop          0 2020-02-20 10:54 /data\n",
      "drwxrwxrwx   - flowagent hadoop          0 2020-01-06 13:27 /emr-flow\n",
      "drwxr-x--x   - hadoop    hadoop          0 2020-02-10 22:20 /spark-history\n",
      "drwxrwxrwx   - root      hadoop          0 2020-02-26 21:36 /tmp\n",
      "drwxr-x--t   - hadoop    hadoop          0 2020-02-22 15:56 /user\n"
     ]
    }
   ],
   "source": [
    "hadoop fs -ls /"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 4 items\n",
      "drwx------   - hadoop  hadoop          0 2020-01-06 13:29 /user/hadoop\n",
      "drwxr-x--x   - hadoop  hadoop          0 2020-01-06 13:27 /user/hive\n",
      "drwxr-x--x   - lifeng  hadoop          0 2020-02-21 12:06 /user/lifeng\n",
      "drwx------   - student hadoop          0 2020-02-22 15:28 /user/student\n"
     ]
    }
   ],
   "source": [
    "hadoop fs -ls /user"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "hadoop fs -put /opt/apps/ecm/service/hive/2.3.3-1.0.2/package/apache-hive-2.3.3-1.0.2-bin/binary-package-licenses/asm-LICENSE ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 2 items\n",
      "drwxr-x--x   - lifeng hadoop          0 2020-02-10 22:20 /user/lifeng/.sparkStaging\n",
      "-rw-r-----   2 lifeng hadoop       1511 2020-02-26 21:36 /user/lifeng/asm-LICENSE\n"
     ]
    }
   ],
   "source": [
    "hadoop fs -ls /user/lifeng"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "hadoop fs -mv asm-LICENSE license.txt"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}