{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Introduction to Hadoop\n", "## Feng Li\n", "\n", "### Central University of Finance and Economics\n", "\n", "### [feng.li@cufe.edu.cn](feng.li@cufe.edu.cn)\n", "### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What is Hadoop?\n", "\n", "- Hadoop is a platform that provides both distributed storage and computational capabilities.\n", "\n", "- Hadoop is a distributed **master-worker** architecture consists of the **Hadoop Distributed File System (HDFS)** for storage and **MapReduce** for computational capabilities." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A Brief History of Hadoop\n", "\n", "- Hadoop was created by Doug Cutting.\n", "\n", "- At the time Google had published papers that described its novel distributed filesystem, the Google File System ( GFS ), and MapReduce, a computational framework for parallel processing.\n", "\n", "- The successful implementation of these papers’ concepts resulted in the Hadoop project.\n", "\n", "- Who use Hadoop?\n", "\n", " - Facebook uses Hadoop, Hive, and HB ase for data warehousing and real-time application serving.\n", " - Twitter uses Hadoop, Pig, and HB ase for data analysis, visualization, social graph analysis, and machine learning.\n", " - Yahoo! uses Hadoop for data analytics, machine learning, search ranking, email antispam, ad optimization...\n", " - eBay, Samsung, Rackspace, J.P. Morgan, Groupon, LinkedIn, AOL , Last.fm..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**But we (statisticians, financial analysts) are not yet there!**\n", "\n", "--- and we should!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![hadoop-architecture](./figures/hadoop-architecture.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Core Hadoop components: HDFS \n", "\n", "- HDFS is the storage component of Hadoop\n", "\n", "- It’s a distributed file system.\n", "- Logical representation of the components in HDFS : the **NameNode** and the **DataNode**.\n", "\n", "- HDFS replicates files for a configured number of times, is tolerant of both software and hardware failure, and automatically re-replicates data blocks on nodes that have failed.\n", "\n", "- HDFS isn’t designed to work well with random reads over small files due to its optimization for sustained throughput." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![mapreduce](./figures/hdfs.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Core Hadoop components: MapReduce\n", "\n", "\n", "- MapReduce is a batch-based, distributed computing framework.\n", "- It allows you to parallelize work over a large amount of raw data.\n", "- This type of work, which could take days or longer using conventional serial programming techniques, can be reduced down to minutes using MapReduce on a Hadoop cluster.\n", "- MapReduce allows the programmer to focus on addressing business needs, rather than getting tangled up in distributed system complications.\n", "- MapReduce doesn’t lend itself to use cases that need real-time data access.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![mapreduce](./figures/mapreduce-architecture.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The building blocks of Hadoop\n", "\n", "- NameNode\n", "- DataNode\n", "- Secondary NameNode\n", "- JobTracker\n", "- TaskTracker" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## NameNode\n", "\n", "- The NameNode is the master of HDFS that directs the worker DataNode daemons to perform the low-level I/O tasks.\n", "\n", "- The NameNode keeps track of how your fi les are broken down into fi le blocks, which nodes store those blocks, and the overall health of the distributed file system.\n", "\n", "- The NameNode is a single point of failure of your Hadoop cluster" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Secondary NameNode\n", "\n", "- An assistant daemon for monitoring the state of the cluster HDFS.\n", "\n", "- Each cluster has one Secondary NameNode.\n", "\n", "- The secondary NameNode snapshots help minimize the downtime and loss of data due to the failure of NameNode" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## DataNode\n", "\n", "- Each worker machine in your cluster will host a DataNode daemon to perform the grunt work of the distributed file system -- reading and writing HDFS blocks to actual files on the local file system.\n", "\n", "- DataNodes are constantly reporting to the NameNode.\n", "\n", "- Each of the DataNodes informs the NameNode of the blocks it’s currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes as well as receive instructions to create, move, or delete blocks from the local disk." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## JobTracker\n", "\n", "- There is only one JobTracker daemon per Hadoop cluster. It’s typically run on a server as a master node of the cluster.\n", "\n", "- The JobTracker determines the execution plan by determining which fi les to process, assigns nodes to different tasks, and monitors all tasks as they’re running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefi ned limit of retries." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## TaskTracker\n", "\n", "\n", "- Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns.\n", "\n", "- If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Setting up Hadoop \n", "\n", "- Basic requirements\n", "\n", " - Linux machines (master 1, workers >= two)\n", " - SSH (with passless login)\n", " - Setup environment variables: `JAVA_HOME` , `HADOOP_HOME`\n", "\n", "- Cluster setup: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Core Hadoop configuration files\n", "\n", " - `core-site.xml`\n", " - `mapred-site.xml`\n", " - `hdfs-site.xml`\n", " \n", "- Demo Hadoop config at https://github.com/feng-li/hadoop-spark-conf" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Work with Hadoop File System" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]\n", " CLASSNAME run the class named CLASSNAME\n", " or\n", " where COMMAND is one of:\n", " fs run a generic filesystem user client\n", " version print the version\n", " jar run a jar file\n", " note: please use \"yarn jar\" to launch\n", " YARN applications, not this command.\n", " checknative [-a|-h] check native hadoop and compression libraries availability\n", " distcp copy file or directories recursively\n", " archive -archiveName NAME -p * create a hadoop archive\n", " classpath prints the class path needed to get the\n", " credential interact with credential providers\n", " Hadoop jar and the required libraries\n", " daemonlog get/set the log level for each daemon\n", " trace view and modify Hadoop tracing settings\n", "\n", "Most commands print help when invoked w/o parameters.\n" ] } ], "source": [ "hadoop" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hadoop 2.7.2\n", "Subversion git@gitlab.alibaba-inc.com:soe/emr-hadoop.git -r 01979868624477e85d8958501eb27a460ce81e4c\n", "Compiled by root on 2018-08-31T09:14Z\n", "Compiled with protoc 2.5.0\n", "From source with checksum 4447ed9f24dcd981df7daaadd5bafc0\n", "This command was run using /opt/apps/ecm/service/hadoop/2.7.2-1.3.1/package/hadoop-2.7.2-1.3.1/share/hadoop/common/hadoop-common-2.7.2.jar\n" ] } ], "source": [ "hadoop version " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/usr/lib/jvm/java-1.8.0\n" ] } ], "source": [ "echo $JAVA_HOME" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/usr/lib/hadoop-current\n" ] } ], "source": [ "echo $HADOOP_HOME" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/usr/lib/hadoop-current/lib/*:/usr/lib/tez-current/*:/usr/lib/tez-current/lib/*:/etc/ecm/tez-conf:/usr/lib/hadoop-current/lib/*:/usr/lib/tez-current/*:/usr/lib/tez-current/lib/*:/etc/ecm/tez-conf:/opt/apps/extra-jars/*:/usr/lib/spark-current/yarn/spark-2.4.4-yarn-shuffle.jar:/opt/apps/extra-jars/*:/usr/lib/spark-current/yarn/spark-2.4.4-yarn-shuffle.jar\n" ] } ], "source": [ "echo $HADOOP_CLASSPATH" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/etc/ecm/hadoop-conf\n" ] } ], "source": [ "echo $HADOOP_CONF_DIR" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: hadoop fs [generic options]\n", "\t[-appendToFile ... ]\n", "\t[-cat [-ignoreCrc] ...]\n", "\t[-checksum ...]\n", "\t[-chgrp [-R] GROUP PATH...]\n", "\t[-chmod [-R] PATH...]\n", "\t[-chown [-R] [OWNER][:[GROUP]] PATH...]\n", "\t[-copyFromLocal [-f] [-p] [-l] ... ]\n", "\t[-copyToLocal [-p] [-ignoreCrc] [-crc] ... ]\n", "\t[-count [-q] [-h] ...]\n", "\t[-cp [-f] [-p | -p[topax]] ... ]\n", "\t[-createSnapshot []]\n", "\t[-deleteSnapshot ]\n", "\t[-df [-h] [ ...]]\n", "\t[-du [-s] [-h] ...]\n", "\t[-expunge]\n", "\t[-find ... ...]\n", "\t[-get [-p] [-ignoreCrc] [-crc] ... ]\n", "\t[-getfacl [-R] ]\n", "\t[-getfattr [-R] {-n name | -d} [-e en] ]\n", "\t[-getmerge [-nl] ]\n", "\t[-help [cmd ...]]\n", "\t[-ls [-d] [-h] [-R] [ ...]]\n", "\t[-mkdir [-p] ...]\n", "\t[-moveFromLocal ... ]\n", "\t[-moveToLocal ]\n", "\t[-mv ... ]\n", "\t[-put [-f] [-p] [-l] ... ]\n", "\t[-renameSnapshot ]\n", "\t[-rm [-f] [-r|-R] [-skipTrash] ...]\n", "\t[-rmdir [--ignore-fail-on-non-empty] ...]\n", "\t[-setfacl [-R] [{-b|-k} {-m|-x } ]|[--set ]]\n", "\t[-setfattr {-n name [-v value] | -x name} ]\n", "\t[-setrep [-R] [-w] ...]\n", "\t[-stat [format] ...]\n", "\t[-tail [-f] ]\n", "\t[-test -[defsz] ]\n", "\t[-text [-ignoreCrc] ...]\n", "\t[-touchz ...]\n", "\t[-truncate [-w] ...]\n", "\t[-usage [cmd ...]]\n", "\n", "Generic options supported are\n", "-conf specify an application configuration file\n", "-D use value for given property\n", "-fs specify a namenode\n", "-jt specify a ResourceManager\n", "-files specify comma separated files to be copied to the map reduce cluster\n", "-libjars specify comma separated jar files to include in the classpath.\n", "-archives specify comma separated archives to be unarchived on the compute machines.\n", "\n", "The general command line syntax is\n", "bin/hadoop command [genericOptions] [commandOptions]\n", "\n" ] }, { "ename": "", "evalue": "255", "output_type": "error", "traceback": [] } ], "source": [ "hadoop fs" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 6 items\n", "drwxr-x--x - hadoop hadoop 0 2020-01-06 13:27 /apps\n", "drwxr-x--x - lifeng hadoop 0 2020-02-20 10:54 /data\n", "drwxrwxrwx - flowagent hadoop 0 2020-01-06 13:27 /emr-flow\n", "drwxr-x--x - hadoop hadoop 0 2020-02-10 22:20 /spark-history\n", "drwxrwxrwx - root hadoop 0 2020-02-26 21:36 /tmp\n", "drwxr-x--t - hadoop hadoop 0 2020-02-22 15:56 /user\n" ] } ], "source": [ "hadoop fs -ls /" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 4 items\n", "drwx------ - hadoop hadoop 0 2020-01-06 13:29 /user/hadoop\n", "drwxr-x--x - hadoop hadoop 0 2020-01-06 13:27 /user/hive\n", "drwxr-x--x - lifeng hadoop 0 2020-02-21 12:06 /user/lifeng\n", "drwx------ - student hadoop 0 2020-02-22 15:28 /user/student\n" ] } ], "source": [ "hadoop fs -ls /user" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "hadoop fs -put /opt/apps/ecm/service/hive/2.3.3-1.0.2/package/apache-hive-2.3.3-1.0.2-bin/binary-package-licenses/asm-LICENSE ." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 2 items\n", "drwxr-x--x - lifeng hadoop 0 2020-02-10 22:20 /user/lifeng/.sparkStaging\n", "-rw-r----- 2 lifeng hadoop 1511 2020-02-26 21:36 /user/lifeng/asm-LICENSE\n" ] } ], "source": [ "hadoop fs -ls /user/lifeng" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "hadoop fs -mv asm-LICENSE license.txt" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.17" } }, "nbformat": 4, "nbformat_minor": 2 }