Hadoop is a platform that provides both distributed storage and computational capabilities.
Hadoop is a distributed master-worker architecture that consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for computation.
Hadoop was created by Doug Cutting.
At the time, Google had published papers describing its novel distributed filesystem, the Google File System (GFS), and MapReduce, a computational framework for parallel processing.
The successful implementation of these papers’ concepts resulted in the Hadoop project.
Who uses Hadoop?
But we (statisticians, financial analysts) are not there yet, and we should be!
HDFS is the storage component of Hadoop.
It’s a distributed file system.
Logical representation of the components in HDFS: the NameNode and the DataNode.
HDFS replicates each file's blocks a configured number of times, tolerates both software and hardware failure, and automatically re-replicates the data blocks that were stored on failed nodes.
HDFS isn’t designed to work well with random reads over small files due to its optimization for sustained throughput.
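The replication behaviour described above can be illustrated with a toy model. This is a minimal sketch, not HDFS code: the function names (split_into_blocks, place_replicas, re_replicate) and the node names are hypothetical, and the round-robin placement is a simplification assumed here, whereas real HDFS uses a rack-aware placement policy. The 128 MB block size and replication factor of 3 mirror the usual Hadoop 2 defaults.

```python
import math

MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file is cut into."""
    n_blocks = math.ceil(file_size / block_size)
    return [min(block_size, file_size - i * block_size) for i in range(n_blocks)]


def place_replicas(n_blocks, nodes, replication=3):
    """Toy round-robin placement (an assumption, not the HDFS policy):
    block i is stored on `replication` consecutive nodes."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        i: {nodes[(i + k) % len(nodes)] for k in range(replication)}
        for i in range(n_blocks)
    }


def re_replicate(placement, failed_node, live_nodes, replication=3):
    """Drop the failed node and copy under-replicated blocks to live nodes."""
    for holders in placement.values():
        holders.discard(failed_node)
        for node in live_nodes:
            if len(holders) >= replication:
                break
            holders.add(node)  # no-op if the node already holds the block
    return placement


# A 300 MB file on a hypothetical four-node cluster; then dn2 fails.
nodes = ["dn1", "dn2", "dn3", "dn4"]
sizes = split_into_blocks(300 * MB)
placement = place_replicas(len(sizes), nodes)
live = [n for n in nodes if n != "dn2"]
re_replicate(placement, "dn2", live)
```

After the failure is handled, every block is back to three replicas and none of them lives on dn2.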
The NameNode is the master of HDFS that directs the worker DataNode daemons to perform the low-level I/O tasks.
The NameNode keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed file system.
The NameNode is a single point of failure for your Hadoop cluster unless NameNode high availability (available since Hadoop 2) is configured.
The Secondary NameNode is an assistant daemon for monitoring the state of the cluster's HDFS.
Each cluster has one Secondary NameNode.
The Secondary NameNode's snapshots help minimize downtime and data loss caused by a NameNode failure.
Each worker machine in your cluster will host a DataNode daemon to perform the grunt work of the distributed file system -- reading and writing HDFS blocks to actual files on the local file system.
DataNodes are constantly reporting to the NameNode.
Each of the DataNodes informs the NameNode of the blocks it’s currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes as well as receive instructions to create, move, or delete blocks from the local disk.
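The bookkeeping between the NameNode and the DataNodes can be sketched as two lookup tables that are refreshed by block reports. This is a toy model under stated assumptions: the class ToyNameNode, its method names, and the block and node identifiers are all hypothetical and do not correspond to the real Hadoop API.

```python
from collections import defaultdict


class ToyNameNode:
    """Hypothetical in-memory model of the NameNode's metadata."""

    def __init__(self):
        self.file_blocks = {}                    # file path -> ordered block ids
        self.block_locations = defaultdict(set)  # block id -> DataNodes holding it

    def add_file(self, path, block_ids):
        """Record how a file is broken down into blocks."""
        self.file_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """A DataNode reports the full list of blocks it currently stores."""
        for holders in self.block_locations.values():
            holders.discard(datanode)            # forget the previous report
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return (block id, sorted DataNodes) pairs for every block of a file."""
        return [
            (b, sorted(self.block_locations.get(b, set())))
            for b in self.file_blocks[path]
        ]


nn = ToyNameNode()
nn.add_file("/data/a.txt", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1"])
nn.block_report("dn2", ["blk_1", "blk_2"])
before = nn.locate("/data/a.txt")

# dn1 later reports that it no longer stores any block.
nn.block_report("dn1", [])
after = nn.locate("/data/a.txt")
```

A client asking for /data/a.txt would receive the block list from the NameNode and then read the bytes directly from the listed DataNodes.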
There is only one JobTracker daemon per Hadoop cluster. It’s typically run on a server as a master node of the cluster.
The JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they’re running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries.
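The relaunch-with-a-limit behaviour can be sketched in a few lines. This is an illustration only: run_with_retries, flaky_task, the node names, and the default of four attempts are assumptions made for this example, not the JobTracker's actual code or configuration.

```python
def run_with_retries(task, nodes, max_attempts=4):
    """Try `task` on successive nodes until it succeeds or the limit is hit.

    Returns (result, list of nodes tried). Raises RuntimeError when every
    attempt fails.
    """
    tried = []
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # move on to a different node
        tried.append(node)
        try:
            return task(node), tried
        except Exception:
            continue                        # task failed: relaunch elsewhere
    raise RuntimeError(f"task failed after {max_attempts} attempts on {tried}")


def flaky_task(node):
    """Hypothetical task that always fails on node1."""
    if node == "node1":
        raise IOError("disk error on node1")
    return f"done on {node}"


result, tried = run_with_retries(flaky_task, ["node1", "node2", "node3"])
```

Here the first attempt on node1 fails and the task is relaunched on node2, where it succeeds.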
Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns.
If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.
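The heartbeat timeout described above can be modelled with a table of last-seen timestamps. This is a minimal sketch under assumptions: ToyJobTracker, its methods, the tracker and task names, and the 10-second timeout are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class ToyJobTracker:
    """Hypothetical model of heartbeat-based failure detection."""

    def __init__(self, timeout):
        self.timeout = timeout    # seconds of silence tolerated
        self.last_heartbeat = {}  # tracker name -> time of last heartbeat
        self.assignments = {}     # tracker name -> tasks assigned to it

    def heartbeat(self, tracker, now):
        self.last_heartbeat[tracker] = now

    def assign(self, tracker, task):
        self.assignments.setdefault(tracker, []).append(task)

    def expire(self, now):
        """Return (lost trackers, tasks that must be resubmitted)."""
        lost = [
            tracker
            for tracker, seen in self.last_heartbeat.items()
            if now - seen > self.timeout
        ]
        orphaned = []
        for tracker in lost:
            orphaned.extend(self.assignments.pop(tracker, []))
            del self.last_heartbeat[tracker]
        return lost, orphaned


jt = ToyJobTracker(timeout=10)
jt.heartbeat("tt1", now=0)
jt.heartbeat("tt2", now=0)
jt.assign("tt1", "map_0")
jt.assign("tt2", "map_1")
jt.heartbeat("tt2", now=8)  # tt2 keeps reporting, tt1 goes silent

lost, orphaned = jt.expire(now=12)
```

Only tt1 has been silent for longer than the timeout, so only its task map_0 is handed back for resubmission to another node.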
Basic requirements
Cluster setup: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html
Core Hadoop configuration files
Demo Hadoop config at https://github.com/feng-li/hadoop-spark-conf
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
hadoop version
Hadoop 2.7.2
hadoop fs
Usage: hadoop fs [generic options]
hadoop fs -ls /
Found 6 items
drwxr-x--x - hadoop hadoop 0 2020-01-06 13:27 /apps
drwxr-x--x - lifeng hadoop 0 2020-02-20 10:54 /data
drwxrwxrwx - flowagent hadoop 0 2020-01-06 13:27 /emr-flow
drwxr-x--x - hadoop hadoop 0 2020-02-10 22:20 /spark-history
drwxrwxrwx - root hadoop 0 2020-02-26 21:36 /tmp
drwxr-x--t - hadoop hadoop 0 2020-02-22 15:56 /user
hadoop fs -ls /user
Found 4 items
drwx------ - hadoop hadoop 0 2020-01-06 13:29 /user/hadoop
drwxr-x--x - hadoop hadoop 0 2020-01-06 13:27 /user/hive
drwxr-x--x - lifeng hadoop 0 2020-02-21 12:06 /user/lifeng
drwx------ - student hadoop 0 2020-02-22 15:28 /user/student
hadoop fs -put /opt/apps/ecm/service/hive/2.3.3-1.0.2/package/apache-hive-2.3.3-1.0.2-bin/binary-package-licenses/asm-LICENSE .
hadoop fs -ls /user/lifeng
Found 2 items
drwxr-x--x - lifeng hadoop 0 2020-02-10 22:20 /user/lifeng/.sparkStaging
-rw-r----- 2 lifeng hadoop 1511 2020-02-26 21:36 /user/lifeng/asm-LICENSE
hadoop fs -mv asm-LICENSE license.txt