{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Statistics with Hadoop Streaming\n", "\n", "## Feng Li\n", "\n", "### Central University of Finance and Economics\n", "\n", "### [feng.li@cufe.edu.cn](feng.li@cufe.edu.cn)\n", "### Course home page: [https://feng.li/distcomp](https://feng.li/distcomp)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A simple line count program" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#! /usr/bin/env python3\n", "\n", "import sys\n", "count = 0\n", "# data = []\n", "for line in sys.stdin: # read input from stdin\n", " count += 1\n", " # data.append(line) \n", "print(count) # print goes to sys.stdout\n" ] } ], "source": [ "cat line_count.py" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "29\n" ] } ], "source": [ "cat license.txt | python3 line_count.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### We could write the **long** Hadoop command into an `.sh` file, say `run_line_count.sh`" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#! /usr/bin/sh\n", "\n", "TASKNAME=line_count\n", "\n", "hadoop fs -rm -r ./output/\n", "hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \\\n", " -jobconf mapred.job.name=$TASKNAME \\\n", " -input /user/lifeng/data/license.txt \\\n", " -output ./output \\\n", " -file \"line_count.py\" \\\n", " -mapper \"/usr/bin/cat\" \\\n", " -reducer \"python3 line_count.py\" \\\n", " -numReduceTasks 1 \n" ] } ], "source": [ "cat run_line_count.sh" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18/12/06 23:46:54 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 30 minutes.\n", "Moved: 'hdfs://emr-header-1.cluster-41697:9000/user/lifeng/output' to trash at: hdfs://emr-header-1.cluster-41697:9000/user/lifeng/.Trash/Current\n", "18/12/06 23:46:55 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.\n", "18/12/06 23:46:56 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.\n", "18/12/06 23:46:56 INFO Configuration.deprecation: mapred.job.name is deprecated. 
Instead, use mapreduce.job.name\n", "packageJobJar: [line_count.py, /tmp/hadoop-unjar6165594769339738850/] [] /tmp/streamjob1645783056874142978.jar tmpDir=null\n", "18/12/06 23:46:56 INFO impl.TimelineClientImpl: Timeline service address: http://emr-header-1.cluster-41697:8188/ws/v1/timeline/\n", "18/12/06 23:46:57 INFO client.RMProxy: Connecting to ResourceManager at emr-header-1.cluster-41697/192.168.0.219:8032\n", "18/12/06 23:46:57 INFO impl.TimelineClientImpl: Timeline service address: http://emr-header-1.cluster-41697:8188/ws/v1/timeline/\n", "18/12/06 23:46:57 INFO client.RMProxy: Connecting to ResourceManager at emr-header-1.cluster-41697/192.168.0.219:8032\n", "18/12/06 23:46:57 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries\n", "18/12/06 23:46:57 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 97184efe294f64a51a4c5c172cbc22146103da53]\n", "18/12/06 23:46:57 INFO mapred.FileInputFormat: Total input paths to process : 1\n", "18/12/06 23:46:57 INFO mapreduce.JobSubmitter: number of splits:16\n", "18/12/06 23:46:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1542711134746_0075\n", "18/12/06 23:46:57 INFO impl.YarnClientImpl: Submitted application application_1542711134746_0075\n", "18/12/06 23:46:57 INFO mapreduce.Job: The url to track the job: http://emr-header-1.cluster-41697:20888/proxy/application_1542711134746_0075/\n", "18/12/06 23:46:57 INFO mapreduce.Job: Running job: job_1542711134746_0075\n", "18/12/06 23:47:02 INFO mapreduce.Job: Job job_1542711134746_0075 running in uber mode : false\n", "18/12/06 23:47:02 INFO mapreduce.Job: map 0% reduce 0%\n", "18/12/06 23:47:08 INFO mapreduce.Job: map 100% reduce 0%\n", "18/12/06 23:47:13 INFO mapreduce.Job: map 100% reduce 100%\n", "18/12/06 23:47:13 INFO mapreduce.Job: Job job_1542711134746_0075 completed successfully\n", "18/12/06 23:47:14 INFO mapreduce.Job: Counters: 49\n", "\tFile System Counters\n", "\t\tFILE: Number of bytes read=895\n", "\t\tFILE: Number of bytes written=2258315\n", "\t\tFILE: Number of read operations=0\n", "\t\tFILE: Number of large read operations=0\n", "\t\tFILE: Number of write operations=0\n", "\t\tHDFS: Number of bytes read=14800\n", "\t\tHDFS: Number of bytes written=4\n", "\t\tHDFS: Number of read operations=51\n", "\t\tHDFS: Number of large read operations=0\n", "\t\tHDFS: Number of write operations=2\n", "\tJob Counters \n", "\t\tLaunched map tasks=16\n", "\t\tLaunched reduce tasks=1\n", "\t\tData-local map tasks=16\n", "\t\tTotal time spent by all maps in occupied slots (ms)=3969756\n", "\t\tTotal time spent by all reduces in occupied slots (ms)=229905\n", "\t\tTotal time spent by all map tasks (ms)=67284\n", "\t\tTotal time spent by all reduce tasks (ms)=1965\n", "\t\tTotal vcore-milliseconds taken by all map tasks=67284\n", "\t\tTotal vcore-milliseconds taken by all reduce tasks=1965\n", "\t\tTotal megabyte-milliseconds taken by all map tasks=125955648\n", "\t\tTotal megabyte-milliseconds taken by all reduce tasks=7356960\n", "\tMap-Reduce Framework\n", "\t\tMap input records=29\n", "\t\tMap output records=29\n", "\t\tMap output bytes=1540\n", "\t\tMap output materialized bytes=1535\n", "\t\tInput split bytes=1904\n", "\t\tCombine input records=0\n", "\t\tCombine output records=0\n", "\t\tReduce input groups=25\n", "\t\tReduce shuffle bytes=1535\n", "\t\tReduce input records=29\n", "\t\tReduce output records=1\n", "\t\tSpilled Records=58\n", "\t\tShuffled Maps =16\n", "\t\tFailed Shuffles=0\n", 
"\t\tMerged Map outputs=16\n", "\t\tGC time elapsed (ms)=3148\n", "\t\tCPU time spent (ms)=15270\n", "\t\tPhysical memory (bytes) snapshot=8800583680\n", "\t\tVirtual memory (bytes) snapshot=63775653888\n", "\t\tTotal committed heap usage (bytes)=12616990720\n", "\tShuffle Errors\n", "\t\tBAD_ID=0\n", "\t\tCONNECTION=0\n", "\t\tIO_ERROR=0\n", "\t\tWRONG_LENGTH=0\n", "\t\tWRONG_MAP=0\n", "\t\tWRONG_REDUCE=0\n", "\tFile Input Format Counters \n", "\t\tBytes Read=12896\n", "\tFile Output Format Counters \n", "\t\tBytes Written=4\n", "18/12/06 23:47:14 INFO streaming.StreamJob: Output directory: ./output\n" ] } ], "source": [ "sh run_line_count.sh" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Simple statistics with MapReduce" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AAPL\t10\n", "CSCO\t10\n", "GOOG\t5\n", "MSFT\t10\n", "YHOO\t10\n" ] } ], "source": [ "cat stocks.txt | python3 stocks_mapper.py | python3 stocks_reducer.py" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AAPL\t2009-01-02\t88.315\t\n", "AAPL\t2008-01-02\t197.055\t\n", "AAPL\t2007-01-03\t85.045\t\n", "AAPL\t2006-01-03\t73.565\t\n", "AAPL\t2005-01-03\t64.035\t\n", "AAPL\t2004-01-02\t21.415\t\n", "AAPL\t2003-01-02\t14.58\t\n", "AAPL\t2002-01-02\t22.675\t\n", "AAPL\t2001-01-02\t14.88\t\n", "AAPL\t2000-01-03\t108.405\t\n", "CSCO\t2009-01-02\t16.685\t\n", "CSCO\t2008-01-02\t26.77\t\n", "CSCO\t2007-01-03\t27.595\t\n", "CSCO\t2006-01-03\t17.33\t\n", "CSCO\t2005-01-03\t19.37\t\n", "CSCO\t2004-01-02\t24.305\t\n", "CSCO\t2003-01-02\t13.375\t\n", "CSCO\t2002-01-02\t18.835\t\n", "CSCO\t2001-01-02\t35.72\t\n", "CSCO\t2000-01-03\t109\t\n", "GOOG\t2009-01-02\t314.96\t\n", "GOOG\t2008-01-02\t689.03\t\n", "GOOG\t2007-01-03\t466.795\t\n", "GOOG\t2006-01-03\t428.875\t\n", "GOOG\t2005-01-03\t200.055\t\n", "MSFT\t2009-01-02\t19.93\t\n", "MSFT\t2008-01-02\t35.505\t\n", "MSFT\t2007-01-03\t29.885\t\n", "MSFT\t2006-01-03\t26.545\t\n", "MSFT\t2005-01-03\t26.77\t\n", "MSFT\t2004-01-02\t27.515\t\n", "MSFT\t2003-01-02\t53.01\t\n", "MSFT\t2002-01-02\t66.845\t\n", "MSFT\t2001-01-02\t43.755\t\n", "MSFT\t2000-01-03\t116.965\t\n", "YHOO\t2009-01-02\t12.51\t\n", "YHOO\t2008-01-02\t23.76\t\n", "YHOO\t2007-01-03\t25.73\t\n", "YHOO\t2006-01-03\t40.3\t\n", "YHOO\t2005-01-03\t38.27\t\n", "YHOO\t2004-01-02\t45.45\t\n", "YHOO\t2003-01-02\t17.095\t\n", "YHOO\t2002-01-02\t18.385\t\n", "YHOO\t2001-01-02\t29.25\t\n", "YHOO\t2000-01-03\t458.96\t\n" ] } ], "source": [ "cat stocks.txt | Rscript stock_day_avg.R" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### R version" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18/12/07 15:41:48 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 30 minutes.\n", "Moved: 'hdfs://emr-header-1.cluster-41697:9000/user/lifeng/output' to trash at: hdfs://emr-header-1.cluster-41697:9000/user/lifeng/.Trash/Current\n", "18/12/07 15:41:49 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.\n", "18/12/07 15:41:50 
WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.\n", "18/12/07 15:41:50 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name\n", "packageJobJar: [stock_day_avg.R, /tmp/hadoop-unjar8447436429789349154/] [] /tmp/streamjob5401932799782783867.jar tmpDir=null\n", "18/12/07 15:41:50 INFO impl.TimelineClientImpl: Timeline service address: http://emr-header-1.cluster-41697:8188/ws/v1/timeline/\n", "18/12/07 15:41:51 INFO client.RMProxy: Connecting to ResourceManager at emr-header-1.cluster-41697/192.168.0.219:8032\n", "18/12/07 15:41:51 INFO impl.TimelineClientImpl: Timeline service address: http://emr-header-1.cluster-41697:8188/ws/v1/timeline/\n", "18/12/07 15:41:51 INFO client.RMProxy: Connecting to ResourceManager at emr-header-1.cluster-41697/192.168.0.219:8032\n", "18/12/07 15:41:51 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries\n", "18/12/07 15:41:51 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 97184efe294f64a51a4c5c172cbc22146103da53]\n", "18/12/07 15:41:51 INFO mapred.FileInputFormat: Total input paths to process : 1\n", "18/12/07 15:41:51 INFO mapreduce.JobSubmitter: number of splits:16\n", "18/12/07 15:41:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1542711134746_0119\n", "18/12/07 15:41:51 INFO impl.YarnClientImpl: Submitted application application_1542711134746_0119\n", "18/12/07 15:41:51 INFO mapreduce.Job: The url to track the job: http://emr-header-1.cluster-41697:20888/proxy/application_1542711134746_0119/\n", "18/12/07 15:41:51 INFO mapreduce.Job: Running job: job_1542711134746_0119\n", "18/12/07 15:41:56 INFO mapreduce.Job: Job job_1542711134746_0119 running in uber mode : false\n", "18/12/07 15:41:56 INFO mapreduce.Job: map 0% reduce 0%\n", "18/12/07 15:42:03 INFO mapreduce.Job: map 100% reduce 0%\n", "18/12/07 15:42:08 INFO mapreduce.Job: map 100% reduce 100%\n", "18/12/07 15:42:08 INFO mapreduce.Job: Job job_1542711134746_0119 completed successfully\n", "18/12/07 15:42:08 INFO mapreduce.Job: Counters: 49\n", "\tFile System Counters\n", "\t\tFILE: Number of bytes read=1068\n", "\t\tFILE: Number of bytes written=2258875\n", "\t\tFILE: Number of read operations=0\n", "\t\tFILE: Number of large read operations=0\n", "\t\tFILE: Number of write operations=0\n", "\t\tHDFS: Number of bytes read=23344\n", "\t\tHDFS: Number of bytes written=1066\n", "\t\tHDFS: Number of read operations=51\n", "\t\tHDFS: Number of large read operations=0\n", "\t\tHDFS: Number of write operations=2\n", "\tJob Counters \n", "\t\tLaunched map tasks=16\n", "\t\tLaunched reduce tasks=1\n", "\t\tData-local map tasks=16\n", "\t\tTotal time spent by all maps in occupied slots (ms)=3672278\n", "\t\tTotal time spent by all reduces in occupied slots (ms)=243009\n", "\t\tTotal time spent by all map tasks (ms)=62242\n", "\t\tTotal time spent by all reduce tasks (ms)=2077\n", "\t\tTotal vcore-milliseconds taken by all map tasks=62242\n", "\t\tTotal vcore-milliseconds taken by all reduce tasks=2077\n", "\t\tTotal megabyte-milliseconds taken by all map tasks=116517024\n", "\t\tTotal megabyte-milliseconds taken by all reduce tasks=7776288\n", "\tMap-Reduce Framework\n", "\t\tMap input records=45\n", "\t\tMap output records=45\n", "\t\tMap output bytes=2557\n", "\t\tMap output materialized bytes=1752\n", "\t\tInput split bytes=1888\n", "\t\tCombine input records=0\n", "\t\tCombine output records=0\n", "\t\tReduce input groups=45\n", "\t\tReduce shuffle 
bytes=1752\n", "\t\tReduce input records=45\n", "\t\tReduce output records=45\n", "\t\tSpilled Records=90\n", "\t\tShuffled Maps =16\n", "\t\tFailed Shuffles=0\n", "\t\tMerged Map outputs=16\n", "\t\tGC time elapsed (ms)=2874\n", "\t\tCPU time spent (ms)=14450\n", "\t\tPhysical memory (bytes) snapshot=8774807552\n", "\t\tVirtual memory (bytes) snapshot=63769485312\n", "\t\tTotal committed heap usage (bytes)=12871794688\n", "\tShuffle Errors\n", "\t\tBAD_ID=0\n", "\t\tCONNECTION=0\n", "\t\tIO_ERROR=0\n", "\t\tWRONG_LENGTH=0\n", "\t\tWRONG_MAP=0\n", "\t\tWRONG_REDUCE=0\n", "\tFile Input Format Counters \n", "\t\tBytes Read=21456\n", "\tFile Output Format Counters \n", "\t\tBytes Written=1066\n", "18/12/07 15:42:08 INFO streaming.StreamJob: Output directory: ./output\n", "AAPL\t2000-01-03\t108.405\t\n", "AAPL\t2001-01-02\t14.88\t\n", "AAPL\t2002-01-02\t22.675\t\n", "AAPL\t2003-01-02\t14.58\t\n", "AAPL\t2004-01-02\t21.415\t\n", "AAPL\t2005-01-03\t64.035\t\n", "AAPL\t2006-01-03\t73.565\t\n", "AAPL\t2007-01-03\t85.045\t\n", "AAPL\t2008-01-02\t197.055\t\n", "AAPL\t2009-01-02\t88.315\t\n", "CSCO\t2000-01-03\t109\t\n", "CSCO\t2001-01-02\t35.72\t\n", "CSCO\t2002-01-02\t18.835\t\n", "CSCO\t2003-01-02\t13.375\t\n", "CSCO\t2004-01-02\t24.305\t\n", "CSCO\t2005-01-03\t19.37\t\n", "CSCO\t2006-01-03\t17.33\t\n", "CSCO\t2007-01-03\t27.595\t\n", "CSCO\t2008-01-02\t26.77\t\n", "CSCO\t2009-01-02\t16.685\t\n", "GOOG\t2005-01-03\t200.055\t\n", "GOOG\t2006-01-03\t428.875\t\n", "GOOG\t2007-01-03\t466.795\t\n", "GOOG\t2008-01-02\t689.03\t\n", "GOOG\t2009-01-02\t314.96\t\n", "MSFT\t2000-01-03\t116.965\t\n", "MSFT\t2001-01-02\t43.755\t\n", "MSFT\t2002-01-02\t66.845\t\n", "MSFT\t2003-01-02\t53.01\t\n", "MSFT\t2004-01-02\t27.515\t\n", "MSFT\t2005-01-03\t26.77\t\n", "MSFT\t2006-01-03\t26.545\t\n", "MSFT\t2007-01-03\t29.885\t\n", "MSFT\t2008-01-02\t35.505\t\n", "MSFT\t2009-01-02\t19.93\t\n", "YHOO\t2000-01-03\t458.96\t\n", "YHOO\t2001-01-02\t29.25\t\n", "YHOO\t2002-01-02\t18.385\t\n", "YHOO\t2003-01-02\t17.095\t\n", "YHOO\t2004-01-02\t45.45\t\n", "YHOO\t2005-01-03\t38.27\t\n", "YHOO\t2006-01-03\t40.3\t\n", "YHOO\t2007-01-03\t25.73\t\n", "YHOO\t2008-01-02\t23.76\t\n", "YHOO\t2009-01-02\t12.51\t\n" ] } ], "source": [ "sh run_stocks_mean.sh" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Python version" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#! /usr/bin/env python3\n", "\n", "import sys\n", "\n", "for line in sys.stdin:\n", " part = line.split(',') \n", " print (part[0], 1)\n", " \n" ] } ], "source": [ "cat stocks_mapper.py" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#! 
/usr/bin/env python3\n", "\n", "import sys\n", "from operator import itemgetter \n", "wordcount = {}\n", "\n", "for line in sys.stdin:\n", " word,count = line.split(' ')\n", " count = int(count)\n", " wordcount[word] = wordcount.get(word,0) + count\n", "\n", "sorted_wordcount = sorted(wordcount.items(), key=itemgetter(0))\n", "\n", "for word, count in sorted_wordcount:\n", " print ('%s\\t%s'% (word,count))\n" ] } ], "source": [ "cat stocks_reducer.py" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18/12/07 15:22:39 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 30 minutes.\n", "Moved: 'hdfs://emr-header-1.cluster-41697:9000/user/lifeng/output' to trash at: hdfs://emr-header-1.cluster-41697:9000/user/lifeng/.Trash/Current\n", "18/12/07 15:22:40 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.\n", "18/12/07 15:22:41 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.\n", "18/12/07 15:22:41 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name\n", "packageJobJar: [stocks_mapper.py, stocks_reducer.py, /tmp/hadoop-unjar6949997625926419144/] [] /tmp/streamjob8221192035336864415.jar tmpDir=null\n", "18/12/07 15:22:41 INFO impl.TimelineClientImpl: Timeline service address: http://emr-header-1.cluster-41697:8188/ws/v1/timeline/\n", "18/12/07 15:22:42 INFO client.RMProxy: Connecting to ResourceManager at emr-header-1.cluster-41697/192.168.0.219:8032\n", "18/12/07 15:22:42 INFO impl.TimelineClientImpl: Timeline service address: http://emr-header-1.cluster-41697:8188/ws/v1/timeline/\n", "18/12/07 15:22:42 INFO client.RMProxy: Connecting to ResourceManager at emr-header-1.cluster-41697/192.168.0.219:8032\n", "18/12/07 15:22:42 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries\n", "18/12/07 15:22:42 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 97184efe294f64a51a4c5c172cbc22146103da53]\n", "18/12/07 15:22:42 INFO mapred.FileInputFormat: Total input paths to process : 1\n", "18/12/07 15:22:42 INFO mapreduce.JobSubmitter: number of splits:16\n", "18/12/07 15:22:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1542711134746_0103\n", "18/12/07 15:22:42 INFO impl.YarnClientImpl: Submitted application application_1542711134746_0103\n", "18/12/07 15:22:42 INFO mapreduce.Job: The url to track the job: http://emr-header-1.cluster-41697:20888/proxy/application_1542711134746_0103/\n", "18/12/07 15:22:42 INFO mapreduce.Job: Running job: job_1542711134746_0103\n", "18/12/07 15:22:48 INFO mapreduce.Job: Job job_1542711134746_0103 running in uber mode : false\n", "18/12/07 15:22:48 INFO mapreduce.Job: map 0% reduce 0%\n", "18/12/07 15:22:54 INFO mapreduce.Job: map 100% reduce 0%\n", "18/12/07 15:22:59 INFO mapreduce.Job: map 100% reduce 67%\n", "18/12/07 15:23:00 INFO mapreduce.Job: map 100% reduce 100%\n", "18/12/07 15:23:00 INFO mapreduce.Job: Job job_1542711134746_0103 completed successfully\n", "18/12/07 15:23:00 INFO mapreduce.Job: Counters: 50\n", "\tFile System Counters\n", "\t\tFILE: Number of bytes read=117\n", "\t\tFILE: Number of bytes written=2529253\n", "\t\tFILE: Number of read operations=0\n", "\t\tFILE: Number of large read operations=0\n", "\t\tFILE: Number of write operations=0\n", 
"\t\tHDFS: Number of bytes read=23344\n", "\t\tHDFS: Number of bytes written=39\n", "\t\tHDFS: Number of read operations=57\n", "\t\tHDFS: Number of large read operations=0\n", "\t\tHDFS: Number of write operations=6\n", "\tJob Counters \n", "\t\tKilled reduce tasks=1\n", "\t\tLaunched map tasks=16\n", "\t\tLaunched reduce tasks=3\n", "\t\tData-local map tasks=16\n", "\t\tTotal time spent by all maps in occupied slots (ms)=3827094\n", "\t\tTotal time spent by all reduces in occupied slots (ms)=711711\n", "\t\tTotal time spent by all map tasks (ms)=64866\n", "\t\tTotal time spent by all reduce tasks (ms)=6083\n", "\t\tTotal vcore-milliseconds taken by all map tasks=64866\n", "\t\tTotal vcore-milliseconds taken by all reduce tasks=6083\n", "\t\tTotal megabyte-milliseconds taken by all map tasks=121429152\n", "\t\tTotal megabyte-milliseconds taken by all reduce tasks=22774752\n", "\tMap-Reduce Framework\n", "\t\tMap input records=45\n", "\t\tMap output records=45\n", "\t\tMap output bytes=360\n", "\t\tMap output materialized bytes=900\n", "\t\tInput split bytes=1888\n", "\t\tCombine input records=0\n", "\t\tCombine output records=0\n", "\t\tReduce input groups=5\n", "\t\tReduce shuffle bytes=900\n", "\t\tReduce input records=45\n", "\t\tReduce output records=5\n", "\t\tSpilled Records=90\n", "\t\tShuffled Maps =48\n", "\t\tFailed Shuffles=0\n", "\t\tMerged Map outputs=48\n", "\t\tGC time elapsed (ms)=3061\n", "\t\tCPU time spent (ms)=15520\n", "\t\tPhysical memory (bytes) snapshot=9426890752\n", "\t\tVirtual memory (bytes) snapshot=74381381632\n", "\t\tTotal committed heap usage (bytes)=14251720704\n", "\tShuffle Errors\n", "\t\tBAD_ID=0\n", "\t\tCONNECTION=0\n", "\t\tIO_ERROR=0\n", "\t\tWRONG_LENGTH=0\n", "\t\tWRONG_MAP=0\n", "\t\tWRONG_REDUCE=0\n", "\tFile Input Format Counters \n", "\t\tBytes Read=21456\n", "\tFile Output Format Counters \n", "\t\tBytes Written=39\n", "18/12/07 15:23:00 INFO streaming.StreamJob: Output directory: ./output\n", "AAPL\t10\n", "YHOO\t10\n", "CSCO\t10\n", "MSFT\t10\n", "GOOG\t5\n" ] } ], "source": [ "sh run_stocks.sh" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Linear Regression with Hadoop\n", "\n", "### Let's first generate some data" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#! 
/usr/bin/Rscript\n", "\n", "n = 100000\n", "p = 10\n", "x = matrix(rnorm(n*p), n, p)\n", "e = rnorm(n)\n", "beta = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # generate beta coefficients 1 to 10\n", "y = x%*%beta+0.3*e\n", "mydata = cbind(x, y)\n", "dim(mydata)\n", "write.table(mydata, \"linear_random.csv\", sep = \",\" , row.names = FALSE, col.names = FALSE)\n", "colnames(mydata) = c(\"x1\", \"x2\", \"x3\", \"x4\", \"x5\", \"x6\", \"x7\", \"x8\", \"x9\", \"x10\", \"y\")\n", "mydata = data.frame(mydata)\n", "myfit = lm(y~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, mydata)\n", "myfit$coefficients\n" ] } ], "source": [ "cat simulation.R" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 100000 11\n", " (Intercept) x1 x2 x3 x4 x5 \n", " 0.001462286 0.999006118 1.998435334 2.999921614 4.000487524 4.999367984 \n", " x6 x7 x8 x9 x10 \n", " 5.999635871 7.000364838 7.999219909 8.999339902 10.000583860 \n" ] } ], "source": [ "Rscript simulation.R" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-1.25783985738839,0.233853066199809,0.959321896002629,-0.927971998903392,-1.9081222471712,-1.15679780312436,-0.98285146450708,-0.833097463331552,0.305515852568209,-1.79625721854489,-47.3338205169692\n", "-0.620172983975508,1.73728982281345,-0.829302997285467,0.354197934959032,0.85902952682553,0.489616646142072,0.0563573499545251,-0.107110454558573,0.189569420283467,-0.500661238817971,5.24542404493916\n", "1.40735729353374,1.25565108871813,0.988131408681291,0.857230185753291,0.63206801604377,2.16416156314474,-0.644790848920365,-0.51230132081169,1.94601333805292,-0.0724748987249774,34.2115745174301\n", "-0.0423011414422383,1.60278591223843,0.670382006567984,0.133106694554098,0.351700062468421,0.129326484274798,-0.464781051242584,0.764922912370213,-0.419752110275864,1.07624550982786,17.5337240754468\n", "0.662088621989811,-0.554459357346132,-0.83952037843494,1.43008790985413,1.40697432960699,-0.17754879747715,0.0269668971687026,-1.10737416485557,-1.14655494403846,-0.825552889554892,-18.7422354539862\n", "-0.605091503904831,1.08722575798844,1.12151767015535,-0.467934606210984,-1.1702020092535,-0.443650557748794,-0.802428089144584,-0.703702959432261,0.765019193846429,0.950967476748649,-0.369219523421802\n", "1.45255921490658,0.596968957864977,-1.38110209866925,-1.22015325783075,1.3321747339368,-0.618288146956072,-0.382441706582264,0.207848059847487,0.889489099391418,-1.38471032943358,-10.751906812882\n", "-0.0794402496054108,-0.0501255662315621,-0.0350709967052265,-1.00325836595628,-1.63010974782501,-0.00656109346693426,-0.710238219439852,1.55369135240367,-0.655536261944941,0.27165497553634,-8.87422553928433\n", "1.82499632174219,-1.37138612590107,0.66600446984248,2.27965171364776,0.577535835916579,3.0670970849713,0.475444508651725,0.252335257217842,0.394343365860688,1.26731818217592,52.5594762829493\n", "0.569523756632114,0.0236995546981009,0.185252906422781,0.955783376629012,0.114066408572127,-0.906101655277262,-0.87515532316799,1.24224011041282,-1.95110207141076,-0.172283404356755,-15.4308746142398\n" ] } ], "source": [ "head linear_random.csv" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Try with Linux pipes first" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "scrolled": false, "slideshow": { 
"slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "matrix([[ 0.99901524, 1.99843097, 2.99992025, 4.00048095,\n", " 4.99937017, 5.99963222, 7.00036961, 7.99921836,\n", " 8.99933589, 10.00058861]])\n" ] } ], "source": [ "cat linear_random.csv | python lr_mapper.py | python lr_reducer.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Run the regression model within Hadoop" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "scrolled": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18/12/07 16:39:28 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 30 minutes.\n", "Moved: 'hdfs://emr-header-1.cluster-41697:9000/user/lifeng/output' to trash at: hdfs://emr-header-1.cluster-41697:9000/user/lifeng/.Trash/Current\n", "18/12/07 16:39:29 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.\n", "packageJobJar: [lr_mapper.py, lr_reducer.py, /tmp/hadoop-unjar4917789017892634208/] [] /tmp/streamjob4579007115660462517.jar tmpDir=null\n", "18/12/07 16:39:30 INFO impl.TimelineClientImpl: Timeline service address: http://emr-header-1.cluster-41697:8188/ws/v1/timeline/\n", "18/12/07 16:39:30 INFO client.RMProxy: Connecting to ResourceManager at emr-header-1.cluster-41697/192.168.0.219:8032\n", "18/12/07 16:39:31 INFO impl.TimelineClientImpl: Timeline service address: http://emr-header-1.cluster-41697:8188/ws/v1/timeline/\n", "18/12/07 16:39:31 INFO client.RMProxy: Connecting to ResourceManager at emr-header-1.cluster-41697/192.168.0.219:8032\n", "18/12/07 16:39:31 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries\n", "18/12/07 16:39:31 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 97184efe294f64a51a4c5c172cbc22146103da53]\n", "18/12/07 16:39:31 INFO mapred.FileInputFormat: Total input paths to process : 1\n", "18/12/07 16:39:31 INFO mapreduce.JobSubmitter: number of splits:16\n", "18/12/07 16:39:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1542711134746_0136\n", "18/12/07 16:39:31 INFO impl.YarnClientImpl: Submitted application application_1542711134746_0136\n", "18/12/07 16:39:31 INFO mapreduce.Job: The url to track the job: http://emr-header-1.cluster-41697:20888/proxy/application_1542711134746_0136/\n", "18/12/07 16:39:31 INFO mapreduce.Job: Running job: job_1542711134746_0136\n", "18/12/07 16:39:36 INFO mapreduce.Job: Job job_1542711134746_0136 running in uber mode : false\n", "18/12/07 16:39:36 INFO mapreduce.Job: map 0% reduce 0%\n", "18/12/07 16:39:43 INFO mapreduce.Job: map 100% reduce 0%\n", "18/12/07 16:39:47 INFO mapreduce.Job: map 100% reduce 100%\n", "18/12/07 16:39:47 INFO mapreduce.Job: Job job_1542711134746_0136 completed successfully\n", "18/12/07 16:39:48 INFO mapreduce.Job: Counters: 50\n", "\tFile System Counters\n", "\t\tFILE: Number of bytes read=12815\n", "\t\tFILE: Number of bytes written=2286899\n", "\t\tFILE: Number of read operations=0\n", "\t\tFILE: Number of large read operations=0\n", "\t\tFILE: Number of write operations=0\n", "\t\tHDFS: Number of bytes read=19966275\n", "\t\tHDFS: Number of bytes written=172\n", "\t\tHDFS: Number of read operations=51\n", "\t\tHDFS: Number of large read operations=0\n", "\t\tHDFS: Number of write operations=2\n", "\tJob Counters \n", "\t\tKilled map tasks=1\n", 
"\t\tLaunched map tasks=16\n", "\t\tLaunched reduce tasks=1\n", "\t\tData-local map tasks=16\n", "\t\tTotal time spent by all maps in occupied slots (ms)=4250124\n", "\t\tTotal time spent by all reduces in occupied slots (ms)=227331\n", "\t\tTotal time spent by all map tasks (ms)=72036\n", "\t\tTotal time spent by all reduce tasks (ms)=1943\n", "\t\tTotal vcore-milliseconds taken by all map tasks=72036\n", "\t\tTotal vcore-milliseconds taken by all reduce tasks=1943\n", "\t\tTotal megabyte-milliseconds taken by all map tasks=134851392\n", "\t\tTotal megabyte-milliseconds taken by all reduce tasks=7274592\n", "\tMap-Reduce Framework\n", "\t\tMap input records=100000\n", "\t\tMap output records=16\n", "\t\tMap output bytes=35143\n", "\t\tMap output materialized bytes=13082\n", "\t\tInput split bytes=2000\n", "\t\tCombine input records=0\n", "\t\tCombine output records=0\n", "\t\tReduce input groups=16\n", "\t\tReduce shuffle bytes=13082\n", "\t\tReduce input records=16\n", "\t\tReduce output records=3\n", "\t\tSpilled Records=32\n", "\t\tShuffled Maps =16\n", "\t\tFailed Shuffles=0\n", "\t\tMerged Map outputs=16\n", "\t\tGC time elapsed (ms)=2891\n", "\t\tCPU time spent (ms)=16690\n", "\t\tPhysical memory (bytes) snapshot=8780718080\n", "\t\tVirtual memory (bytes) snapshot=63781085184\n", "\t\tTotal committed heap usage (bytes)=12934709248\n", "\tShuffle Errors\n", "\t\tBAD_ID=0\n", "\t\tCONNECTION=0\n", "\t\tIO_ERROR=0\n", "\t\tWRONG_LENGTH=0\n", "\t\tWRONG_MAP=0\n", "\t\tWRONG_REDUCE=0\n", "\tFile Input Format Counters \n", "\t\tBytes Read=19964275\n", "\tFile Output Format Counters \n", "\t\tBytes Written=172\n", "18/12/07 16:39:48 INFO streaming.StreamJob: Output directory: ./output\n", "matrix([[ 0.99901524, 1.99843097, 2.99992025, 4.00048095,\t\n", " 4.99937017, 5.99963222, 7.00036961, 7.99921836,\t\n", " 8.99933589, 10.00058861]])\t\n" ] } ], "source": [ "sh run_lr.sh" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Lab\n", "\n", "- Use `airline_small.csv` as input. The data description is available at http://stat-computing.org/dataexpo/2009/the-data.html\n", "\n", "- Extract useful information from the data\n", "\n", " - List all airport codes, with frequency\n", " - Make a new binary variable (Y) to indicate if a trip is delayed or not.\n", " \n", "- Make dummy transformation for variables such as `DayofWeek`, `Month`...\n", "\n", "- Finally, save your output in a file.\n", "\n", " - Each row contains the binary variable (Y), `CarrierDelay`, and your constructed dummy variables as predictors.\n", " - If possible, save the output in a [`libsvm` sparse format](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html#sklearn.datasets.dump_svmlight_file) to save space.\n", " \n", "\n", "- **Hint**\n", "\n", " - You could use any language but Python3.7 is preferable.\n", " - Try your code with pipe first and then Hadoop\n", " - Record the computing time." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 2 }