MapReduce makes the guarantee that the input to every reducer is sorted by key.
The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is known as the shuffle.
Consider the following sample lines of input data:
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as key-value pairs.
The keys are the line offsets within the file, which we ignore in our map function:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
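As a rough illustration of where these keys come from, the sketch below computes the byte offset at which each line of a file starts. The function name `lines_with_offsets` and the short stand-in records are made up for this example; in a real job the framework's input format produces the offsets. The keys in the listing above advance by 106, which is consistent with each record occupying 106 bytes including its newline.

```python
import io

def lines_with_offsets(stream):
    """Yield (byte_offset, line) pairs from a binary stream, one per line."""
    offset = 0
    for raw in stream:
        # The key is the position of the line's first byte in the file.
        yield offset, raw.rstrip(b"\n").decode("ascii")
        # Advance by the full line length, newline included.
        offset += len(raw)

# Hypothetical stand-in data: three short lines instead of real weather records.
sample = io.BytesIO(b"aaaa\nbb\ncccccc\n")
pairs = list(lines_with_offsets(sample))
print(pairs)
```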
Input data: raw records containing temperature readings.
Mapper: pull out the year and the air temperature from each record.
(1950, 0)
(1950, 22)
(1950, -11)
(1949, 111)
(1949, 78)
...
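A minimal sketch of such a map function follows. The field positions (year at characters 15-19, signed temperature at 87-92, quality code at 92) are assumptions based on the fixed-width record layout, which the truncated sample lines do not show in full. `fake_record` is a hypothetical helper that pads stand-in records to the right width; it does not produce real weather data.

```python
def map_record(line):
    """Return (year, temperature) for one record, or None if it is unusable."""
    year = line[15:19]
    temp = line[87:92]        # sign plus four digits, e.g. "+0022"
    quality = line[92:93]
    # Assumed conventions: "+9999" marks a missing reading, and only
    # these quality codes indicate a trustworthy value.
    if temp == "+9999" or quality not in {"0", "1", "4", "5", "9"}:
        return None
    return year, int(temp)

def fake_record(year, temp, quality="1"):
    """Build a padded stand-in record (hypothetical, for illustration only)."""
    return "0" * 15 + year + "0" * 68 + temp + quality

records = [
    fake_record("1950", "+0000"),
    fake_record("1950", "+0022"),
    fake_record("1950", "-0011"),
    fake_record("1949", "+0111"),
    fake_record("1949", "+0078"),
]
mapped = [map_record(r) for r in records]
print(mapped)
```

The input key (the line offset) never appears in `map_record`, which reflects the fact that the map function ignores it.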
Shuffle: the framework sorts the map output and groups the values by key.
(1949, [111, 78])
(1950, [0, 22, -11])
...
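The sort-and-group step can be imitated in memory as shown below. This is only a single-process sketch under the assumption that all pairs fit in a list; the real shuffle also partitions the data and moves it between machines. The function name `shuffle` is chosen for this example.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    """Sort (key, value) pairs by key and collect the values for each key."""
    # sorted() is stable, so values for a key keep their original order.
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

map_output = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]
grouped = shuffle(map_output)
print(grouped)
```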
Reducer: pick the maximum temperature for each year.
(1949, 111)
(1950, 22)
...
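Taking the maximum of each group reproduces the final pairs. The sketch below assumes the grouped input is already available as a list; `reduce_max` is a name chosen for this example.

```python
def reduce_max(key, values):
    """Reduce one group of values to (key, maximum value)."""
    return key, max(values)

grouped = [(1949, [111, 78]), (1950, [0, 22, -11])]
output = [reduce_max(year, temps) for year, temps in grouped]
print(output)
```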
You can test your MapReduce job locally with a Linux shell pipeline (this assumes both scripts are executable).
The pipeline simulates a job with a single map task and a single reduce task:
cat sample_input.txt | ./mapper.py | sort | ./reducer.py
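Because `sort` places all lines with the same key next to each other, the reducer in such a pipeline only needs to watch for the key to change. The sketch below shows one possible core for `reducer.py`, assuming the mapper writes one `key<TAB>value` line per record; that line format and the function name `reduce_stream` are assumptions, not taken from the scripts themselves.

```python
import io

def reduce_stream(lines):
    """Yield (key, max_value) from 'key<TAB>value' lines sorted by key."""
    last_key, max_val = None, None
    for line in lines:
        key, val = line.rstrip("\n").split("\t")
        val = int(val)
        if last_key is not None and key != last_key:
            # Key changed: the previous key's run is complete, so emit it.
            yield last_key, max_val
            max_val = val
        else:
            max_val = val if max_val is None else max(max_val, val)
        last_key = key
    if last_key is not None:
        # Emit the final key once the input is exhausted.
        yield last_key, max_val

# Stand-in for the sorted output of the mapper stage.
sorted_lines = io.StringIO("1949\t111\n1949\t78\n1950\t-11\n1950\t0\n1950\t22\n")
results = list(reduce_stream(sorted_lines))
print(results)
```

In an actual script, the same function could be fed `sys.stdin` and each resulting pair printed as a tab-separated line.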