131. Explain the Reducer’s reduce phase.
Ans: In this phase the reduce(MapOutKeyType, Iterable<MapOutValType>, Context) method is called once for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via Context.write(ReduceOutKeyType, ReduceOutValType). Applications can use the Context to report progress, set application-level status messages, update Counters, or simply indicate that they are alive. The output of the Reducer is not sorted.
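As a rough illustration of the call pattern described above, the following plain-Java sketch invokes a reduce function once per grouped key. It does not use the Hadoop API: SketchContext, runReducePhase, and the word-count style summing are assumptions made for this example, standing in for Hadoop's Context and the framework's driver loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-JDK illustration only; none of these types are Hadoop classes.
class ReducePhaseSketch {

    // Hypothetical stand-in for Hadoop's Context: it only collects written pairs.
    static class SketchContext {
        final List<String> written = new ArrayList<>();

        void write(String key, int value) {
            written.add(key + "=" + value);
        }
    }

    // Called once per key, with every value that was grouped under that key.
    static void reduce(String key, Iterable<Integer> values, SketchContext context) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        context.write(key, sum);
    }

    // Drives the reduce phase: one reduce() call per <key, (list of values)> pair.
    static List<String> runReducePhase(Map<String, List<Integer>> groupedInputs) {
        SketchContext context = new SketchContext();
        for (Map.Entry<String, List<Integer>> group : groupedInputs.entrySet()) {
            reduce(group.getKey(), group.getValue(), context);
        }
        return context.written;
    }
}
```

For the grouped input {hadoop=[1, 1, 1], reduce=[1, 1]} the sketch writes hadoop=3 and reduce=2, one output record per reduce() call.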
132. How many Reducers should be configured?
Ans: The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapreduce.tasktracker.reduce.tasks.maximum). With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing. Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
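The arithmetic behind this rule of thumb can be sketched as follows. The class and method names are hypothetical, and rounding down to a whole number of reduces is an assumption of this example, since the formula itself does not say how to round.

```java
// Plain-JDK illustration of the 0.95 / 1.75 rule of thumb; not a Hadoop class.
class ReducerCountSketch {

    // factor * (nodes * reduce slots per node), rounded down (rounding is an assumption).
    static int suggestedReducers(double factor, int nodes, int reduceSlotsPerNode) {
        int totalReduceSlots = nodes * reduceSlotsPerNode;
        return (int) Math.floor(factor * totalReduceSlots);
    }
}
```

With 10 nodes and 2 reduce slots per node, the sketch suggests 19 reduces for the 0.95 factor (a single wave that fits in the 20 available slots) and 35 for the 1.75 factor (which requires a second wave).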
133. Is it possible for a Job to have 0 reducers?
Ans: Yes. It is legal to set the number of reduce-tasks to zero if no reduction is desired.
134. What happens if the number of reducers is 0?
Ans: In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
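The effect of having no reduce phase can be sketched in plain Java. In Hadoop itself a map-only job is typically requested with Job.setNumReduceTasks(0); the code below does not call Hadoop at all, and its map function and class name are hypothetical. It only shows that the records come out in the order the maps emitted them, with no sort step in between.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-JDK illustration of a map-only run; not a Hadoop class.
class MapOnlySketch {

    // Hypothetical map function: emits one lower-cased "word=1" record per token.
    static List<String> map(String line) {
        List<String> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token.toLowerCase() + "=1");
            }
        }
        return out;
    }

    // With zero reducers there is no shuffle or sort: map output is the final output.
    static List<String> runMapOnly(List<String> inputLines) {
        List<String> output = new ArrayList<>();
        for (String line : inputLines) {
            output.addAll(map(line));
        }
        return output;
    }
}
```

For the input lines "pig hive" and "ant", the sketch produces pig=1, hive=1, ant=1 in that order, that is, unsorted.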
135. How many instances of JobTracker can run on a Hadoop cluster?
Ans: Only one.
136. What is the JobTracker and what does it do in a Hadoop cluster?
Ans: The JobTracker is a daemon service that submits and tracks MapReduce tasks on the Hadoop cluster. It runs in its own JVM process, usually on a separate machine, and each slave node is configured with the location of the JobTracker node.
The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
The JobTracker performs the following actions:
- Client applications submit jobs to the JobTracker.
- The JobTracker talks to the NameNode to determine the location of the data.
- The JobTracker locates TaskTracker nodes with available slots at or near the data.
- The JobTracker submits the work to the chosen TaskTracker nodes.
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
- A TaskTracker will notify the JobTracker when a task fails. The JobTracker then decides what to do: it may resubmit the task elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
- When the work is completed, the JobTracker updates its status.
- Client applications can poll the JobTracker for information.
137. How is a task scheduled by the JobTracker?
Ans: The TaskTrackers send heartbeat messages to the JobTracker at short, regular intervals (every few seconds by default) to reassure the JobTracker that they are still alive. These messages also report the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data. If none is available, it looks for an empty slot on a machine in the same rack.
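The locality preference described above can be sketched as a two-pass search. This is a simplified illustration, not the JobTracker's actual scheduler code: the Tracker class, its fields, and pickHost are hypothetical names, and the behaviour when no slot is free in the rack is left open because the answer above does not describe it.

```java
import java.util.List;
import java.util.Optional;

// Plain-JDK illustration of data-local, then rack-local slot selection.
class LocalitySchedulingSketch {

    // Hypothetical view of a TaskTracker as reported by its latest heartbeat.
    static class Tracker {
        final String host;
        final String rack;
        final int freeSlots;

        Tracker(String host, String rack, int freeSlots) {
            this.host = host;
            this.rack = rack;
            this.freeSlots = freeSlots;
        }
    }

    // Prefer a free slot on the host holding the data, then any free slot in the same rack.
    static Optional<String> pickHost(List<Tracker> trackers, String dataHost, String dataRack) {
        for (Tracker t : trackers) {
            if (t.freeSlots > 0 && t.host.equals(dataHost)) {
                return Optional.of(t.host); // data-local slot
            }
        }
        for (Tracker t : trackers) {
            if (t.freeSlots > 0 && t.rack.equals(dataRack)) {
                return Optional.of(t.host); // rack-local slot
            }
        }
        return Optional.empty(); // beyond the rack, the behaviour is not covered here
    }
}
```

If the data sits on node1 in rackA and node1 has no free slot, the sketch falls back to another rackA machine with a free slot before considering anything else.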
138. How many instances of TaskTracker run on a Hadoop cluster?
Ans: There is one TaskTracker daemon process for each slave node in the Hadoop cluster.
139. What are the two main parts of the Hadoop framework?
Ans: Hadoop consists of two main parts:
- Hadoop Distributed File System (HDFS), a distributed file system with high throughput.
- Hadoop MapReduce, a software framework for processing large data sets.
140. Explain the use of the TaskTracker in the Hadoop cluster.
Ans: A TaskTracker is a daemon on a slave node in the cluster that accepts tasks from the JobTracker, such as Map, Reduce, or Shuffle operations. The TaskTracker also runs in its own JVM process.
Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a process failure
does not take down the TaskTracker.
The TaskTracker monitors these Task Instances, capturing the output and exit codes. When the Task Instances finish, successfully or not, the TaskTracker notifies the JobTracker.
The TaskTrackers also send heartbeat messages to the JobTracker at short, regular intervals (every few seconds by default) to reassure the JobTracker that they are still alive. These messages also report the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
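The heartbeat bookkeeping described above can be sketched from the JobTracker's side. This is a minimal plain-Java illustration, not Hadoop's implementation: the class, its methods, and the timeout value passed to the constructor are all hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-JDK illustration of heartbeat tracking; not a Hadoop class.
class HeartbeatMonitorSketch {
    private final long timeoutMillis; // hypothetical expiry window
    private final Map<String, Long> lastSeen = new TreeMap<>();
    private final Map<String, Integer> freeSlots = new TreeMap<>();

    HeartbeatMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record a heartbeat: the tracker is alive and reports its free slot count.
    void heartbeat(String tracker, int availableSlots, long nowMillis) {
        lastSeen.put(tracker, nowMillis);
        freeSlots.put(tracker, availableSlots);
    }

    // Trackers whose last heartbeat is older than the timeout are treated as failed.
    List<String> failedTrackers(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                failed.add(entry.getKey());
            }
        }
        return failed;
    }

    // Latest slot count reported by a tracker, or 0 if it has never reported.
    int availableSlots(String tracker) {
        return freeSlots.getOrDefault(tracker, 0);
    }
}
```

A tracker that keeps sending heartbeats stays off the failed list and its slot count stays current, while one that goes silent for longer than the timeout is reported as failed so that its work can be rescheduled.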