Latest Hadoop Interview Questions - Part 13

131. Explain the Reducer’s reduce phase?
Ans: In this phase the reduce(MapOutKeyType, Iterable<MapOutValType>, Context) method is called for
each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the
FileSystem via Context.write(ReduceOutKeyType, ReduceOutValType). Applications
can use the Context to report progress, set application-level status messages, update
Counters, or simply indicate that they are alive. The output of the Reducer is not sorted.
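
As a rough illustration of the call pattern described above, the plain-Java sketch below mimics it without the Hadoop libraries: pairs are grouped by key, and a reduce function is invoked once per group. The class name ReducePhaseSketch, the method names, and the word-count data are assumptions made for this example; they are not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; not the Hadoop API.
class ReducePhaseSketch {

    // Stand-in for the Reducer's reduce(key, values, context) method:
    // sums the values for one key and "writes" a single output record.
    static void reduce(String key, Iterable<Integer> values, List<String> output) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        output.add(key + "\t" + sum); // plays the role of Context.write(key, value)
    }

    // Groups (key, value) pairs by key, then calls reduce once per group.
    // The TreeMap stands in for the grouped reducer input.
    static List<String> run(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            reduce(group.getKey(), group.getValue(), output);
        }
        return output;
    }

    public static void main(String[] args) {
        // Hypothetical map output for a word count.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("hadoop", 1), Map.entry("hdfs", 1), Map.entry("hadoop", 1));
        for (String line : run(mapOutput)) {
            System.out.println(line);
        }
    }
}
```

Here reduce is called twice, once for "hadoop" with the values (1, 1) and once for "hdfs" with the value (1).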

132. How many Reducers should be configured?
Ans: A good rule of thumb is 0.95 or 1.75 multiplied by (<no. of
nodes> * mapreduce.tasktracker.reduce.tasks.maximum).
With 0.95, all of the reduces can launch immediately and start transferring map outputs as
the maps finish. With 1.75, the faster nodes finish their first round of reduces and
launch a second wave, which balances the load much better. Increasing
the number of reduces increases the framework overhead, but improves load balancing
and lowers the cost of failures.
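
The arithmetic can be sketched as follows. This is only a worked example: the class and method names are made up for illustration, the cluster size (10 nodes, 2 reduce slots per node) is hypothetical, and rounding down to a whole number is an assumption of this sketch.

```java
// Plain-Java arithmetic only; not part of the Hadoop API.
class ReducerCountSketch {

    // factor is 0.95 (one wave of reduces) or 1.75 (two waves).
    // reduceSlotsPerNode corresponds to mapreduce.tasktracker.reduce.tasks.maximum.
    static int suggestedReducers(int nodes, int reduceSlotsPerNode, double factor) {
        int totalReduceSlots = nodes * reduceSlotsPerNode;
        return (int) Math.floor(factor * totalReduceSlots); // rounding down is this sketch's choice
    }

    public static void main(String[] args) {
        // Hypothetical cluster: 10 nodes with 2 reduce slots each.
        System.out.println(suggestedReducers(10, 2, 0.95)); // prints 19
        System.out.println(suggestedReducers(10, 2, 1.75)); // prints 35
    }
}
```

With 19 reducers on 20 slots, every reduce can start in the first wave; with 35, the faster nodes pick up a second reduce after finishing their first.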

133. Is it possible for a Job to have 0 reducers?
Ans: It is legal to set the number of reduce-tasks to zero if no reduction is desired.

134. What happens if the number of reducers is 0?
Ans: In this case the outputs of the map-tasks go directly to the FileSystem, into the
output path set by setOutputPath(Path). The framework does not sort the map-outputs
before writing them out to the FileSystem.

135. How many instances of JobTracker can run on a Hadoop Cluster?
Ans: Only one.

136. What is the JobTracker and what does it do in a Hadoop Cluster?
Ans: The JobTracker is a daemon service for submitting and tracking MapReduce
jobs in the Hadoop cluster. It runs in its own JVM process, usually on a
separate machine, and each slave node is configured with the location of the JobTracker node.
The JobTracker is a single point of failure for the Hadoop MapReduce service. If it
goes down, all running jobs are halted.
The JobTracker performs the following actions:
• Client applications submit jobs to the JobTracker.
• The JobTracker talks to the NameNode to determine the location of the data.
• The JobTracker locates TaskTracker nodes with available slots at or near the data.
• The JobTracker submits the work to the chosen TaskTracker nodes.
• The TaskTracker nodes are monitored. If they do not submit heartbeat signals
often enough, they are deemed to have failed and the work is scheduled on a
different TaskTracker.
• A TaskTracker notifies the JobTracker when a task fails. The JobTracker
then decides what to do: it may resubmit the job elsewhere, it may mark that
specific record as something to avoid, and it may even blacklist the
TaskTracker as unreliable.
• When the work is completed, the JobTracker updates its status.
• Client applications can poll the JobTracker for information.
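
The heartbeat monitoring step in the list above can be sketched in plain Java. This is not the JobTracker's actual code: the class name HeartbeatMonitorSketch, its methods, and the one-second timeout in the example are all hypothetical, chosen only to show the idea of declaring a TaskTracker failed once its last heartbeat is too old.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; not the Hadoop API.
class HeartbeatMonitorSketch {
    private final long timeoutMillis; // hypothetical expiry window
    // Last heartbeat time per TaskTracker name; TreeMap keeps names in sorted order.
    private final Map<String, Long> lastHeartbeat = new TreeMap<>();

    HeartbeatMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat arrives from a TaskTracker.
    void recordHeartbeat(String tracker, long nowMillis) {
        lastHeartbeat.put(tracker, nowMillis);
    }

    // Returns the trackers whose last heartbeat is older than the timeout.
    List<String> findFailed(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                failed.add(entry.getKey());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(1_000);
        monitor.recordHeartbeat("tt1", 0);
        monitor.recordHeartbeat("tt2", 900);
        // At t = 1500 ms only tt1 has been silent for longer than the timeout.
        System.out.println(monitor.findFailed(1_500)); // prints [tt1]
    }
}
```

In this sketch, the work assigned to any tracker returned by findFailed would be the work that gets rescheduled on a different TaskTracker.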

137. How is a task scheduled by the JobTracker?
Ans: The TaskTrackers send heartbeat messages to the JobTracker, usually
every few seconds, to reassure the JobTracker that they are still alive. These
messages also inform the JobTracker of the number of available slots, so the
JobTracker can stay up to date with where in the cluster work can be delegated.
When the JobTracker tries to find somewhere to schedule a task within the
MapReduce operations, it first looks for an empty slot on the same server that
hosts the DataNode containing the data. If there is none, it looks for an empty slot on a
machine in the same rack.
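
The two-step lookup described above might be sketched like this in plain Java. The Tracker class, the pickHost method, and the host and rack names are hypothetical and exist only for this example. The answer does not say what happens when neither step finds a slot, so the sketch simply returns an empty result in that case.

```java
import java.util.List;
import java.util.Optional;

// Plain-Java illustration only; not the Hadoop scheduler.
class LocalitySchedulerSketch {

    // Hypothetical view of a TaskTracker as the scheduler sees it.
    static final class Tracker {
        final String host;
        final String rack;
        final int freeSlots;

        Tracker(String host, String rack, int freeSlots) {
            this.host = host;
            this.rack = rack;
            this.freeSlots = freeSlots;
        }
    }

    // Picks a host for a task whose input data lives on dataHost in dataRack.
    static Optional<String> pickHost(List<Tracker> trackers, String dataHost, String dataRack) {
        // Step 1: an empty slot on the server that holds the data.
        for (Tracker t : trackers) {
            if (t.freeSlots > 0 && t.host.equals(dataHost)) {
                return Optional.of(t.host);
            }
        }
        // Step 2: an empty slot on another machine in the same rack.
        for (Tracker t : trackers) {
            if (t.freeSlots > 0 && t.rack.equals(dataRack)) {
                return Optional.of(t.host);
            }
        }
        // Neither step found a slot; further fallback is outside this sketch.
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Tracker> trackers = List.of(
                new Tracker("node1", "rackA", 0),
                new Tracker("node2", "rackA", 1),
                new Tracker("node3", "rackB", 2));
        // node1 holds the data but has no free slot, so a rack-local node is chosen.
        System.out.println(pickHost(trackers, "node1", "rackA").orElse("none")); // prints node2
    }
}
```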

138. How many instances of Tasktracker run on a Hadoop cluster?
Ans: There is one TaskTracker daemon process for each slave node in the
Hadoop cluster.

139. What are the two main parts of the Hadoop framework?
Ans: Hadoop consists of two main parts:
• Hadoop Distributed File System (HDFS), a distributed file system with high throughput.
• Hadoop MapReduce, a software framework for processing large data sets.

140. Explain the use of TaskTracker in the Hadoop cluster?
Ans: A TaskTracker is a daemon on a slave node in the cluster that accepts tasks
(Map, Reduce, and Shuffle operations) from the JobTracker. The TaskTracker also runs in
its own JVM process.
Every TaskTracker is configured with a set of slots; these indicate the number of
tasks that it can accept. The TaskTracker starts separate JVM processes to do
the actual work (called Task Instances); this ensures that a process failure
does not take down the TaskTracker.
The TaskTracker monitors these Task Instances, capturing their output and exit
codes. When the Task Instances finish, successfully or not, the TaskTracker
notifies the JobTracker.
The TaskTrackers also send heartbeat messages to the JobTracker, usually
every few seconds, to reassure the JobTracker that they are still alive. These
messages also inform the JobTracker of the number of available slots, so the
JobTracker can stay up to date with where in the cluster work can be delegated.
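
The slot bookkeeping described in this answer can be pictured with a small plain-Java sketch. The class TaskSlotsSketch and its methods are hypothetical and not taken from Hadoop; they only show how a fixed number of slots limits the tasks accepted and yields the "available slots" figure reported in a heartbeat.

```java
// Plain-Java illustration only; not the Hadoop TaskTracker.
class TaskSlotsSketch {
    private final int totalSlots; // configured number of slots
    private int usedSlots = 0;    // slots currently running a task

    TaskSlotsSketch(int totalSlots) {
        this.totalSlots = totalSlots;
    }

    // Accepts a task only if a slot is free; returns whether it was accepted.
    boolean tryAccept() {
        if (usedSlots < totalSlots) {
            usedSlots++;
            return true;
        }
        return false;
    }

    // Frees a slot when a task instance finishes, successfully or not.
    void taskFinished() {
        if (usedSlots > 0) {
            usedSlots--;
        }
    }

    // The number a heartbeat would report as available.
    int availableSlots() {
        return totalSlots - usedSlots;
    }

    public static void main(String[] args) {
        TaskSlotsSketch tracker = new TaskSlotsSketch(2);
        System.out.println(tracker.tryAccept());      // prints true
        System.out.println(tracker.tryAccept());      // prints true
        System.out.println(tracker.tryAccept());      // prints false, both slots are busy
        tracker.taskFinished();
        System.out.println(tracker.availableSlots()); // prints 1
    }
}
```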
