131. Explain the Reducer’s reduce phase?
Ans: In this phase the reduce(MapOutKeyType, Iterable<MapOutValType>, Context) method is called for each <key, (collection of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via Context.write(ReduceOutKeyType, ReduceOutValType). Applications can use the Context to report progress, set application-level status messages, update Counters, or simply indicate that they are alive. Note that the output of the Reducer is not re-sorted.
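For illustration, here is a minimal Reducer sketch using the standard org.apache.hadoop.mapreduce API; the word-count style Text/IntWritable types are assumptions, not part of the question:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    // The framework groups all values for the same key into one Iterable.
    for (IntWritable value : values) {
      sum += value.get();
    }
    result.set(sum);
    // Write the reduced pair to the FileSystem via the Context.
    context.write(key, result);
    // Optionally report liveness/progress for long-running reduces.
    context.progress();
  }
}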
132. How many Reducers should be configured?
Ans: The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapreduce.tasktracker.reduce.tasks.maximum). With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing. Increasing the number of reduces increases the framework overhead, but it also improves load balancing and lowers the cost of failures.
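For example, on an assumed cluster of 10 nodes with 2 reduce slots each, the rule of thumb gives 0.95 * 20 = 19 or 1.75 * 20 = 35 reduces. A minimal sketch of applying it (the cluster figures and job name are assumptions for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCount {
  public static void main(String[] args) throws Exception {
    int nodes = 10;
    int reduceSlotsPerNode = 2;  // mapreduce.tasktracker.reduce.tasks.maximum

    // 0.95 * total slots: all reduces launch in a single wave.
    int singleWave = (int) (0.95 * nodes * reduceSlotsPerNode);  // 19
    // 1.75 * total slots: a second wave improves load balancing.
    int doubleWave = (int) (1.75 * nodes * reduceSlotsPerNode);  // 35

    Job job = new Job(new Configuration(), "example");
    job.setNumReduceTasks(singleWave);
  }
}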
133. Is it possible for a Job to have 0 reducers?
Ans: Yes, it is legal to set the number of reduce-tasks to zero if no reduction is desired.
134. What happens if the number of reducers is 0?
Ans: In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them out to the FileSystem.
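A minimal sketch of such a map-only job, assuming the standard org.apache.hadoop.mapreduce API (the output path taken from args is an illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "map-only");
    job.setNumReduceTasks(0);  // no reduce phase at all
    // Each map task writes its (unsorted) output directly to this path,
    // as part-m-NNNNN files.
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
  }
}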
135. How many instances of JobTracker can run on a Hadoop cluster?
Ans: Only one.
136. What is the JobTracker and what does it perform in a Hadoop cluster?
Ans: The JobTracker is a daemon service that submits and tracks MapReduce tasks on the Hadoop cluster. It runs in its own JVM process, usually on a separate machine, and each slave node is configured with the JobTracker node's location.
The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted.
The JobTracker in Hadoop performs the following actions:
- Client applications submit jobs to the JobTracker.
- The JobTracker talks to the NameNode to determine the location of the data.
- The JobTracker locates TaskTracker nodes with available slots at or near the data.
- The JobTracker submits the work to the chosen TaskTracker nodes.
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
- A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
- When the work is completed, the JobTracker updates its status.
- Client applications can poll the JobTracker for information.
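A minimal sketch of pointing a client at the JobTracker, assuming the classic MR1 property mapred.job.tracker (renamed mapreduce.jobtracker.address in later releases); the host and port are placeholders:

import org.apache.hadoop.mapred.JobConf;

public class JobTrackerLocation {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Normally set once in mapred-site.xml on every node rather than in code.
    conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
  }
}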
137. How is a task scheduled by the JobTracker?
Ans: The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data; if none is available, it looks for an empty slot on a machine in the same rack.
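This preference order can be shown with a simplified sketch; note that TaskTrackerInfo and its methods are invented for illustration and are not Hadoop APIs:

import java.util.List;

class Scheduler {
  static TaskTrackerInfo pickTracker(List<TaskTrackerInfo> trackersWithFreeSlots,
                                     String dataNodeHost, String dataNodeRack) {
    // 1. Prefer a free slot on the server hosting the data block.
    for (TaskTrackerInfo t : trackersWithFreeSlots) {
      if (t.host().equals(dataNodeHost)) return t;
    }
    // 2. Otherwise, prefer a free slot on a machine in the same rack.
    for (TaskTrackerInfo t : trackersWithFreeSlots) {
      if (t.rack().equals(dataNodeRack)) return t;
    }
    // 3. Fall back to any TaskTracker with a free slot (off-rack).
    return trackersWithFreeSlots.isEmpty() ? null : trackersWithFreeSlots.get(0);
  }
}

interface TaskTrackerInfo {
  String host();
  String rack();
}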
138. How many instances of TaskTracker run on a Hadoop cluster?
Ans: There is one TaskTracker daemon process for each slave node in the Hadoop cluster.
139. What are the two main parts of the Hadoop framework?
Ans: Hadoop consists of two main parts:
- the Hadoop Distributed File System (HDFS), a distributed file system with high throughput, and
- Hadoop MapReduce, a software framework for processing large data sets.
140. Explain the use of TaskTracker in the Hadoop cluster?
Ans: A TaskTracker is a slave node in the cluster that accepts tasks from the JobTracker, such as Map, Reduce or Shuffle operations. The TaskTracker also runs in its own JVM process.
Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a process failure does not take down the TaskTracker.
The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker.
As described in question 137, the TaskTracker also sends out heartbeat messages to the JobTracker every few minutes, to reassure the JobTracker that it is still alive and to report the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
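A minimal sketch of the per-TaskTracker slot settings (normally placed in mapred-site.xml on each slave rather than set in code; the values 4 and 2 are assumptions for illustration):

import org.apache.hadoop.conf.Configuration;

public class SlotConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Number of map tasks this TaskTracker may run simultaneously.
    conf.setInt("mapreduce.tasktracker.map.tasks.maximum", 4);
    // Number of reduce tasks this TaskTracker may run simultaneously.
    conf.setInt("mapreduce.tasktracker.reduce.tasks.maximum", 2);
  }
}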