What are the side data distribution techniques?
Side data is the small, static, extra data that a MapReduce job needs in
order to do its work. The main challenge is making the side data available on
the node where the map task runs. Hadoop provides two side data distribution
techniques.
Using the job configuration
An arbitrary key-value pair can be set in the job configuration. This is a
very useful technique when the side data is small; the suggested size to keep
in the configuration object is a few KBs, because the configuration is read by
the job tracker, the task tracker, and every new child JVM, so larger data
increases overhead at every step. Apart from this, side data that is not of a
primitive type has to be serialized before it can be stored.
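A minimal sketch of the configuration approach described above, assuming the side data is a small stop-word list. Hadoop's Configuration class stores string key-value pairs through set and get; so that the example runs without Hadoop on the classpath, a plain Map<String, String> stands in for it here. The key name sidedata.stopwords and the class ConfSideDataSketch are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: a Map stands in for the job configuration (string keys and values).
final class ConfSideDataSketch {
    // Hypothetical key name chosen for this example.
    static final String STOPWORDS_KEY = "sidedata.stopwords";

    // Driver side: serialize the non-primitive side data (a set) into one string value.
    static void putStopWords(Map<String, String> conf, Set<String> words) {
        conf.put(STOPWORDS_KEY, String.join(",", words));
    }

    // Task side: read the value back and deserialize it before use.
    static Set<String> getStopWords(Map<String, String> conf) {
        String raw = conf.getOrDefault(STOPWORDS_KEY, "");
        Set<String> words = new LinkedHashSet<>();
        for (String w : raw.split(",")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        putStopWords(conf, new LinkedHashSet<>(Arrays.asList("a", "an", "the")));
        System.out.println(getStopWords(conf).contains("the"));
    }
}
```

The comma-separated encoding only works for values that contain no commas, which is one reason this approach is best kept to small, simple data.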
Distributed Cache
Rather than serializing side data in the job configuration, it is
preferable to distribute datasets using Hadoop’s distributed cache mechanism.
This provides a service for copying files and archives to the task nodes in
time for the tasks to use them when they run. To save network bandwidth, files
are normally copied to any particular node once per job.
What is shuffling in MapReduce?
As soon as map tasks start to complete, the reducers begin to communicate
with them: each map output is sent to the reducer that is waiting for that
data to process. At the same time, the data nodes are still processing
multiple other tasks. This transfer of mapper output to the reducers is known
as shuffling.
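The effect of this transfer can be sketched in plain Java. The example assumes a single reducer and word-count style output, and models each mapper's output as a Map<String, Integer>; the class ShuffleSketch and its method are hypothetical and do not reproduce Hadoop's actual network transfer or on-disk merge.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: models what a reducer holds once the outputs of all mappers arrive.
final class ShuffleSketch {
    // Merge the outputs of all mappers into one key -> values grouping,
    // sorted by key, which is the shape of input a reducer works on.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> oneMapper : mapOutputs) {
            for (Map.Entry<String, Integer> pair : oneMapper.entrySet()) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, Integer> mapper1 = new HashMap<>();
        mapper1.put("cat", 2);
        mapper1.put("dog", 1);
        Map<String, Integer> mapper2 = new HashMap<>();
        mapper2.put("cat", 1);

        List<Map<String, Integer>> outputs = new ArrayList<>();
        outputs.add(mapper1);
        outputs.add(mapper2);
        System.out.println(shuffle(outputs));
    }
}
```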
What is partitioning?
Partitioning is the process of identifying which reducer instance will
receive a given piece of mapper output. Before a mapper emits a key-value pair
to a reducer, it identifies the reducer that is the recipient of that pair.
All values for a given key, no matter which mapper generated them, must go to
the same reducer.
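One way to meet this guarantee is to derive the reducer from the key's hash. The sketch below mirrors the arithmetic of Hadoop's default hash partitioner (clear the sign bit, then take the remainder by the number of reducers), but it uses plain String keys as an assumption, so the actual partition numbers can differ from those of a real job. The class HashPartitionSketch is hypothetical.

```java
// Sketch only: hash-based choice of reducer for a key.
final class HashPartitionSketch {
    // Clear the sign bit so the result is never negative, then take the
    // remainder by the number of reducers to get an index in [0, numReducers).
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // The same key emitted by two different mappers maps to the same reducer,
        // because the result depends only on the key and the reducer count.
        int fromMapperA = partitionFor("hadoop", reducers);
        int fromMapperB = partitionFor("hadoop", reducers);
        System.out.println(fromMapperA == fromMapperB);
    }
}
```

Because the function depends only on the key and the reducer count, every mapper reaches the same decision without coordinating with the others.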
Can we change the file cached by DistributedCache?
No. DistributedCache tracks cached files by timestamp, so a cached file
should not be changed during job execution.
What is Distributed Cache in the MapReduce framework?
Distributed cache is an important feature provided by the MapReduce
framework. It can cache text files, archives, and jars that the application
uses, which can improve performance. The application gives the details of the
files to cache to the JobConf object. The MapReduce framework copies the
specified files to the data nodes before the job's tasks run. The framework
copies each file only once per job, and it can also cache archives. The
application specifies the path of each file to cache as an http:// or hdfs://
URL.
What is speculative execution in Hadoop?
Speculative execution becomes very important when dealing with a very large
cluster. Suppose you have thousands of machines in your cluster and one of the
data nodes is performing poorly in comparison to the others. It would degrade
the overall performance of a job executed by the whole cluster. Speculative
execution is a technique in which Hadoop runs multiple copies of an MR task on
other data nodes. Whichever machine finishes the task first has its result
used, and the other copies are killed.
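The "first copy to finish wins" idea can be sketched with the standard java.util.concurrent API. This is an illustration under assumptions, not Hadoop's scheduler: both attempts start at once for simplicity, the sleep times stand in for node speed, and the node names and the class SpeculativeExecutionSketch are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: run duplicate attempts of one task and keep the first result.
final class SpeculativeExecutionSketch {
    // One attempt of the same task; delayMillis models how slow the node is.
    static Callable<String> attempt(String node, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis);
            return node;
        };
    }

    // Run all attempts in parallel and return the result of the first one
    // that completes; invokeAny cancels the attempts that are still running.
    static String firstToFinish(List<Callable<String>> attempts) {
        ExecutorService pool = Executors.newFixedThreadPool(attempts.size());
        try {
            return pool.invokeAny(attempts);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("no attempt completed", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String winner = firstToFinish(Arrays.asList(
                attempt("slow-node", 2000),
                attempt("healthy-node", 50)));
        System.out.println("Result taken from: " + winner);
    }
}
```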
What if the job tracker machine is down?
The job tracker is a single point of failure from an execution point of
view.
Can we deploy the job tracker on a machine other than the name node?
Yes, and in production it is highly recommended.
For self-development and learning you may set it up according to your needs.
What is a task tracker?
The task tracker is the component that actually deploys the MapReduce jar on
the data nodes and is responsible for executing the tasks given to it. It
continuously executes tasks and sends updated reports to the job tracker.
What is a job tracker?
The job tracker is a background service, run on the name node in simple
setups, for submitting and tracking jobs. A job in Hadoop terminology refers
to a MapReduce job. The job tracker breaks the job up into tasks, which are
deployed to every data node holding the required data. In a Hadoop cluster,
the job tracker is the master and the task tracker acts like a child: it
performs the tasks and reports progress back to the job tracker through
heartbeats.