Latest Hadoop Interview Questions - Part 14

What are the Side Data Distribution Techniques?
Side data refers to the extra, static, usually small data that a MapReduce job needs in order to process its main dataset. The main challenge is making this side data available on every node where a map (or reduce) task will run. Hadoop provides two side data distribution techniques.
Using Job Configuration
Arbitrary key-value pairs can be set in the job configuration. This is a useful technique only for very small side data, ideally no more than a few kilobytes, because the configuration object is read by the job tracker, the task trackers, and every child JVM, so large values add overhead on every front. In addition, side data with a non-primitive encoding must be serialized into a string before it can be stored in the configuration and deserialized when it is read back.
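
A minimal sketch of this technique, assuming a string property named myjob.lookup.value (the property name and mapper are illustrative, not part of any standard API):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper that reads a small piece of side data back out of the job configuration.
    public class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {

        private String lookupValue;

        @Override
        protected void setup(Context context) {
            // Every task JVM receives the same configuration, so the value set
            // by the driver is visible here.
            Configuration conf = context.getConfiguration();
            lookupValue = conf.get("myjob.lookup.value", "default");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tag every record with the side-data value (purely illustrative).
            context.write(new Text(lookupValue), value);
        }
    }

    // Driver side:
    //   Configuration conf = new Configuration();
    //   conf.set("myjob.lookup.value", "some-small-string");
    //   Job job = Job.getInstance(conf, "side-data-example");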
Distributed Cache
Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop’s distributed cache mechanism. This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node once per job.
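
A driver-side sketch of registering a file with the distributed cache using the Job API; the class name and HDFS path are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Driver that registers a file with the distributed cache.
    public class CacheDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "distributed-cache-example");
            job.setJarByClass(CacheDriver.class);

            // The framework copies this file to each task node before the job's
            // tasks run there -- once per job, not once per task.
            job.addCacheFile(new URI("hdfs://namenode/path/to/lookup.txt"));

            // ... mapper, reducer, input and output paths are set here ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }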

What is shuffling in MapReduce?

As map tasks start to complete, the reducers begin communicating with them: each reducer fetches the map output partitions assigned to it, while the nodes continue to run other tasks at the same time. This transfer of map output to the reducers is known as shuffling.

What is partitioning?

Partitioning is the process of determining which reducer instance will receive a given piece of map output. Before the mapper emits a (key, value) pair, the partitioner identifies the reducer that will be the recipient of that output. All values for a given key, no matter which mapper generated them, must end up at the same reducer.
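
A minimal sketch of a custom partitioner that routes every occurrence of the same key to the same reducer; the class name and key/value types are illustrative:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes every occurrence of the same key to the same reducer by hashing the key.
    public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask off the sign bit so the partition number is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

    // Registered on the job with:
    //   job.setPartitionerClass(KeyHashPartitioner.class);
    //   job.setNumReduceTasks(4);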

Can we change the file cached by DistributedCache?

No. DistributedCache tracks cached files by their timestamps, so a cached file should not be changed while the job is executing.

What is the distributed cache in the MapReduce framework?

The distributed cache is an important feature provided by the MapReduce framework. It can cache read-only text files, archives, and JARs, which applications can use to improve performance. The application registers the files to be cached with the job configuration, and the framework copies them to the task nodes before the job's tasks run there. Files are copied only once per job, and cached archives are unarchived on the task nodes. The application specifies the files to cache by URL, e.g. hdfs:// or http:// paths.
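
A mapper-side sketch of reading a cached file, assuming the driver registered it with job.addCacheFile(...) and that it is reachable under its base name lookup.txt in the task's working directory; the file name and tab-separated format are illustrative:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper that loads a cached lookup file (tab-separated key/value pairs) in setup().
    public class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> lookup = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // getCacheFiles() lists the URIs registered by the driver; the files
            // themselves are localized on the task node, typically reachable in
            // the task's working directory under their base names.
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles != null && cacheFiles.length > 0) {
                try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        String[] parts = line.split("\t", 2);
                        if (parts.length == 2) {
                            lookup.put(parts[0], parts[1]);
                        }
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Enrich each input record with the cached side data.
            String enriched = lookup.getOrDefault(value.toString(), "unknown");
            context.write(value, new Text(enriched));
        }
    }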

What is speculative execution in Hadoop?

Speculative execution becomes important when dealing with a very large cluster. If you have thousands of machines and one data node is performing poorly compared with the others, it can degrade the overall completion time of a job running on the whole cluster. With speculative execution, Hadoop launches duplicate copies of a slow-running task on other data nodes; whichever copy finishes first provides the result, and the remaining attempts are killed.
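
Speculative execution is enabled by default and can be toggled per job. A sketch using the Hadoop 2.x property names (older releases use mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Toggling speculative execution for a single job.
    public class SpeculationConfig {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Turn speculation off when duplicate task attempts could cause
            // trouble, e.g. tasks that write to an external system.
            conf.setBoolean("mapreduce.map.speculative", false);
            conf.setBoolean("mapreduce.reduce.speculative", false);

            Job job = Job.getInstance(conf, "speculation-example");
            // ... rest of the job setup ...
        }
    }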

What if job tracker machine is down?

The job tracker is a single point of failure from an execution point of view: if its machine goes down, no new jobs can be submitted or tracked and running jobs fail until it is restored.

Can we deploy job tracker other than name node?

Yes, and in production it is highly recommended to run the job tracker on a machine other than the name node. For self-study and development you may set them up on a single node according to your needs.

What is a task tracker?

The task tracker is the component that runs on the data nodes, receives the MapReduce tasks assigned to it, and executes them in child JVMs. It continuously runs its tasks and sends progress reports back to the job tracker through heartbeats.

What is a job tracker?

The job tracker is a background service, typically run on the master node (often the same machine as the name node in small clusters), for submitting and tracking jobs. A job in Hadoop terminology refers to a MapReduce job. The job tracker breaks each job into tasks, which are deployed to the data nodes holding the required data. In a Hadoop cluster the job tracker is the master and the task trackers act as its workers: they perform the tasks and report progress back to the job tracker through heartbeats.
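
A minimal client-side job-submission sketch; the cluster then splits the submitted job into map and reduce tasks. The identity Mapper and Reducer are used only to keep the example self-contained:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Client-side job submission: the job is handed to the cluster, which splits
    // it into map and reduce tasks and schedules them near the data.
    public class SubmitJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "example-job");
            job.setJarByClass(SubmitJob.class);

            // Identity mapper/reducer keep the sketch self-contained; a real job
            // sets its own implementations here.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // waitForCompletion() submits the job and polls its progress until it finishes.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }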