Hadoop: Latest Hadoop Interview Questions - Part 6

1. What is Hadoop?

• Apache Hadoop is an open-source software framework that supports data-intensive distributed applications.

• The Hadoop platform consists of the Hadoop kernel, the MapReduce component, and HDFS (Hadoop Distributed File System).

• Hadoop is written in the Java programming language and is a top-level Apache project being built and used by a global community of contributors.

• Hadoop is the best-known technology used for Big Data.

• Two high-level languages are closely associated with Hadoop: Pig and Hive.

• In a Hadoop system, data is distributed across many nodes (potentially thousands) and processed in parallel.

• Hadoop deals with the complexities of high volume, velocity, and variety of data.

• Hadoop is designed primarily for batch processing.

• Hadoop can store petabytes of data reliably.

• Data remains accessible even if a machine breaks down or drops off the network.

• One can use MapReduce programs to access and manipulate the data. The developer need not worry about where the data is stored; they can reference it through the single view provided by the master node, which stores the metadata for all files stored across the cluster.
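
As a rough illustration of the MapReduce model mentioned above, the following single-machine Python sketch mimics the map, shuffle, and reduce steps for a word count. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and a real job would go through Hadoop's own APIs and run across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on a single machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data in parallel"]
    print(word_count(sample))
```

In a real cluster, many mappers and reducers would run at the same time on different nodes; the sketch only shows how the work is split into the two phases.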

2. What is Big Data?

Big Data is large in quantity, is captured at a rapid rate, and is structured or unstructured, or some combination of the above. Such data is difficult to capture, mine, and manage using traditional methods; Big Data technologies are built to handle it. There is so much hype in this space that there could be an extended debate just about the definition of Big Data.

Big Data technology is not restricted to large volumes. As of 2012, large clusters are in the 100-petabyte range.

Traditional relational databases, like Informix and DB2, provide proven solutions for structured data. Via extensibility, they also manage unstructured data. Hadoop brings new and more accessible programming techniques for working on massive data stores with both structured and unstructured data.

3. Advantages of Hadoop

• Bringing compute and storage together on commodity hardware: The result is blazing speed at low cost.

• Price performance: The Hadoop big data technology provides significant cost savings (think a factor of approximately 10) with significant performance improvements (again, think a factor of 10). Your mileage may vary. If the existing technology can be so dramatically trounced, it is worth examining whether Hadoop can complement or replace aspects of your current architecture.

• Linear scalability: Every parallel technology makes claims about scaling up. Hadoop has genuine scalability, since the latest release is expanding the limit on the number of nodes to beyond 4,000.

• Full access to unstructured data: Building a highly scalable data store with a good parallel programming model has been a challenge for the industry for some time. Hadoop's programming model, MapReduce, does not solve all problems, but it is a strong solution for many tasks.

4. Definition of Big Data

According to Gartner, Big Data can be defined as high-volume, high-velocity, and high-variety information that requires innovative and cost-effective forms of information processing for enhanced decision making.

5. How does Big Data differ from a database?

Datasets that are beyond the ability of a conventional database to store, analyze, and manage can be defined as Big Data. Big Data technology extracts the required information from very large volumes of data, whereas a database is limited by its storage capacity.

6. The 3 Vs of Big Data - Explain (Important)

Big Data can be defined with the help of the 3 Vs (Volume, Velocity, and Variety).

Volume: It describes the amount of data generated by organizations or individuals, and so determines how much storage is required.

Velocity: It describes the frequency at which data is generated, changed, processed, and shared, and so determines how quickly the data must be accessible.

Variety: The data can be structured, unstructured, or semi-structured.

These 3 Vs were originally considered sufficient to define Big Data, but one more V (Value) is now commonly added.

Value: It is the outcome of analysing Big Data, that is, the insight that benefits the business.

7. Who is using Hadoop? Give some examples.

• A9.com
• Amazon
• Adobe
• AOL
• Baidu
• Cooliris
• Facebook
• NSF-Google
• IBM
• LinkedIn
• Ning
• PARC
• Rackspace
• StumbleUpon
• Twitter
• Yahoo!

8. Hadoop Stack - Structure

9. Pig for Hadoop - Give some points

Pig is a data-flow-oriented language for analyzing large data sets.
It is a platform that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

• Ease of programming: It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.

• Optimization opportunities: The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

• Extensibility: Users can create their own functions to do special-purpose processing.
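
To make the idea of a "data flow sequence" more concrete, here is a small Python sketch that chains load, filter, group, and aggregate steps, each feeding the next. It only imitates the style of a data-flow program on one machine; it is not Pig Latin and does not use Pig. The record layout (user, page, seconds) and all function names are assumptions invented for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (user, page, seconds_on_page)
Record = Tuple[str, str, int]


def load(lines: Iterable[str]) -> List[Record]:
    """Load step: parse comma-separated text lines into typed tuples."""
    records: List[Record] = []
    for line in lines:
        user, page, seconds = line.strip().split(",")
        records.append((user, page, int(seconds)))
    return records


def keep_long_visits(records: List[Record], min_seconds: int) -> List[Record]:
    """Filter step: drop visits shorter than min_seconds."""
    return [r for r in records if r[2] >= min_seconds]


def group_by_user(records: List[Record]) -> Dict[str, List[Record]]:
    """Group step: collect each user's records together."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)
    return dict(groups)


def total_seconds(groups: Dict[str, List[Record]]) -> Dict[str, int]:
    """Aggregate step: sum the time spent per user."""
    return {user: sum(r[2] for r in recs) for user, recs in groups.items()}


def run_pipeline(lines: Iterable[str], min_seconds: int = 10) -> Dict[str, int]:
    """Chain the steps in order, one named stage feeding the next."""
    loaded = load(lines)
    filtered = keep_long_visits(loaded, min_seconds)
    grouped = group_by_user(filtered)
    return total_seconds(grouped)


if __name__ == "__main__":
    raw = ["alice,home,30", "bob,home,5", "alice,search,45", "bob,cart,60"]
    print(run_pipeline(raw))
```

Because every stage is a separate, named transformation, the flow is easy to read from top to bottom, which is the property the Pig description above emphasizes.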

Features of Pig:

– data transformation functions
– datatypes include sets, associative arrays, tuples
– high-level language for marshalling data
– developed at Yahoo!

10. Hive for Hadoop - Give some points

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Key points:
• SQL-based data warehousing application
– features similar to Pig
– more strictly SQL-type
• Supports SELECT, JOIN, GROUP BY, etc.
• Analyzing very large data sets
– log processing, text mining, document indexing
• Developed at Facebook
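
The kind of query listed above (SELECT, JOIN, GROUP BY) can be sketched as follows. Note the assumptions: this example uses Python's built-in sqlite3 module as a stand-in, not Hive, and HiveQL differs from SQLite's SQL dialect in its details. The tables (users, page_views), their columns, and the helper function names are hypothetical and exist only for illustration.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (user_id INTEGER, country TEXT);
        CREATE TABLE page_views (user_id INTEGER, url TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(1, "US"), (2, "IN"), (3, "US")],
    )
    conn.executemany(
        "INSERT INTO page_views VALUES (?, ?)",
        [(1, "/home"), (1, "/cart"), (2, "/home"), (3, "/home")],
    )
    conn.commit()
    return conn


def views_per_country(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Join page views to users and count the views for each country."""
    query = """
        SELECT u.country, COUNT(*) AS views
        FROM page_views v
        JOIN users u ON v.user_id = u.user_id
        GROUP BY u.country
        ORDER BY u.country
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = build_demo_db()
    for country, views in views_per_country(connection):
        print(country, views)
    connection.close()
```

The point of the sketch is the shape of the query: the user states what result is wanted, and the system decides how to execute it. In Hive, that execution would be carried out on data stored in Hadoop rather than in a local database.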
