26 Data Intensive Computing

 Hello students,

In this last blog of Unit 3, we look at Data Intensive Computing.

Data-intensive computing refers to computing problems where the data is extremely large in both count and size, ranging from a few megabytes to terabytes, petabytes, exabytes and even zettabytes. Data of such sizes is called Big Data.

There has been a data explosion over the last many years as technology has matured and advanced.

Databases based on the relational model, made up of tables, records, keys and relationships, are not able to manage Big Data because of the huge, exponentially growing volume of data, their strictly structured format and their ACID restrictions.

NoSQL databases were the answer to Big Data: to all data that was huge in volume and was unstructured or semi-structured.

Big Data can be split horizontally as well as vertically into manageable datasets, which can be stored in repositories as required.

One can query these datasets to learn about their contents using the metadata attributes associated with them.

Any Data-Intensive application needs the following 8 things to perform extremely well: 

  1. Scalable Algorithm
  2. Metadata Technology
  3. Computing Platform
  4. Distributed File System
  5. Data Signature Generation
  6. Software Mobility
  7. Hybrid Interconnection Architectures
  8. Software Integration

Data-Intensive Computing can be driven using grids, clusters and clouds. One data-intensive computing architecture that has been in the market is the Map-Reduce Programming Model, which was introduced by Google.

Map-Reduce Programming Model 

The Map-Reduce Programming model handles embarrassingly parallel types of computations. An embarrassingly parallel problem means there may be a lot of data, but the same operations/computations need to be performed on all of the data to achieve the result.

The Map and Reduce functions are similar to what we do in many situations. For example, consider the query below:

SELECT Name, Role, Salary, Location FROM Employee WHERE Location = 'Mumbai'

The Map function can read all data that satisfies these conditions and provide the result to the Reduce function.
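A minimal Python sketch of the filtering that such a Map function could do for the query above. The Employee records below are hypothetical, just for illustration:

employees = [
    {"Name": "Asha", "Role": "Developer", "Salary": 90000, "Location": "Mumbai"},
    {"Name": "Ravi", "Role": "Tester", "Salary": 70000, "Location": "Pune"},
    {"Name": "Meera", "Role": "Manager", "Salary": 120000, "Location": "Mumbai"},
]

def map_filter(record):
    # Emit the selected columns only when the record matches the WHERE clause.
    if record["Location"] == "Mumbai":
        return {key: record[key] for key in ("Name", "Role", "Salary", "Location")}
    return None

mapped = [r for r in (map_filter(e) for e in employees) if r is not None]
print(mapped)  # records for Asha and Meera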

Now consider the below query.

SELECT COUNT(*), Location FROM Employee WHERE Location = 'Mumbai' GROUP BY Location

The above query consists of two steps:

Step 1. Collect and filter all data that matches the location = 'Mumbai'

Step 2. Count all records from Step 1.

Step 1 is what a Map function would usually do, whereas Step 2 is what a Reduce function would do.

This is an embarrassingly parallel problem because the same computations of collecting, filtering and counting data are applied to all records.
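A minimal Python sketch of these two steps, again using hypothetical Employee records: the Map step filters and emits (Location, 1) pairs, and the Reduce step sums the counts per location.

from collections import defaultdict

employees = [
    {"Name": "Asha", "Location": "Mumbai"},
    {"Name": "Ravi", "Location": "Pune"},
    {"Name": "Meera", "Location": "Mumbai"},
]

def map_step(record):
    # Step 1: filter records matching Location = 'Mumbai' and emit (key, 1) pairs.
    if record["Location"] == "Mumbai":
        yield record["Location"], 1

def reduce_step(pairs):
    # Step 2: count the pairs emitted by the Map step, grouped by key.
    counts = defaultdict(int)
    for location, one in pairs:
        counts[location] += one
    return dict(counts)

pairs = [pair for e in employees for pair in map_step(e)]
print(reduce_step(pairs))  # {'Mumbai': 2}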

Now, let's assume the total size of the data is 1 petabyte (about 1,000 terabytes). Even though the computations are simple, it will still take a lot of time to process all of this data. This is where clusters (groups of machines with similar architecture, used for a common purpose) can be utilized.

A master process distributes the tasks to various machines in the cluster so that they can be executed quickly.

The 1 petabyte of data can be split into smaller datasets of a few megabytes each and given to the Map and Reduce functions. The Map and Reduce functions run in parallel on multiple machines belonging to a particular cluster. These machines are termed worker nodes/worker processes.

A master node/machine would, in the end, collect all the data produced by the multiple Reduce functions and return the result.
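The following Python sketch only simulates this idea on a single machine with worker processes (it is not Hadoop or any real cluster runtime): the master splits the records into chunks, worker processes run the Map step in parallel, and the master combines the partial results in a Reduce step.

from collections import Counter
from multiprocessing import Pool

def map_chunk(chunk):
    # Each worker filters its chunk and counts matching records per location.
    return Counter(r["Location"] for r in chunk if r["Location"] == "Mumbai")

def reduce_counts(partial_counts):
    # The master merges the partial counts returned by the workers.
    total = Counter()
    for partial in partial_counts:
        total += partial
    return dict(total)

if __name__ == "__main__":
    records = [{"Location": "Mumbai"}, {"Location": "Pune"}, {"Location": "Mumbai"}] * 1000
    chunks = [records[i::4] for i in range(4)]        # master splits the data 4 ways
    with Pool(processes=4) as pool:
        partial_counts = pool.map(map_chunk, chunks)  # workers run in parallel
    print(reduce_counts(partial_counts))              # {'Mumbai': 2000}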

From the developer's perspective, he or she just has to write the Map function and the Reduce function and configure the data sources. The Map-Reduce runtime and the configured machines take care of all the parallel and distributed processing and return the result quickly.
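For example, with the Python mrjob library (one possible way to run Map-Reduce jobs over Hadoop; the input format below, one comma-separated employee record per line, is an assumption), the developer writes only the mapper and the reducer:

from mrjob.job import MRJob

class CountMumbaiEmployees(MRJob):

    def mapper(self, _, line):
        # Assumes input lines like: Name,Role,Salary,Location
        fields = line.split(",")
        if len(fields) == 4 and fields[3].strip() == "Mumbai":
            yield "Mumbai", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == "__main__":
    CountMumbaiEmployees.run()

The same script can be run locally for testing or submitted to a Hadoop cluster; the framework handles splitting the input, scheduling the workers and collecting the output.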

Map-Reduce Framework in Cloud

One can utilize the cloud to create the infrastructure needed for data-intensive computing; AWS ParallelCluster is one such AWS service. The Hadoop framework is an example of a Map-Reduce framework.

AWS provides the Elastic MapReduce (EMR) service to implement the Map-Reduce programming model quickly.
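A sketch of launching a small EMR cluster with the boto3 Python SDK. The cluster name, region, EMR release, instance types and IAM role names below are assumptions; your account and region may differ:

import boto3

emr = boto3.client("emr", region_name="ap-south-1")   # assumed region

response = emr.run_job_flow(
    Name="employee-count-demo",                        # hypothetical cluster name
    ReleaseLabel="emr-6.10.0",                         # assumed EMR release
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",                 # default EMR roles, if created
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])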

Cloud providers make it possible to offer both IaaS and PaaS services for Map-Reduce programming models. The cloud can easily scale to manage any size of data, infrastructure or computation.

For more details, refer to presentation 1 / presentation 2 / presentation 3.

Thank you.

