Saturday, August 24, 2013

EMC Greenplum Overview

Greenplum (www.greenplum.com), part of EMC's Data Products division, offers a massively parallel processing (MPP) data warehouse DBMS that runs on Linux and Unix. EMC offers the Greenplum database as licensed software, as an appliance, and as a certified configuration (delivered in cooperation with partners). EMC (Greenplum) was one of the first vendors to combine DBMS processing with distributed processing clusters (Hadoop).

The Greenplum Database builds on the open source database PostgreSQL. It functions primarily as a data warehouse and uses a shared-nothing, massively parallel processing (MPP) architecture. In this architecture, data is partitioned across multiple segment servers, and each segment owns and manages a distinct portion of the overall data; there is no disk-level sharing or data contention among segments.
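To make the shared-nothing idea concrete, here is a minimal Python sketch of hash-partitioning rows across segments. It is only an illustration: the segment count, the `segment_for` and `distribute` helpers, the `customer_id` column, and the use of CRC32 are all assumptions made for this example and are not Greenplum's actual hashing scheme.

```python
import zlib

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen for the example


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution-key value to a segment index using a stable hash.

    CRC32 is a stand-in here; the real database uses its own hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments=NUM_SEGMENTS):
    """Place every row on exactly one segment, chosen by its key value."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        target = segment_for(str(row[key_column]), num_segments)
        segments[target].append(row)
    return segments


# Hypothetical sample table distributed by customer_id.
rows = [{"customer_id": i, "amount": 10 * i} for i in range(1, 9)]
segments = distribute(rows, "customer_id")

for index, segment in enumerate(segments):
    print(index, [row["customer_id"] for row in segment])
```

Because each row lands on exactly one segment, every segment can scan its own portion without coordinating with the others, which is the property the paragraph above describes.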

Parallel query optimizer
Greenplum Database's parallel query optimizer is responsible for converting SQL or MapReduce into a physical execution plan. Greenplum's optimizer uses a cost-based optimization algorithm[7] to evaluate potential execution plans, takes a global view of execution across the cluster, and factors in the cost of moving data between nodes in any candidate plan. The resulting query plans contain traditional physical operations - scans, joins, sorts, aggregations, etc. - as well as parallel "motion" operations that describe when and how data should be transferred between nodes during query execution. Greenplum Database has three kinds of "motion" operations that may be found in a query plan: 

Broadcast Motion (N:N) - Every segment sends the target data to all other segments.
Redistribute Motion (N:N) - Every segment rehashes the target data (on the join column) and redistributes each row to the appropriate segment.
Gather Motion (N:1) - Every segment sends the target data to a single node (usually the master).
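The three motions can be sketched in a few lines of Python, treating each segment as a plain list of rows. This is a toy model under stated assumptions, not how the database implements them: the function names, the `stable_hash` helper, and the sample data are all hypothetical.

```python
import zlib


def stable_hash(value) -> int:
    """Deterministic hash for the example (an assumption, not Greenplum's)."""
    return zlib.crc32(str(value).encode("utf-8"))


def broadcast_motion(segments):
    """N:N - every segment ends up holding a full copy of the target data."""
    everything = [row for segment in segments for row in segment]
    return [list(everything) for _ in segments]


def redistribute_motion(segments, key):
    """N:N - rehash each row on `key` and send it to the owning segment."""
    count = len(segments)
    result = [[] for _ in range(count)]
    for segment in segments:
        for row in segment:
            result[stable_hash(row[key]) % count].append(row)
    return result


def gather_motion(segments):
    """N:1 - all segments send their rows to a single node (the master)."""
    return [row for segment in segments for row in segment]


# Hypothetical data: three segments, rows not yet placed by join key "k".
segs = [
    [{"k": 1, "v": "a"}, {"k": 2, "v": "b"}],
    [{"k": 3, "v": "c"}],
    [{"k": 1, "v": "d"}],
]

broadcasted = broadcast_motion(segs)
redistributed = redistribute_motion(segs, "k")
gathered = gather_motion(segs)
```

After `redistribute_motion`, rows that share a join key sit on the same segment, so a join on that key can run locally. `broadcast_motion` achieves the same locality by copying everything everywhere, which in this model only makes sense when the data being copied is small.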

Within the execution of each node in the query plan, multiple relational operations are processed by pipelining. Pipelining is the ability to begin a task before its predecessor task has completed, and it is key to increasing basic query parallelism. For example, while a table scan is still in progress, the rows it has already selected can be pipelined into a join.
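Python generators give a rough feel for pipelining, because a generator hands over each row as soon as it is produced instead of materializing the whole result first. The sketch below is an analogy only: the `scan` and `hash_join` functions, the sample tables, and the `events` log are hypothetical and say nothing about the executor's internals.

```python
events = []  # records the order in which operators touch rows


def scan(table, predicate):
    """Yield matching rows one at a time instead of building a full list."""
    for row in table:
        if predicate(row):
            events.append(("scan", row["id"]))
            yield row


def hash_join(probe_rows, build_table, key):
    """Build a lookup on one input, then join probe rows as they arrive."""
    lookup = {}
    for build_row in build_table:
        lookup.setdefault(build_row[key], []).append(build_row)
    for probe_row in probe_rows:
        for build_row in lookup.get(probe_row[key], []):
            events.append(("join", probe_row["id"]))
            yield {**probe_row, **build_row}


# Hypothetical sample tables.
orders = [
    {"id": 1, "cust": "a"},
    {"id": 2, "cust": "b"},
    {"id": 3, "cust": "a"},
]
customers = [{"cust": "a", "name": "Ann"}, {"cust": "b", "name": "Bob"}]

joined = list(hash_join(scan(orders, lambda row: True), customers, "cust"))
print(events)
```

The `events` log alternates between scan and join entries, showing that the join starts consuming rows before the scan has finished, rather than waiting for the complete scan result.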