How Schema On Read vs. Schema On Write Started It All

Thomas Henson

Unstructured Data Engineer and Hadoop Black Belt at Dell EMC
Thomas Henson is a blogger, author, and podcaster in the Big Data Analytics community. He is an Unstructured Data Engineer and Hadoop Black Belt at Dell EMC. Previously he helped Federal sector customers build their first Hadoop clusters. Thomas has been involved in the Hadoop community since the early Hadoop 1.0 days. Connect with him @henson_tm.

Article originally appeared as Schema On Read vs. Schema On Write Explained.

Schema On Read vs. Schema On Write

What’s the difference between schema on read and schema on write?

How did Schema on read shift the way data is stored?

Since the inception of relational databases in the 1970s, schema on write has been the de facto procedure for storing data to be analyzed. Recently, however, there has been a shift toward a schema-on-read approach, which has fueled the exploding popularity of Big Data platforms and NoSQL databases. In this post let’s take a deep dive into the differences between schema on read and schema on write.

What is Schema On Write?

Schema on write is defined as creating a schema for data before writing it into the database. If you have done any kind of development with a database, you understand the structured nature of a Relational Database Management System (RDBMS), because you have used Structured Query Language (SQL) to read data from the database.

One of the most time-consuming tasks in an RDBMS is Extract Transform Load (ETL) work. Remember, just because the data is structured doesn’t mean it starts out that way. Most of the data that exists is unstructured. Not only do you have to define the schema for the data, but you must also structure it based on that schema.
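To make the write-time contract concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the table, columns, and log format are illustrative assumptions, not from the original post:

```python
import sqlite3

# Schema on write: the structure is declared up front, before a
# single row can be stored.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE clicks (
        user_id    INTEGER NOT NULL,
        page       TEXT    NOT NULL,
        clicked_at TEXT    NOT NULL
    )
""")

# Raw input rarely arrives in that shape. A (much simplified) ETL
# step parses each raw line into the schema before the INSERT.
raw_lines = ["42|/home|2017-05-01T10:00:00"]
for line in raw_lines:
    user_id, page, ts = line.split("|")
    conn.execute("INSERT INTO clicks VALUES (?, ?, ?)",
                 (int(user_id), page, ts))

print(conn.execute("SELECT * FROM clicks").fetchall())
```

The point is that rows are validated against the declared structure at write time; a schema-on-read system would store the raw lines as-is and defer that parsing to query time.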

For example (more…)

Architecture Changes in a Bound vs. Unbound Data World

Thomas Henson

Unstructured Data Engineer and Hadoop Black Belt at Dell EMC

Originally posted as Bound vs. Unbound Data in Real Time Analytics.

Breaking The World of Processing

Streaming and real-time analytics are pushing the boundaries of our analytic architecture patterns. In the big data community we now break analytics processing down into batch or streaming. If you glance at the top contributions, most of the excitement is on the streaming side (Apache Beam, Flink, and Spark).

What is causing the break in our architecture patterns?

A huge reason for the break in our existing architecture patterns is the concept of Bound vs. Unbound data. This concept is as fundamental as the Data Lake or Data Hub, and we were dealing with it long before Hadoop. Let’s break down both Bound and Unbound data.

Bound vs. Unbound Data (more…)
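As a rough sketch of the distinction (mine, not from the original post): bound data is a finite set a batch job can process to completion, while unbound data keeps arriving, so a streaming job can only emit incremental results:

```python
import itertools
import random

# Bound data: a finite, complete dataset. A batch job can read it
# end to end and emit one final answer.
bound_readings = [3.1, 1.4, 4.1, 5.9, 2.6]
print("batch average:", sum(bound_readings) / len(bound_readings))

# Unbound data: the source never terminates, so there is no "end"
# for the job to wait for.
def sensor_stream():
    while True:
        yield random.random()

# A streaming job emits incremental results over whatever has
# arrived so far (sampling 5 events here just for the demo).
count, total = 0, 0.0
for reading in itertools.islice(sensor_stream(), 5):
    count += 1
    total += reading
    print(f"running average after {count} events: {total / count:.3f}")
```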

Distributed Analytics Meets Distributed Data with a World Wide Herd

Jean Marie Martini

Director, Data Analytics Portfolio Messaging and Strategy at Dell EMC
Jean Marie Martini is a Director of messaging and strategy across the data analytics portfolio at Dell EMC. Martini has been involved in data analytics for over ten years. Today her focus is on communicating the value of Dell EMC solutions to enable customers to begin and advance their data analytics journeys and transform their organizations into data-driven businesses. You can follow Martini on Twitter @martinij.

Originally posted on CIO.com by Patricia Florissi, Ph.D.

What is a World Wide Herd (WWH)?

What does it mean to have “Distributed analytics meet distributed data?” In short, it means linking geographically dispersed Apache™ Hadoop® instances, collectively given the title of World Wide Herd, into a single global virtual computing cluster that brings analytics capabilities to the data. In a recent CIO.com blog, Patricia Florissi, Ph.D., vice president and global CTO for sales and a distinguished engineer for Dell EMC, details how this approach enables analysis of geographically dispersed data without requiring the data to be moved to a single location before analysis. (more…)
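A toy sketch of the idea (an illustration, not Dell EMC’s WWH code): each site analyzes its own local data and ships only small aggregates to the coordinating site:

```python
from collections import Counter

# Each "site" holds data that never leaves its location.
site_data = {
    "frankfurt": ["error", "ok", "ok"],
    "boston":    ["ok", "error", "error"],
    "tokyo":     ["ok", "ok", "ok"],
}

def local_analysis(records):
    # Runs where the data lives and returns a small aggregate,
    # not the raw records.
    return Counter(records)

# Only the partial aggregates cross the wire to the coordinating
# node of the virtual cluster, which merges them globally.
global_counts = Counter()
for site, records in site_data.items():
    global_counts.update(local_analysis(records))

print(global_counts)  # the analytics traveled; the data did not
```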

Dell EMC Takes #1 Position on TPCx-BigBench for Scale Factor 10000

Nicholas Wakou

Nicholas Wakou is a Senior Principal Performance Engineer with the Dell EMC Open Source Solutions team. His work focuses on the characterization and optimization of the performance of Dell EMC Cloud and Big Data solutions. Nicholas has been engaged in industry efforts to define performance benchmark specifications. He is active on the SPEC (www.spec.org) Cloud committee and several committees of the TPC (www.tpc.org). Nicholas represents Dell Technologies on the Board of Directors of the TPC and on its Technical Advisory Board (TAB). Previously, he was Chair of the TPC Public Relations standing committee. Nicholas holds an M.S. in Electrical Engineering from Oklahoma State University, an M.S. in Microelectronics Technology from Middlesex University, London, and a B.Sc. in Electrical Engineering from Makerere University, Kampala, Uganda.

Dell EMC is focused on providing information that helps customers make the most of their big data technology investment. The failure rate for Hadoop big data projects is still too high given the maturity of the technology. Customers can’t afford to guess when designing and sizing a solution; they need to deliver optimal performance for their business use cases and to scale as needed. Dell EMC recently completed and published a new TPCx-BigBench (TPCx-BB) result that will help customers make the right choices for Hadoop performance and scalability. Today we are happy to announce that Dell EMC has taken the #1 position on TPCx-BB at scale factor 10000.

Dell EMC is the industry-leading supplier of hyper-converged, converged, and “Ready” Solutions by many standards. Dell EMC’s tested and validated Ready Bundle for Cloudera Hadoop, together with the right performance benchmark results, takes the guesswork out of Hadoop implementations.

The Transaction Processing Performance Council (TPC) is a non-profit corporation founded (more…)
