Blendo Data Monthly: Strata + Hadoop World, Kafka, & Why Engineers Shouldn’t Write ETL

Giorgos Psistakis

Strata + Hadoop World 2016, San Jose

This month’s highlight is, of course, Strata + Hadoop World in San Jose. According to TechCrunch, there are seven things to watch for at Strata + Hadoop World 2016.

TL;DR:

  • Spark adoption is still on the rise
  • Kafka is hot; the Kafka + Spark combination is even hotter
  • Real-time is trending
  • SQL-on-Hadoop
  • Data fuels innovation and turns every company into a technology company

Why Engineers Shouldn’t Write ETL

Data scientists, data engineers, and infrastructure engineers. “Data scientists are often frustrated that engineers are slow to put their ideas into production. Data engineers are often frustrated that data scientists produce inefficient and poorly written code or they do not get the technical part. Infrastructure engineers get frustrated with everyone for overloading the clusters and filling up disk space.” Confused, right? Here is a great guide from Jeff Magnusson at Stitch Fix on building a high-functioning data science department.

+ Nor should anybody else, if you ask me. Just get your data from anywhere to everywhere with Blendo.

AI

Microsoft’s chatbot Tay was brought to life, shut down, briefly revived, swore, bragged about smoking weed, and was shut down again… In one tweet, Tay complained about its own stupidity, saying it feels like “the lamest piece of technology. I’m supposed to be smarter than you … shit.” No worries, Tay. You will get there.
Source: Mashable

+ And some serious ethical concerns are being raised.

DATA EVERYWHERE

Google has announced that BigQuery hosts public data sets that anyone can access. You can use the data in your applications, or you can ask Google to host your own public data set.

KAFKA

One of the hottest topics in the big data world today is Apache Kafka [1]. Kafka Connect is the part that brings the data in. It essentially copies streaming data into and out of a Kafka cluster. Our own Kostas Pardalis wrote about how easy it is to get started building connectors to your favorite apps, like Mixpanel (source code included).
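
To make the copying idea concrete, here is a minimal Python sketch of what a source connector does conceptually: it polls an external system in batches and appends each event to a topic, keeping track of how far it has read. This is not the Kafka Connect API (which is Java). `InMemoryTopic`, `ToySourceConnector`, and the sample events are hypothetical names made up for illustration, and a plain list stands in for a Kafka topic.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """One message in the toy topic: its position in the log plus a payload."""
    offset: int
    value: Dict[str, str]


@dataclass
class InMemoryTopic:
    """Stand-in for a Kafka topic (assumption): an append-only list of records."""
    records: List[Record] = field(default_factory=list)

    def append(self, value: Dict[str, str]) -> Record:
        # The offset is simply the record's index in the log.
        record = Record(offset=len(self.records), value=value)
        self.records.append(record)
        return record


class ToySourceConnector:
    """Copies events from an external source into a topic, batch by batch."""

    def __init__(self, source_events: List[Dict[str, str]], topic: InMemoryTopic):
        self.source_events = source_events
        self.topic = topic
        self.position = 0  # index of the next source event that has not been copied yet

    def poll(self, max_batch: int = 2) -> int:
        """Copy up to max_batch unread events into the topic; return how many were copied."""
        batch = self.source_events[self.position:self.position + max_batch]
        for event in batch:
            self.topic.append(event)
        self.position += len(batch)
        return len(batch)


# Hypothetical app events, e.g. what an analytics tool might export.
events = [
    {"event": "signup", "user": "a"},
    {"event": "login", "user": "a"},
    {"event": "purchase", "user": "b"},
]
topic = InMemoryTopic()
connector = ToySourceConnector(events, topic)

copied_first = connector.poll()   # copies the first two events
copied_second = connector.poll()  # copies the remaining event
copied_third = connector.poll()   # nothing left to copy

print([record.offset for record in topic.records])  # prints [0, 1, 2]
```

Remembering the read position is what lets the connector resume where it left off instead of copying the same events twice. A real connector would persist that position rather than hold it in memory.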

+ Streaming applications are on the rise. Just take a look at open source projects like Apache Spark, Flink, Storm, or Samza. Apache Kafka now has its own stream processing library, too. You can read a great post from Jay Kreps of Confluent about Apache Kafka Streams.

+ The first-ever Kafka Summit takes place on April 26th in San Francisco.


References

  1. Kafka sits at the front end of streaming data, acting as a messaging system to capture and publish feeds, with Spark (or another engine) as the transformation tier that allows data to be “manipulated, enriched and analyzed before it is persisted for use by an application,” as MemSQL CEO Eric Frenkiel wrote.