A Roundup of Strata+Hadoop 2016 @ San Jose

Giorgos Psistakis

Another Strata+Hadoop event has come and gone, and they keep getting more interesting each time. The Strata + Hadoop World conference in San Jose has just finished.

IMHO, streaming, Apache Flink, Apache Kafka and Apache Spark were the winners of the lot. Let’s go through a roundup of the event.


As Kostas Tzoumas, co-founder and CEO of data Artisans, said to George Gilbert from theCUBE of the SiliconANGLE Media team:

“If you ask me, I wonder why stream processing hasn’t been around for a long time.”

There is real value in data that is not static. IoT and sensors, events, customer behavior and activity: everything changes and flows. Streaming enables continuous processing of data that is continuously produced. Running analytics on top of all this has its own difficulties and challenges, but at Strata+Hadoop it seemed that open source is helping make it mainstream.

“The majority of Big Data projects in the enterprise today are built resigned to the fact that extracting insight from data can be either approximated but fast (stream) or accurate but slow (batch)….This explosion of Internet of Things data is already driving demand for faster, more accurate data processing engines. Right now, that demand is being addressed by a number of dispersed technologies strapped together using custom code. This approach introduces enormous complexities and unnecessarily drives up overall development costs.”
Mark Chmarny – Director, Platform Engineering Big Data Solutions, at Intel Corp.

Data streaming is gaining popularity, as it offers decreased latency, a radically simplified data-infrastructure architecture, and the ability to cope with new data that is generated continuously.

+ A nice presentation from Joey Echeverria, Director of Engineering at Rocana, about the stream-processing landscape and data transformation in a streaming context. Get it here

As stream processing gains momentum, so do frameworks like Apache Spark and Apache Flink. Companies like King, Otto Group and Capital One are using them. But even if you are not King, continuous data applications can still be complex.

data Artisans was founded by Kostas Tzoumas and the rest of the original development team that created Flink. They recently announced that they raised about $6.5 million in a Series A round led by Intel Capital.

Kostas, in his presentation, gave a nice comparison of architectures such as batch, λ (lambda) and streaming. He also explained:

  • How Flink behaves in terms of latency. TL;DR: better than Spark, sub-second like Storm.
  • How scalable it is, and at what cost. TL;DR: better than Storm.
  • Fault tolerance. TL;DR: Flink guarantees exactly-once consistency.
  • Event time and out-of-order arrivals.
  • The future of Flink.
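The event-time point above is worth a tiny illustration. The idea Flink implements is that events are grouped by the timestamp they carry, not by when they happen to arrive. Here is a conceptual sketch in plain Python (this is not Flink code, just an illustration of tumbling event-time windows):

```python
from collections import defaultdict

def event_time_windows(events, window_size):
    """Group (timestamp, value) events into tumbling windows keyed by
    event time, regardless of the order in which they arrive."""
    windows = defaultdict(list)
    for ts, value in events:
        # Align each event to the start of its window, e.g. ts=17 -> window 10.
        windows[ts // window_size * window_size].append(value)
    return dict(windows)

# Out-of-order arrival: the event stamped 3 arrives last,
# but still lands in the [0, 10) window where it belongs.
events = [(12, "b"), (5, "a"), (17, "c"), (3, "d")]
print(event_time_windows(events, 10))
# → {10: ['b', 'c'], 0: ['a', 'd']}
```

A real engine also has to decide how long to wait for stragglers (Flink uses watermarks for that), which this sketch deliberately leaves out.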

Here is his presentation:


#Apache Kafka

According to Confluent co-founder and Kafka co-creator Neha Narkhede:

“Kafka is like a central nervous system for data.”

Kafka is one of the hottest Apache projects out there. LinkedIn uses it, Uber uses it, and many others do too, in order to process millions of events per day. Confluent is a startup that came out of LinkedIn to further develop and promote the Apache Kafka big data message streaming technology.
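The “central nervous system” metaphor comes from Kafka’s core abstraction: a partitioned, append-only log that producers write to, while each consumer group tracks its own read offset independently. A toy sketch in plain Python (purely illustrative, not the Kafka API):

```python
class MiniLog:
    """Append-only log with per-group offsets, mimicking how a single
    Kafka partition decouples producers from consumers."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer group name -> next offset to read

    def produce(self, record):
        self.records.append(record)

    def consume(self, group):
        # Each group reads from where it left off; groups never interfere.
        offset = self.offsets.get(group, 0)
        batch = self.records[offset:]
        self.offsets[group] = len(self.records)
        return batch

log = MiniLog()
log.produce("page_view")
log.produce("click")
print(log.consume("analytics"))  # → ['page_view', 'click']
log.produce("purchase")
print(log.consume("analytics"))  # → ['purchase']
print(log.consume("billing"))    # → all three: its offset started at 0
```

Because every consumer keeps its own position in the shared log, many systems can feed off the same event stream without coordinating with each other, which is what makes the nervous-system analogy apt.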

There were a number of presentations about Kafka, but the one from Todd Palino (Staff Site Reliability Engineer @ LinkedIn) and Gwen Shapira (System Architect @ Confluent) was my favorite.

+ Confluent announced public training to help companies reap the benefits of the platform. Check out Confluent University.


#Apache Spark

Spark is here to stay. It has been taking the spotlight from Hadoop for some years now, and its adoption keeps growing within the big data ecosystem.

Dean Wampler, Architect for Big Data Products at Lightbend, gave an overview of Apache Spark along with some improvements and lessons learned. You may see it here.

Here is the presentation from Alex Silva, Chief Data Architect at Pluralsight, on how to use Kafka, Spark Streaming, Akka, and Hadoop to orchestrate a real-time stack. It is actually a great way to build a solution for fast data analysis, and it has everything in there: Spark Streaming, Kafka, Avro, Akka, Druid, and about 10k lines of Scala.
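The Spark Streaming piece of such a stack rests on one idea: discretizing an unbounded stream into small batches and running ordinary batch logic on each. A minimal sketch of that micro-batch model in plain Python (a conceptual illustration, not the Spark API):

```python
def micro_batches(stream, batch_size):
    """Chop an unbounded iterator into small batches, the way Spark
    Streaming discretizes a stream into mini RDDs."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush a final partial batch
        yield batch

# Running word count batch-by-batch, updating state between batches.
counts = {}
for batch in micro_batches(iter(["a", "b", "a", "c", "a"]), 2):
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
print(counts)
# → {'a': 3, 'b': 1, 'c': 1}
```

The trade-off the talk’s stack navigates is exactly this one: micro-batching buys you batch-style simplicity and throughput at the cost of per-batch latency, which is why engines like Flink process record-by-record instead.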


Last but not least is #hadoop10. Doug Cutting, Hadoop co-creator and Chief Architect at Cloudera, led the keynote at Strata + Hadoop. He reviewed the last 10 years and provided a glimpse of the future and what it may hold for data.

“Hadoop’s legacy is creating a new way of developing an ecosystem with collaboration”

He believes that hardware has always been a bottleneck, but with what is coming in the future, this is going to change.

“We are going to have the majority of data sets stored in memory, and that’s going to change the applications that we can build.”

#et al

US Department of Commerce: US Commerce Deputy Secretary Bruce Andrews gave a keynote about how the US Department of Commerce is opening up new data sets to serve data scientists and statisticians. More on his talk here.

“We are America’s data agency…We must make it as accessible as possible…We want to unleash our data so you can use it.”

Data science for good means designing for people: Part 1 & Part 2: This was a series of discussions on the UN Sustainable Development Goals, and on what data is needed to measure and monitor them. There was also discussion of privacy matters and liabilities. Click below for the first part.

And here is the second part.

#Slides and goodies from Strata+Hadoop 2016 San Jose

If you are looking for the slides, get them from O’Reilly here.
Nice report about Mapping Big Data here.

#Strata+Hadoop NY

And do not forget: the next Strata + Hadoop is in NY in a few months. The call for speakers is ending.

Thanks for reading.