In the previous month’s Blendo Data Monthly we read about many interesting Machine Learning cases. This month?
Pikachu Ascending
Source: The Verge / Courtesy Nintendo
I like Pokémon Go. OK, let the flaming begin!
But I do not play a lot. I am just a casual player, and I am not alone. According to a survey from our (real-life) friends at Pollfish, which includes responses from 2,000 U.S. Pokémon Go users, more than 82% want to catch them all, and 40% have walked 11 to 30 miles to do so!
But if you want to go deeper, have a look at this analysis of more than 700 Pokémon types using R, created by Joshua Kunst.
+ Or get all the Pokémon data you’ll ever need through a modern RESTful API.
Data Engineering
Billions of Messages a Day – Yelp’s Real-time Data Pipeline
The engineering team at Yelp started a series of posts about Yelp’s real-time streaming data infrastructure. They plan to write about how Yelp streams MySQL updates in real time with an exactly-once guarantee, about schema tracking and migration, about data processing and transformation of their streams, and about how they push all the data into Redshift or Salesforce.
The first post provides a high-level architecture with Apache Avro, Apache Kafka, Apache Spark, Amazon Redshift and more.
The architecture is transparent for service authors: if you publish an event today, you can ingest that event into Amazon Redshift and Yelp’s data lake, index it for search, cache it in Cassandra, or send it to Salesforce or Marketo without writing any code. That same event can be consumed by any other service, or by any future application Yelp builds, without modification.
Read the rest about Yelp’s Real-time Data Pipeline.
The second post is everything you need to know about streaming MySQL tables in real-time to Apache Kafka.
Looking forward to the next posts 🙂
+ If you were wondering whether to pick Postgres or MySQL, here is an explanation of why Uber chose to switch from Postgres to MySQL for building Schemaless and other backend services.
Metrics
Statistics for Business Analytics 101
Statistical tools are just a different approach to practices that you are already following in your work.
For Marketing: How many e-mails do we need to send before someone signs up for the product?
For Product: How long does the onboarding process take for a new sign-up?
For Sales: How long does it take for a lead to turn into a paying customer?
For Customer support: How long does it take from the moment a new ticket is created until the first response?
So where do we go from there?
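One way to start is to turn a question like these into a number you can track. The snippet below is a minimal sketch, not a recommended methodology: the ticket timestamps, the e-mail and sign-up counts, and the helper name `minutes_between` are all made up for illustration.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical support tickets: (created_at, first_response_at).
tickets = [
    ("2016-08-01 09:00", "2016-08-01 09:20"),
    ("2016-08-01 10:00", "2016-08-01 10:45"),
    ("2016-08-01 11:30", "2016-08-01 15:30"),
    ("2016-08-02 08:15", "2016-08-02 08:25"),
]

FMT = "%Y-%m-%d %H:%M"


def minutes_between(start, end):
    """Return the elapsed minutes between two timestamp strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


# Customer support: time from ticket creation to first response.
response_minutes = [minutes_between(created, replied) for created, replied in tickets]
mean_response = mean(response_minutes)
median_response = median(response_minutes)

# Marketing: e-mails sent per sign-up (hypothetical campaign totals).
emails_sent = 1200
signups = 48
emails_per_signup = emails_sent / signups

print(f"mean first response:   {mean_response:.2f} min")
print(f"median first response: {median_response:.2f} min")
print(f"e-mails per sign-up:   {emails_per_signup:.1f}")
```

Note how one slow ticket pulls the mean far above the median. That gap is a reason to look at more than one summary statistic before settling on a metric.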
+ Read how Romy Macasieb, Walker & Co’s Product Manager, defines metrics.
Data Science & Machine Learning
Language modeling a billion words
The people at Torch wanted to build a language model that maximizes the likelihood of the next word given the history of previous words in the sentence. Here is their outcome!
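To make "likelihood of the next word given the history" concrete, here is a minimal sketch using a count-based bigram model. It is not the model from the Torch post: it shortens the history to a single previous word, and the toy corpus and function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical); a real model trains on about a billion words.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ate",
]


def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_prob(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0


def sentence_log_likelihood(counts, sentence):
    """Sum of log P(next word | previous word) over the sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_likelihood = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_word_prob(counts, prev, nxt)
        if p == 0.0:
            # An unseen word pair gets zero probability without smoothing.
            return float("-inf")
        log_likelihood += math.log(p)
    return log_likelihood


model = train_bigram(corpus)
print(next_word_prob(model, "the", "cat"))
print(sentence_log_likelihood(model, "the cat sat on the mat"))
```

Neural language models replace the count table with a learned function of the whole history, but the quantity being maximized has the same shape: the sum of log-probabilities of each next word.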

Michael Nielsen’s “Neural Networks and Deep Learning” is one of the best starting points for Deep Learning & Neural Networks. At Fermat’s Library they are trying to take it up a level. They want our help to annotate it with content like videos, presentations, blog posts, code and formulas that could enhance the book and make it even better and easier to understand.

+ Algorithmia introduced a solution for hosting and distributing trained deep learning models in the cloud. They provide initial support for Caffe, Theano, and TensorFlow, and they have an API.
+ Detecting outliers? Over high volumes of data? Enter Sharing-aware Outlier Processing (SOP). It addresses the case of building a stream-based outlier detection system that can handle large numbers of outlier detection queries simultaneously.
+ What is the surprising relation between product managers and data scientists? And how can statistics make you better at marketing, for example?
Programming
Data Journalism with R at FiveThirtyEight
We wrote about FiveThirtyEight two Blendo Data Monthly issues back, when they published an article about “Who would win in Captain America: Civil War?” FiveThirtyEight does data-driven analysis of media, culture, politics and society: data journalism at its best. This post is about how they do data science, with Andrew Flowers (Quantitative Editor) explaining how their team uses R to produce high-quality data journalism content.

+ In case you were looking for a way to transform tabular data quickly and efficiently: you can do it with R!
+ Statistical Data Analysis in Python
Training
A Course in Machine Learning (CIML). CIML is a set of introductory materials on Machine Learning covering supervised learning, unsupervised learning, large margin methods, probabilistic modeling, learning theory, and more. There are print and e-book versions.
Books
Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz & Shai Ben-David – You need a solid math background! Amazon link

+ Leaving for vacation? Here are 5 FREE Data Science e-books for your reading list.
Enjoyed this month’s Blendo Data Digest? Want more? Read more Blendo Data Monthly posts.
Want to save hours on your data management tasks? Blendo can help! Try it for FREE now!