Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar

Nowadays, real-time computation is heavily used in cases such as online product recommendation, online payment fraud detection and etc.. In the streaming pipeline, Kafka is normally used to store a day/week data, but won't store years-long data, as in looking at the trend historically. So, a batch pipeline is needed for historical data computation. Thus, it's where the Lambda architecture comes in. Lambda has been proved to be effective, and a good balance of speed and reliability. We have been running many systems with Lambda architecture for many years. But the biggest detraction to Lambda architecture has been the need to maintain two distinct (and possibly complex) systems to generate both batch and streaming layers. With that, we have to split our business logic into many segments across different places, which is a challenge to maintain as the business grows and it also increases communication overhead. Secondly, the data are duplicated in two different systems, and we have to move data among different systems for processing. With those challenges, we have been searching for alternatives and found Apache Pulsar a great fit. In this topic, I will show how we solve those problems with Apache Pulsar by making pulsar a unified storage backend for both batch and streaming pipeline, a solution that simplifies the s/w stack, lifts up our work efficiency and lowers the cost at the same time.

Speaker

Weisheng (Vincent) Xie

Chief Data Scientist/Senior Director, China Telecom Bestpay

Vincent Xie (谢巍盛) is the Chief Data Scientist/Senior Director at Orange Financial, as head of the AI Lab, he built the Big Data & Artificial Intelligence team from scratch, successfully established the big data and AI infrastructure and landed tons of businesses on top, a thorough data-driven transformation strategy successfully boosts the company’s total revenue by many times. Previously, he worked at Intel for about 8 years, mainly on machine learning- and big data-related open source technologies and productions.

Watch View Slides

Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar

Speaker

Weisheng (Vincent) Xie

Sign up for the Data Streaming Newsletter