Use Case
11:30 AM – 12:10 PM PDT, June 18
Easily Build a Smart Pulsar Stream Processor
For organizations with boundless data sources, it is important to analyze, learn, predict and even respond in real time – directly from streaming data. This is important when:
• Data volumes are large, or moving raw data is expensive,
• Data is generated by widely distributed assets (e.g., mobile devices),
• Data is of ephemeral value and analysis can't wait, or
• It is critical to always have the latest insight, and extrapolation won't do.
Use cases include prediction of failures on assembly lines, prediction of traffic flows in cities, prediction of demand placed on power grids, detection of hackers, and analysis of connection quality in mobile networks. All are characterized by a need to know – now – and require real-time processing of streaming data. Our goal is to enable real-time stream processing for Apache Pulsar in which analysis, learning and prediction are done on the fly, with continuous insights streamed back to the broker.
Streaming data reports state changes in real-world assets, systems, accounts, infrastructure and even people. The need to deliver insights instantly demands an architecture in which streaming data is continuously processed – both to permit a real-time response and to keep storage and networks from overflowing with data of ephemeral value. We use a simple architecture in which "things" are stateful, concurrent agents that interlink to form a context-rich graph, enabling real-time state sharing for sophisticated streaming analysis. A link acts like a subscription that allows concurrent, stateful agents to share state in real time, supporting rich contextual analysis, unsupervised learning, prediction, and response. Importantly, we build the graph from the data itself, and endow agents with the capacity to continually analyze, learn and predict using rich contextual data.
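The agent-and-link idea above can be illustrated with a minimal Python sketch. All names here (Agent, link, on_event) are hypothetical, chosen for illustration; Swim's actual API differs. The point is only the pattern: each "thing" holds its own state, and a link is a subscription through which one agent observes another's state changes in real time.

```python
class Agent:
    """A stateful, concurrent 'thing' in the graph (hypothetical sketch)."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self._subscribers = []  # agents linked to this one

    def link(self, subscriber):
        """Subscribe another agent to this agent's state changes."""
        self._subscribers.append(subscriber)

    def on_event(self, event):
        """Convert a raw streaming event into a state change, then share it."""
        self.state.update(event)
        for sub in self._subscribers:
            sub.on_linked_state(self.name, self.state)

    def on_linked_state(self, source, state):
        """React to a linked agent's state change (contextual analysis)."""
        self.state.setdefault("context", {})[source] = dict(state)


# Two "things" sharing state in real time via a link
sensor = Agent("sensor-1")
aggregator = Agent("city-aggregator")
sensor.link(aggregator)

sensor.on_event({"temp": 71.3})
print(aggregator.state["context"]["sensor-1"])  # {'temp': 71.3}
```

Because each agent keeps only the state it needs, analysis runs against in-memory context rather than against a database of raw events.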
The value of stateful, on-the-fly stream processing
Today developers need to overcome three challenges to build a stream processor:
• State matters, not data – Boundless data sources never stop, and analysis depends on the contextual meaning of events – in other words, state changes in the real-world sources. Pulsar streaming is stateless, so analysis requires state changes to be held in a database.
• Building applications is complex – The application developer must explicitly manage the data-to-state conversion, state storage, and the computational overhead of analysis tasks. This is complex, difficult to maintain and understand, and the required skills are in short supply.
• Infrastructure headaches – App developers have to explicitly manage scaling and computation, databases, and network connectivity. A data pipeline delivers raw data to a cluster where both data-to-state conversion and analysis happen.
Why all the complexity? In this talk, you will learn an easier way: complex data-to-state conversion performed in a distributed graph of concurrent, stateful 'digital twins' close to the data sources.
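A rough sketch of the data-to-state conversion described above, with purely illustrative names (DigitalTwin, route): raw events are routed to a per-source twin that retains only the state analysis needs, and the graph of twins is built from the data itself rather than declared in advance.

```python
class DigitalTwin:
    """Holds the distilled state of one real-world data source (illustrative)."""

    def __init__(self, source_id):
        self.source_id = source_id
        self.count = 0     # events seen so far
        self.last = None   # most recent payload

    def ingest(self, payload):
        # Data-to-state conversion: keep only what analysis needs,
        # so raw data never has to land in a database.
        self.count += 1
        self.last = payload


def route(events):
    """Build the twin graph from the data itself: one twin per source id."""
    twins = {}
    for source_id, payload in events:
        twin = twins.setdefault(source_id, DigitalTwin(source_id))
        twin.ingest(payload)
    return twins


twins = route([
    ("pump-7", {"rpm": 1200}),
    ("pump-7", {"rpm": 1180}),
    ("pump-9", {"rpm": 900}),
])
print(twins["pump-7"].count)  # 2
```

In a real deployment the routing and twins would run close to the data sources and be fed from Pulsar topics; this sketch shows only the conversion pattern.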
Simon will report results from this method on ~1PB/day of streaming data.
Speaker

Simon Crosby
CTO at Swim
Simon Crosby is CTO at Swim. Swim offers the first open core, enterprise-grade platform for continuous intelligence at scale, providing businesses with complete situational awareness and operational decision support at every moment. Simon cofounded Bromium in 2010 and now serves as a strategic advisor. Previously, he was the CTO of the Data Center and Cloud Division at Citrix Systems; founder, CTO, and vice president of strategy and corporate development at XenSource; and a principal engineer at Intel, as well as a faculty member at Cambridge University, where he led the research on network performance and control and multimedia operating systems. Simon is an equity partner at DCVC, serves on the board of Cambridge in America, and is an investor in and advisor to numerous startups. He is the author of 35 research papers and patents on a number of data center and networking topics, including security, network and server virtualization, and resource optimization and performance. He holds a PhD in computer science from the University of Cambridge, an MSc from the University of Stellenbosch, South Africa, and a BSc (with honors) in computer science and mathematics from the University of Cape Town, South Africa.