The Internet of Things (IoT) is all about connected devices that produce and exchange data. Building applications that derive insights from these high volumes of data is challenging and requires an understanding of multiple protocols, platforms, and other components. In this session, we start with a quick introduction to IoT and some of the common analytic patterns used in IoT applications. We also touch on the MQTT protocol, how it is used by IoT solutions, and some of the quality-of-service tradeoffs to consider when building an IoT application. We then discuss the Apache Spark platform components that IoT applications use to process device streaming data.
We also cover Apache Bahir and some of its IoT connectors available for the Apache Spark platform, and go over the details of how to build, test, and deploy an IoT application for Apache Spark using the MQTT data source for the new Apache Spark Structured Streaming functionality.
2. About me - Luciano Resende
Data Science Platform Architect – IBM – CODAIT
• Contributing to open source at the ASF for over 10 years
• Currently contributing to: the Jupyter Notebook ecosystem, Apache Bahir, Apache
Toree, and Apache Spark, among other projects related to AI/ML platforms
lresende@apache.org
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
3. Open Source Community Leadership
CODAIT
- Founding Partner
- 188+ Project Committers
- 77+ Projects
- Key open source steering committee memberships
- OSS Advisory Board
4. Center for Open Source Data and AI Technologies (CODAIT)
codait.org
codait (French) = coder/coded
https://m.interglot.com/fr/en/codait
CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise.
A relaunch of the Spark Technology Center (STC) to reflect an expanded mission.
7. Apache Spark Introduction
Spark Core – general compute engine; handles distributed task dispatching, scheduling, and basic I/O functions
Spark SQL – executes SQL statements
Spark Streaming – performs streaming analytics using micro-batches
Spark ML – common machine learning and statistical algorithms
Spark GraphX – distributed graph processing framework
A large variety of data sources and formats can be supported, both on premises and in the cloud, e.g. BigInsights (HDFS), Cloudant, dashDB, SQL DB.
9. Apache Spark – Spark SQL
Unified data access APIs: query structured data sets with SQL or the Dataset/DataFrame APIs.
A fast, familiar query language across all of your enterprise data, from RDBMS data sources to Structured Streaming data sources.
10. Apache Spark – Spark SQL
You can run SQL statements with the SparkSession.sql(…) interface:
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
spark.sql("create table T1 (c1 int, c2 int) stored as parquet")
val ds = spark.sql("select * from T1")
You can further transform the resulting Dataset:
val ds1 = ds.groupBy("c1").agg("c2" -> "sum")
val ds2 = ds.orderBy("c1")
The result is a DataFrame / Dataset[Row]; ds.show() displays the rows.
11. Apache Spark – Spark SQL
You can read from data sources using SparkSession.read.format(…):
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// loading CSV data into a Dataset of Bank type
val bankFromCSV = spark.read.csv("hdfs://localhost:9000/data/bank.csv").as[Bank]
// loading JSON data into a Dataset of Bank type
val bankFromJSON = spark.read.json("hdfs://localhost:9000/data/bank.json").as[Bank]
// select a column value from the Dataset
bankFromCSV.select('age).show() // returns all rows of column "age" from this dataset
12. Apache Spark – Spark SQL
You can also configure a specific data source with specific options:
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// loading CSV data into a Dataset of Bank type
val bankFromCSV = spark.read
  .option("header", "true")      // use first line of all files as header
  .option("inferSchema", "true") // automatically infer data types
  .option("delimiter", " ")
  .csv("/users/lresende/data.csv")
  .as[Bank]
bankFromCSV.select('age).show() // returns all rows of column "age" from this dataset
13. Apache Spark – Spark SQL – Data Sources
Data sources under the covers:
- Data source registration (e.g. via spark.read.format(…))
- Provide a BaseRelation implementation
• that implements support for table scans:
– TableScan, PrunedScan, PrunedFilteredScan, CatalystScan
- Detailed information available at
• https://developer.ibm.com/code/2016/11/10/exploring-apache-spark-datasource-api/
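As a sketch of the V1 extension points above, a minimal read-only data source can be built from a RelationProvider that returns a BaseRelation with TableScan support. The Spark traits are real; the relation itself (a toy table of the integers 0 to 9) and its class names are hypothetical:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Entry point Spark looks up when the data source is referenced by package name
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new NumberRelation(sqlContext)
}

// A toy relation exposing the integers 0..9 as a single-column table
class NumberRelation(val sqlContext: SQLContext)
    extends BaseRelation with TableScan {
  override def schema: StructType =
    StructType(StructField("value", IntegerType) :: Nil)
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until 10).map(Row(_))
}
```

Assuming these classes live in a (hypothetical) com.example.numbers package, the source could then be loaded with spark.read.format("com.example.numbers").load().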
14. Apache Spark – Spark SQL – Data Sources
Data Sources V1 limitations:
- Leaks upper-level APIs into the data source (DataFrame/SQLContext)
- Hard to extend the Data Sources API for more optimizations
- Zero transaction guarantees in the write APIs
- Limited extensibility
15. Apache Spark – Spark SQL – Data Sources
Data Sources V2
- Support for row-based scan and columnar scan
- Column pruning and filter push-down
- Can report basic statistics and data partitioning
- Transactional write API
- Streaming source and sink support for micro-batch and continuous mode
- Detailed information available at
• https://developer.ibm.com/code/2018/04/16/introducing-apache-spark-data-sources-api-v2/
16. Apache Spark – Spark SQL Structured Streaming
A unified programming model for streaming, interactive, and batch queries.
Considers the data stream as an unbounded table.
Image source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
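The canonical illustration of this unbounded-table model is the streaming word count from the Spark programming guide, where a groupBy aggregation is continuously updated as new lines arrive; the socket host and port below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("UnboundedTableDemo")
  .getOrCreate()
import spark.implicits._

// each line received on the socket becomes a new row of the unbounded table
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost") // placeholder host
  .option("port", 9999)        // placeholder port
  .load()

// the aggregation result grows/updates as the input table grows
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```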
17. Apache Spark – Spark SQL Structured Streaming
Regular SQL APIs:
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
val input = spark.read
  .schema(schema)
  .format("csv")
  .load("input-path")
val result = input
  .select("age")
  .where("age > 18")
result.write
  .format("json")
  .save("dest-path")
Structured Streaming APIs:
val spark = SparkSession.builder()
  .appName("Demo")
  .getOrCreate()
val input = spark.readStream
  .schema(schema)
  .format("csv")
  .load("input-path")
val result = input
  .select("age")
  .where("age > 18")
result.writeStream
  .format("json")
  .start("dest-path")
18. Apache Spark – Spark Streaming
Spark Streaming: micro-batch event processing for near-real-time analytics, e.g. Internet of Things (IoT) devices, Twitter feeds, Kafka (event hub), etc.
No multi-threading or parallel-processing programming required.
19. Apache Spark – Spark Streaming
Also known as discretized stream or DStream
Abstracts a continuous stream of data
Based on micro-batching
Based on RDDs
20. Apache Spark – Spark Streaming
// connect to an MQTT broker and count words in messages, in 2-second micro-batches
val sparkConf = new SparkConf().setAppName("MQTTWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val lines = MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_ONLY_SER_2)
val words = lines.flatMap(x => x.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
22. Origins of the Apache Bahir Project
MAY/2016: Established as a top-level Apache project
- PMC formed by Apache Spark committers/PMC members and Apache Members
- Initial contributions imported from Apache Spark
AUG/2016: The Apache Flink community joined Apache Bahir
- Initial contributions of Flink extensions
- In October 2016, Robert Metzger was elected committer
23. Origins of the Bahir name
Naming an Apache project is a science!
- We needed a name that wasn't used yet
- It needed to be related to Spark
We ended up with: Bahir
- A name of Arabic origin that means sparkling
- Also associated with someone who succeeds at everything
24. Why Apache Bahir
It’s an Apache project
- And if you are here, you know what it means
Benefits of curating your extensions at Apache Bahir
- Apache Governance
- Apache License
- Apache Community
- Apache Brand
25. Why Apache Bahir
Flexibility
- Release flexibility
• Not bound to a platform or component release cycle
Shared infrastructure
- Releases, CI, etc.
Shared knowledge
- Collaborate with experts on both the platform and component areas
26. Bahir extensions for Apache Spark
MQTT – Enables reading data from MQTT servers using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/
• http://bahir.apache.org/docs/spark/current/spark-streaming-mqtt/
CouchDB/Cloudant – Enables reading data from CouchDB/Cloudant using Spark SQL and Spark Streaming.
Twitter – Enables reading social data from Twitter using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-twitter/
Akka – Enables reading data from Akka actors using Spark Streaming or Structured Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-akka/
ZeroMQ – Enables reading data from ZeroMQ using Spark Streaming.
• http://bahir.apache.org/docs/spark/current/spark-streaming-zeromq/
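As a sketch of the MQTT Structured Streaming source documented above, a stream can be read by naming Bahir's provider class in format(…); the topic name and broker URL below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MQTTStructuredStreamingDemo")
  .getOrCreate()

// each MQTT message arrives as a new row of the unbounded input table
val messages = spark.readStream
  .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
  .option("topic", "sensors/elevator")  // placeholder topic
  .load("tcp://localhost:1883")         // placeholder broker URL

// echo the stream to the console for inspection
val query = messages.writeStream
  .outputMode("append")
  .format("console")
  .start()

query.awaitTermination()
```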
27. Bahir extensions for Apache Spark
Google Cloud Pub/Sub – Adds a Spark Streaming connector for Google Cloud Pub/Sub.
28. Apache Spark extensions in Bahir
Adding Bahir extensions into your application
- Using SBT
libraryDependencies += "org.apache.bahir" %% "spark-streaming-mqtt" % "2.2.0"
- Using Maven
<dependency>
  <groupId>org.apache.bahir</groupId>
  <artifactId>spark-streaming-mqtt_2.11</artifactId>
  <version>2.2.0</version>
</dependency>
29. Apache Spark extensions in Bahir
Submitting applications with Bahir extensions to Spark
- spark-shell
bin/spark-shell --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …
- spark-submit
bin/spark-submit --packages org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 …
31. IoT – Definition by Wikipedia
The Internet of Things (IoT) is the network of physical devices, vehicles, home appliances, and other items embedded with electronics, software, sensors, actuators, and network connectivity, which enable these objects to connect and exchange data.
32. IoT – Interaction between multiple entities
[Diagram: interactions between Things, Software, and People, via control, observe, inform, command, and actuate relationships]
33. IoT Patterns – Some of them …
• Remote control
• Security analysis
• Edge analytics
• Historical data analysis
• Distributed Platforms
• Real-time decisions
35. MQTT – Quality of Service
An MQTT broker supports three quality-of-service levels:
- QoS 0 – at most once: no connection failover, never duplicates
- QoS 1 – at least once: has connection failover, can duplicate
- QoS 2 – exactly once: has connection failover, never duplicates
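With the Eclipse Paho Java client (which the Bahir MQTT connectors use under the covers), the QoS level is chosen per published message; the broker URL, topic, and payload below are placeholders:

```scala
import org.eclipse.paho.client.mqttv3.{MqttClient, MqttMessage}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

// connect to the broker with an auto-generated client id (placeholder URL)
val client = new MqttClient("tcp://localhost:1883",
                            MqttClient.generateClientId(),
                            new MemoryPersistence())
client.connect()

val message = new MqttMessage("elevator weight: 350".getBytes("UTF-8"))
message.setQos(1) // QoS 1: at least once -- the broker may redeliver, so duplicates are possible

client.publish("sensors/elevator", message) // placeholder topic
client.disconnect()
```

Setting QoS 0 trades delivery guarantees for throughput, while QoS 2 adds a handshake to guarantee exactly-once delivery.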
36. MQTT – World usage
Smart home automation
Messaging
Notable mentions:
- IBM IoT Platform
- AWS IoT
- Microsoft IoT Hub
- Facebook Messenger
38. IoT Simulator using MQTT
The demo environment: https://github.com/lresende/bahir-iot-demo
A Node.js web app simulates elevator IoT devices, publishing metrics via MQTT (Mosquitto broker):
• Weight
• Speed
• Power
• Temperature
• System
40. Summary – Take away points
Apache Spark
- An IoT analytics runtime with support for "continuous applications"
Apache Bahir
- Brings access to IoT data via supported connectors (e.g. MQTT)
IoT Applications
- Use Spark and Bahir to start processing IoT data in near real time with Spark Streaming and Spark Structured Streaming