In the early days and weeks of any widespread global health concern, particularly in a fast-moving outbreak like the coronavirus, there are many unknowns. Data visualization can be a good starting point to understand trends and piece data points together into a meaningful story. The ability to visualize the spread of the virus can help raise awareness, understand its impact — and ultimately assist in prevention efforts.
On December 31, 2019, the World Health Organization’s China Country Office was informed of cases of pneumonia with unknown cause detected in Wuhan City, Hubei Province of China. Since its initial reporting, the new coronavirus (SARS-CoV-2) has spread in a global outbreak, infecting tens of thousands in more than 30 countries and causing the COVID-19 acute respiratory disease.
Find out the number of new coronavirus cases within the last 10 days and see how the virus’ infection rates, recovery rates, and fatality rates are trending.
Discover where the virus has migrated and compare the epicenter China with the rest of the world.
Analyze the confirmed cases to understand how the recovery rate is changing over time.
For a quick glance at summary information, refreshed daily with global statistics on the COVID-19 outbreak, start here.
Key insights from various reports are embedded in this web page, letting you view and interact with the data and keeping you updated with the latest numbers.
If you want to see more details by geography and explore the interactive report, just click on the “Full report” button at the top of the page to launch the full dashboard.
Using a dashboard view, you can easily see an overview of the COVID-19 disease outbreak based on data updated daily, including the number of confirmed new cases, recovered cases and deaths from the virus filtered by geographic location.
Overview of COVID-19 disease outbreak, including the number of confirmed new cases, recovered cases and deaths from the virus filtered by geographic location.
Time-series graphs (below) compare confirmed cases to recoveries and deaths, and show case status by reporting period. (Note: You can maximize the view of each visual in the report to see more details.)
Time-series graphs compare confirmed cases to recoveries and deaths, and show case status by reporting period.
Back on the Status Tab of the dashboard, when we look at the number of cases in Mainland China (on the right side of the screenshot below), we see that Hubei Sheng province is an outlier, as it holds the city of Wuhan, the epicenter of the outbreak (Hint: Use the interactive filters in the report to focus on the specific country to see the details).
View of the “Status Tab” of the dashboard.
The Locations Tab of the report reveals country-specific coronavirus data. Click on “Other Locations” to see a global view.
The Locations Tab of the report shows country-specific coronavirus data.
Interactive filters allow you to focus on specific countries and see details.
It’s been many weeks since the first case of the new coronavirus was reported in China and the outbreak has spread across the globe. By adding a layer of geospatial data from Esri’s GIS mapping software, we’re able to explore an interactive view of the spread of coronavirus across China and into other countries.
With SAS Visual Analytics, we can see a time-series animation (below) that demonstrates the spread of the virus across the globe. Play the animation to see the spread within China, as well as spread and severity across the globe.
What’s next
This information, presented in new ways on maps and animated timelines, is just the tip of the iceberg as public health officials and life sciences companies work to contain the COVID-19 outbreak and develop antiviral medicines to combat the disease. Data-driven techniques such as analytics and AI, applied to a diverse set of data points such as clinical patient records, social media streams and public health records, can all help refine surveillance approaches, as long as regulatory and citizen privacy laws are respected.
Further development of data science projects and collaboration within the health and life sciences industry is needed. Those interested in a code-based approach may want to check out these tips from graph expert Robert Allison.
As the data and this outbreak evolve, so will our analytics and reports, becoming even more meaningful to the global health community. Stay healthy, and let us hear from you about ways that SAS Visual Analytics and other SAS analytical technologies can help (or are helping) shed light on the current coronavirus epidemic.
Over the last decade, no trend in computing has generated more buzz than “Big Data”.
Computer makers, software companies, and IT experts frequently announce that it’s changing everything. Your company is told it needs to incorporate big data or get left behind. When you explore the potential and contact vendors, it can be hard to understand what big data is and how it fits into your company.
A lot of advertising is vague, promising to refine data management without explaining how it’s done.
Once big data processes are installed, information assets become more and more critical. At the same time, they consume more and more resources. Before long, there’s a point where the cost of resources offsets the data processing gains. Some of the best methods for walking this tightrope include fast data and cloud management.
What Is Big Data?
The term “Big Data” refers to a concept for processing large amounts of stored information.
The idea is that every piece of information a company can harvest from internal workflows, customer data, and marketing research is useful.
When data of various types is cross-referenced, new insights come out of that process.
For instance, customer addresses, dates of purchase, and brand preference can reveal how different regions compare in preferences and buying habits. Comparisons like this produce high value, customer-centered information that becomes a force multiplier. Now you’re creating marketing and customer relations targets that are very specific.
Systems such as AI and connected devices on the manufacturing line, in the marketing department, and at corporate offices can use this data to tailor the customer experience and drive product development.
The main problem with big data is that because systems gather so much information from so many channels, it takes lots of storage.
Queries to the database take longer to finish because there’s more to sort through.
Standard practice is to run hourly or daily batches of data sorting and queries, then apply the findings to company processes. It takes time (money) to apply results from this process to other company processes.
What is Fast Data?
The biggest problem with big data is how big it can get.
Recent technology advances have added tremendous data-gathering abilities to the system. The Internet of Things (IoT) is growing exponentially, with 50 billion connected devices projected by 2022.
Faster and faster network speeds contribute to the pace at which the data piles up.
The approaching 5G standard will only add to the load.
If the speed of processing can be improved, these systems can respond much faster to requests for information. That would result in timelier responses to consumer needs or to requests for changes from corporate departments.
The company itself becomes far more agile, improving its capabilities to respond to threats from competition, economic swings, and changing demand.
This is where fast data provides a solution to the big data problem.
Big data is too big to use standard relational databases and spreadsheets for analysis. Fast data works around this issue by analyzing data in real-time as it streams into storage.
How much data does a company need to store?
Other than legal requirements, much of the information gathered on buying habits and consumer preferences is unnecessary once it’s been processed. Once established, the relationships between these sets of data can themselves be stored, and the raw data doesn’t need to be kept anymore. This frees up resources and lowers the cost of maintaining a robust big data process.
From Big Data to Fast Data
This discussion isn’t about replacing big data systems with fast data systems. The two concepts work together.
Think about the sheer amount of data gathered in virtually no time at all by financial institutions, traffic management systems, research networks, and weather services.
Solutions such as cloud computing have solved many of the storage and management issues related to big data processes.
Now those same systems can analyze client preferences or website habits and automatically respond with rapid, decisive action as that event occurs. Computing power has been increasing at a mind-blowing pace, providing a solution to the time factor involved in processing big data.
Fast data brings whole new capabilities to enterprise response time.
Systems that use AI to process streaming data can now respond in real time to prevent bank fraud, for instance, where it used to take days or even weeks to spot the signs. This capability is a game-changer like few other recent advances. Meteorologists can spot dangerous weather events like tornadoes and give residents ample warning time.
The purpose of fast data is to provide solutions and improve capabilities to make use of big data, not to replace big data.
In a nutshell, fast data makes big data more usable in the real-time business environment. Businesses that recognize the capabilities offered by fast data will have a competitive advantage over those in their industries that don’t.
But fast data isn’t the perfect solution for every business.
For one thing, it can be expensive to provide the IT knowledge to install, maintain, and operate such a system. Software and hardware will both require significant investments in labor and purchasing costs. Companies that don’t need to respond quickly or in real-time, or that don’t already have significant big data processes in place may not find these investments worth the cost.
Shifting Data Processing Emphasis
The best way to think of these concepts is to visualize big data as deep thinking, while fast data can be considered as decisive action.
While both are necessary, each operates differently on the same sets of data.
Cloud storage of big data provides raw data to mix and match for insights. Fast data analyzes that data as it comes in before it goes up to the cloud and triggers actions related to those insights. It’s the best of both worlds.
Vast amounts of decision-making information are available in storage, and it can be analyzed to create real-time decisions that improve the bottom line.
Forrester recently released its “Now Tech: Enterprise Architecture Management Suites for Q1 2020” to give organizations an enterprise architecture (EA) playbook.
It also highlights select enterprise architecture management suite (EAMS) vendors based on size and functionality, including erwin.
The report notes six primary EA competencies in which we excel in the large vendor category: modeling, strategy translation, risk management, financial management, insights and change management.
Given our EA expertise, we thought we’d provide our perspective on the report’s key takeaways and how we see technology trends, business innovation and compliance driving companies to use EA in different ways.
Improve Enterprise Architecture with EAMS
To an EA professional, it may seem obvious that tools provide “a holistic view of business demand impact.” Delivery of innovation at speed is critical, but what does that really mean?
Not only should EA be easy to adopt and roll out, but artifacts should also be easy for various stakeholders to visualize quickly and effectively, in the format they need to make decisions rapidly.
Just as an ERP system is a fundamental part of business operations, so is an enterprise architecture management suite. It’s a living, breathing tool that feeds into and off of the other physical repositories in the organization, such as ServiceNow for CMDB assets, RSA Archer for risk logs, and Oracle NetSuite and Salesforce for financials.
Being able to connect the enterprise architecture management suites to your business operating model will give you “real-time insights into strategy and operations.”
And you can further prove the value of EA with integrations to your data catalog and business glossary with real-time insights into the organization’s entire data landscape.
Select Enterprise Architecture Vendors Based on Size and Functionality
EA has re-emerged to help solve compliance challenges in banking and finance plus drive innovation with artificial intelligence (AI), machine learning (ML) and robotic automation in pharmaceuticals.
These are large organizations with significant challenges, which require an EA vendor to invest in research and development to innovate across their offerings so EA can become a fundamental part of an organization’s operating model.
We see the need for a “proprietary product platform” in the next generation of EA, so customers can create their own products and services to meet their particular business needs.
They’re looking for product management, dev/ops, security modeling, personas and portfolio management all to be part of an integrated EA platform. In addition, customers want to ensure platforms are secure with sound coding practices and testing.
Determine the Key Enterprise Architecture Capabilities Needed
With more than 20 years of EA experience, erwin has seen a lot of changes in the market, many in the last 24 months. Guess what? This evolution isn’t slowing down.
We’re working with some of the world’s largest companies (and some smaller ones too) as they try to manage change in their respective industries and organizations.
Yesterday’s use case may not serve tomorrow’s use case. An EA solution should be agile enough to meet both short-term and long-term needs.
Use EA Performance Measures to Validate Enterprise Architecture Management Suite Value
EA should provide a strong ROI and help an organization derive value and successful business outcomes.
Additionally, a persona-based approach that involves configuring the user interface and experience to suit stakeholder needs eases the need for training.
Formalized training is important for EA professionals and some stakeholders, and the user interface and experience should reduce the need for a dedicated formal training program for those deriving value out of EA.
Why erwin for Enterprise Architecture?
Whether documenting systems and technology, designing processes and value streams, or managing innovation and change, organizations need flexible but powerful EA tools they can rely on for collecting the relevant information for decision-making.
Like constructing a building or even a city – you need a blueprint to understand what goes where, how everything fits together to support the structure, where you have room to grow, and if it will be feasible to knock down any walls if you need to.
Without a picture of what’s what and the interdependencies, your enterprise can’t make changes at speed and scale to serve its needs.
erwin Evolve is a full-featured, configurable set of enterprise architecture tools, in addition to business process modeling and analysis.
The combined solution enables organizations to map IT capabilities to the business functions they support and determine how people, processes, data, technologies and applications interact to ensure alignment in achieving enterprise objectives.
See for yourself why we were included in the latest Forrester EAMS report. We’re pleased to offer you a free trial of erwin Evolve.
To benchmark the performance of our newly released RedisTimeSeries 1.2 module, we used the Time Series Benchmark Suite (TSBS). A collection of Go programs based on the work made public by InfluxDB and TimescaleDB, TSBS is designed to let developers generate datasets and then benchmark read and write performance. TSBS supports many other time-series databases, which makes it straightforward to compare databases.
This post will delve deep into the benchmarking process, but here’s the key thing to remember: RedisTimeSeries is fast…seriously fast! And that makes RedisTimeSeries by far the best option for working with time-series data in Redis:
Compression causes no performance degradation as long as the shards are not CPU bound.
Performance does not degrade when the cardinality increases (see note below).
Performance does not degrade when you add more samples to a time series.
RedisTimeSeries 1.2 can improve query latency by up to 50% and throughput by up to 70% compared to version 1.0. The more complex the query, the bigger the performance gain.
To compare RedisTimeSeries 1.2 with version 1.0.3, we chose three datasets. The first two have the same number of samples per time series but differ in cardinality.
Note: The maximum cardinality of a time-series dataset is defined as the maximum number of distinct elements that the dataset can contain or reference in any given point in time. For example, if a smart city has 100 Internet of Things (IoT) devices, each reporting 10 metrics (air temperature, Co2 level, etc.), spread across 50 geographical points, then the maximum cardinality of this dataset would be 50,000 [100 (deviceId) x 10 (metricId) x 50 (GeoLocationId)].
We chose these two datasets to benchmark query/ingestion performance versus the cardinality. The third dataset has the same cardinality as the first, but has three times as many samples in each time series. This dataset was used to benchmark the relationship between ingestion time and the number of samples in a time series.
| Test case | Interval (seconds) | Data points per series | Total series | Total data points (millions) |
|---|---|---|---|---|
| 30-day interval for 100 devices x 10 metrics (cardinality: 1K) | 2,592,000 | 259,200 | 1,000 | 259.2 |
| 30-day interval for 1K devices x 10 metrics (cardinality: 10K) | 2,592,000 | 259,200 | 10,000 | 2,592 |
| 90-day interval for 100 devices x 10 metrics (cardinality: 1K) | 7,776,000 | 777,600 | 1,000 | 777.6 |
Benchmark infrastructure
The performance benchmarks were run on Amazon Web Services instances, provisioned through Redis Labs’ benchmark testing infrastructure. Both the benchmarking client and database servers were running on separate c5.24xlarge instances. The database for these tests was running on a single machine with Redis Enterprise version 5.4.10-22 installed. The database consisted of 10 master shards.
In addition to these primary benchmark/performance analysis scenarios, we also enable running baseline benchmarks on network, memory, CPU, and I/O, in order to understand the underlying network and virtual machine characteristics. We represent our benchmarking infrastructure as code so that it is stable and easily reproducible.
Ingestion benchmarks
The table below compares the throughput between RedisTimeSeries version 1.0.3 and the new version 1.2 for all three datasets. You can see that the difference between the two versions is minimal. We did, however, introduce compression, which consumed about 5% additional CPU cycles. From this, we can conclude that if the shards are not CPU bound, compression causes no throughput degradation.
| Test case | # Samples (millions) | v1.0.3 throughput | v1.2 throughput | % diff |
|---|---|---|---|---|
| 30-day interval for 100 devices x 10 metrics (cardinality: 1K) | 259.20 | 354,812.17 | 363,562.25 | 2.47% |
| 30-day interval for 1K devices x 10 metrics (cardinality: 10K) | 2,592.00 | 349,522.72 | 361,519.57 | 3.43% |
| 90-day interval for 100 devices x 10 metrics (cardinality: 1K) | 777.60 | 352,025.35 | 343,665.92 | -2.37% |
| % diff, cardinality 1K vs. 10K | | -1.49% | -0.56% | |
(The small % diff column shows no degradation by compression; the last row shows no degradation by cardinality.)
The last row of the table compares throughput across the first two datasets. There is almost no difference, which tells us that performance does not degrade when the cardinality increases. Most other time-series databases lose performance as cardinality increases because of the underlying database and indexing technologies they use.
The three images below track throughput, latency, and memory consumption during the ingestion of the third (and largest) dataset. We inserted 800 million samples into a single database over the course of less than two hours. What is important here is that the latency and throughput do not degrade when there are more samples in a time series.
Screenshot of the Grafana dashboard monitoring throughput during the ingestion phase.
Screenshot of the Grafana dashboard monitoring latency during the ingestion phase.
Screenshot of the Grafana dashboard monitoring the memory consumed by Redis.
Query performance
TSBS includes a range of different read queries. The charts below represent the query rate and query latency of multi-range queries, comparing RedisTimeSeries version 1.0.3 to version 1.2. They show that query latency can improve by up to 50% and throughput can increase by up to 70%, depending on the query complexity, the number of time series accessed to calculate the response, and the query time range. In general, the more complex the query, the more visible the performance gain.
This behavior is due to both compression and changes to the API. Since more data fits in less memory space, fewer memory-block accesses are required to answer the same queries. Similarly, the API’s new default behavior of not returning the labels of each time series leads to a substantial reduction in load and overall CPU time for each TS.MRANGE command.
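To make that API change concrete, here is a minimal sketch using the redis-py client against a Redis server with the RedisTimeSeries module loaded (the connection details, key names, and labels are illustrative assumptions, not part of the benchmark):

import redis  # assumes the redis-py package and a RedisTimeSeries-enabled server on localhost

r = redis.Redis(host='localhost', port=6379)

# Two series tagged with labels so they can be selected by a filter (names are illustrative).
r.execute_command('TS.CREATE', 'temp:dev1', 'LABELS', 'metric', 'temperature', 'deviceId', 'dev1')
r.execute_command('TS.CREATE', 'temp:dev2', 'LABELS', 'metric', 'temperature', 'deviceId', 'dev2')
r.execute_command('TS.ADD', 'temp:dev1', '*', 21.5)
r.execute_command('TS.ADD', 'temp:dev2', '*', 23.1)

# In 1.2, TS.MRANGE omits each series' labels unless WITHLABELS is specified,
# which keeps replies smaller and reduces per-command CPU time.
print(r.execute_command('TS.MRANGE', '-', '+', 'FILTER', 'metric=temperature'))
print(r.execute_command('TS.MRANGE', '-', '+', 'WITHLABELS', 'FILTER', 'metric=temperature'))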
Memory utilization
The addition of compression in RedisTimeSeries 1.2 makes it interesting to compare memory utilization across these three datasets. The result is a 94% reduction in memory consumption for all three datasets in this benchmark. Of course, this is a lab setup where timestamps are generated at fixed time intervals, which is ideal for double-delta compression (for more on double-delta compression, see RedisTimeSeries Version 1.2 Is Here!). As noted, a memory reduction of 90% is common for real-world use cases.
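To see why fixed intervals are the ideal case, here is a toy Python illustration of the double-delta idea (a sketch of the concept only, not the module’s actual codec):

timestamps = [1000, 2000, 3000, 4000, 5000]  # samples arriving every 1,000 ms
deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]   # [1000, 1000, 1000, 1000]
double_deltas = [b - a for a, b in zip(deltas, deltas[1:])]    # [0, 0, 0]
print(deltas, double_deltas)
# A long run of zeros can be encoded in very few bits, which is why perfectly
# regular timestamps compress so well with double-delta encoding.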
| Test case | # of data points (millions) | Memory used, v1.0.3 | Memory used, v1.2 | Compression rate |
|---|---|---|---|---|
| 30-day interval for 100 devices x 10 metrics (cardinality: 1K) | 259 | 4.51GB | 269MB | 94.71% |
| 30-day interval for 1K devices x 10 metrics (cardinality: 10K) | 2500 | 44.8GB | 2.4GB | 94.60% |
| 90-day interval for 100 devices x 10 metrics (cardinality: 1K) | 780 | 13.5GB | 736MB | 94.49% |
RedisTimeSeries is seriously fast
When we launched RedisTimeSeries last summer, we benchmarked it against time-series modelling options with vanilla data structures in Redis, such as sorted sets and hashes or streams. In memory consumption, it already outperformed the other modeling techniques apart from Streams, which consumed half the memory that RedisTimeSeries did. With the introduction of Gorilla compression (more on that in this post: RedisTimeSeries Version 1.2 Is Here!), RedisTimeSeries is by far the best way to persist time series data in Redis.
In addition to demonstrating that there is no performance degradation by compression, the benchmark also showed there is no performance degradation by cardinality or by the number of samples in time series. The combination of all these characteristics is unique in the time-series database landscape. Add in the greatly improved read performance, and you’ll definitely want to check out RedisTimeSeries for yourself.
Finally, it’s important to note that the time-series benchmarking ecosystem is rich and community-driven—and we’re excited to be a part of it. Having a common ground for benchmarking has proven to be of extreme value in eliminating performance bottlenecks and hardening every solution in RedisTimeSeries 1.2. We have already started contributing to better understanding latency and application responsiveness on TSBS, and plan to propose further extensions to the current benchmarks.
Editor’s note: Amanda Makulec is joining as an advisor to the Coronavirus Data Resource Hub. Holding a Master of Public Health degree and serving as the Operations Director for the Data Visualization Society, she’s an expert in the responsible use of data visualization for public health. She will be helping the Tableau team identify data resources, curate visualizations, and ensure that what is available through the hub is of the highest quality and consistent with responsible information sharing during a critical time. Follow her at @abmakulec and in Nightingale, the journal of DVS.
Teams are making ready-to-use COVID-19 datasets easily accessible for the wider data visualization and analysis community. Johns Hopkins posts frequently updated data on their github page, and Tableau has created a COVID-19 Resource Hub with the same data reshaped for use in Tableau.
These public assets are immensely helpful for public health professionals and authorities responding to the epidemic. They make data from multiple sources easy to use, which can enable quick development of visualizations of local case numbers and impact.
At the same time, the stakes are high around how we communicate about this epidemic to the wider public. Visualizations are powerful for communicating information, but can also mislead, misinform, and—in the worst cases—incite panic. We are in the middle of complete information overload, with hourly case updates and endless streams of information.
As a public health professional, might I ask:
“Please consider if what you’ve created serves an actual information need in the public domain. Does it add value to the public and uncover new information? If not, perhaps this is one viz that should be for your own use only.”
We want to help flatten the curve to minimize strain on our health system. The best way to do that is to take individual actions to slow the speed of transmission—like washing your hands and self-quarantining if exposed—and to amplify the voices of experts.
If you only learn one thing about #COVID19 today make it this: everyone’s job is to help FLATTEN THE CURVE. With thanks to @XTOTL & @TheSpinoffTV for the awesome GIF. Please share far & wide. pic.twitter.com/O7xlBGAiZY
If, after reading all of these caveats and warnings about the harm and panic that can be caused by misleading visualizations, you’ve decided to explore and visualize data about COVID-19, here are ten considerations for your design process.
Today is PI DAY! Obviously, Pi has rather more than 2 decimal places. To have some fun, let’s use Jet to drive multiple Python workers to calculate Pi with increasing accuracy.
Pi is the ratio of a circle’s circumference to its diameter. It’s 3.14. It’s 3.1416. It’s 3.14159265 or whatever. It has, apparently, an infinite number of decimal places.
But how to calculate it?
There are various ways, but here we’re going to use the “Monte Carlo method.”
Random points on a square that contains a circle
Picture a circle that exactly fits inside a square so it touches at the edges, as per the diagram. If we randomly generate points inside the square (the green crosses), some will be inside the circle and some won’t. The proportion of points inside the circle, relative to the total number of points generated, multiplied by 4 gives us an approximation of Pi. The more random points we generate, the better the approximation becomes.
Our circle has a radius of 1.0 and is centered on (0, 0).
We generate random points within the square, where X and Y each vary randomly between -1 and +1 inclusive.
To calculate whether a generated (X, Y) coordinate is within the circle, all we need to do is check whether X squared plus Y squared is less than or equal to the radius squared. Conveniently, but not coincidentally, the radius is 1.
Here’s the Python code:
count_all = 0
count_inside = 0

def handle(points):
    # Running totals live at module level so they persist across the batches of
    # points handed to this worker (see the discussion of statefulness below).
    global count_all, count_inside
    results = []
    for point in points:
        count_all += 1
        xy = point.split(',')
        x = xy[0]
        y = xy[1]
        x_squared = float(x) * float(x)
        y_squared = float(y) * float(y)
        xy_squared = (x_squared + y_squared)
        if xy_squared <= 1 :
            count_inside += 1
        pi = 4 * count_inside / count_all
        results.append(str(pi))
    return results
The code keeps a running total of the points inside the circle (count_inside) and of all points processed (count_all).
Suppose the first point is (1, 1) which is outside of the circle. Using the running total of zero points inside we estimate Pi as 0.000000.
Imagine now the second point is (0, 0), which is inside the circle. Now, our running total is 1 point inside from 2, so we now estimate Pi as (4 * 1) / 2, or 2.000000.
Let’s use a third point (0.5, 0.5), which is also inside the circle. Our running total is 2 points inside from 3 and now we estimate Pi as (4 * 2) / 3, or 2.666666.
Finally, a fourth point (0.6, 0.6), again is inside the circle. The running total of 3 points inside from 4 gives an estimate for Pi of (4 * 3) / 4, or 3.000000.
Ok, so 3.000000 is still a poor estimate for Pi, but it’s better than the previous value 2.666666, which was better than the predecessor 2.000000 and certainly better than 0.000000.
That’s the theory in action: with each new random input point, the estimate for Pi becomes more accurate.
Refer to the following diagram.
Python code is pushed to Jet to run across multiple nodes
Python code is interpreted, meaning it’s just a collection of files, or, even more simply, can be viewed as a string. That makes it very easy to deploy.
In the diagram, the Python code module is on a host machine in the top left corner. The deploy process streams this Python code from the host machine in the top left corner to one of the Jet host machines in the lower half of the diagram, which in turn duplicates the deployment across all Jet machines.
When the job using the code runs, each Jet instance (a JVM) runs one or more Python virtual machines on that host. Here there are three Jet instances and each of these spins up two Python workers with data passing in and out via open GRPC sockets. So, here we actually have six running Python workers passing data back and forwards with three Jet instances. Hopefully, the Python workers are stable and won’t crash, but if they do Jet won’t fail.
localParallelism
When Jet runs a Python worker in a pipeline stage, there is an optional parameter localParallelism.
This controls how many Python workers will run for each Hazelcast Jet node and will – by default – be derived from the number of CPUs, one worker for each.
In other words, Jet will launch multiple Python workers for the job to maximize usage of the available CPUs and hence processing capacity. Input will be striped by Jet across the available workers.
requirements.txt
Python is an interpreted language and may make reference to libraries such as numpy and pandas.
In standard Python fashion, these libraries should be listed in a requirements.txt file, which Jet will download before executing the Python job.
For faster job start-up, you can pre-install the requirements to the hosts running Hazelcast using pip3. In a Docker environment, you should build these into your Docker image.
Here’s the first attempt as a Jet pipeline to calculate Pi.
Input
A custom input source has been defined, which generates an infinite stream of (X, Y) coordinates into a map.
A Tuple2 is a Jet convenience class to hold a data pair. There is also Tuple3 for a trio of data items and Tuple4 for four data items together.
So a Tuple2 is a map entry, where X is the entry’s key and Y is the entry’s value. We use the map as the input source for our pipeline.
This generates an unlimited number of points, which is what we need for our estimation calculation to become more accurate. However, we need eviction on the target map so we don’t run out of storage space.
Pipeline
We already shared the Python code above. Now, let’s see how Jet runs it.
We define a pipeline that reads from a Map.Entry holding our (X, Y) coordinates. We reformat this as a CSV separated string and pass it into a Python script with the given name.
From the answer that the Python workers give us, we calculate the average of these every 5 seconds and publish this to a topic named “pi“.
Easy!
Visualization
This diagram tries to visualize what is happening:
Job 1 running across 3 nodes publishing to a topic
Here we imagine there might be three nodes, each running one Jet instance.
Each is generating X and Y randomly and in parallel, but these points are passed into the Jet job named “Job 1“.
“Job 1” takes this input locally (ie. for each Jet instance), and passes it into local Python workers. For simplicity, we assume one Python worker per Jet node — more Python workers are better for performance but will make the diagram too complicated.
The Pi value calculated by each Python worker is aggregated/collated on one of the Jet nodes (let’s assume the one on the left) and then broadcast to a topic which is available on all nodes.
Output
The pipeline outputs the result every 5 seconds.
***************************************************************
Topic 'pi' : Job 'Pi1Job' : Value '3.145907828268894'
***************************************************************
So what’s wrong with this?
This approach is not mathematically correct, though you’re a good mathematician if you’ve spotted it before reading this section.
Each Python worker calculates Pi independently and the average is taken. However, the average of the per-worker quotients is not equivalent to the quotient of the pooled sums.
Imagine three Python workers and a sequence of four input points, the first three of which lie within the circle.
Two Python workers receive an input point within the circle, each estimates Pi as (4 * 1) / 1, or 4.000000.
One Python worker receives an input point inside and an input point outside the circle, estimating Pi as (4 * 1) / 2, or 2.000000.
The average of the three Python workers’ estimates is (4.000000 + 4.000000 + 2.000000) / 3, or 3.333333. However, for the actual input, the right answer is (4 * 3) / 4, or 3.000000.
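A few lines of Python make the discrepancy concrete for this example:

# Three workers each estimate Pi from their own slice of the four input points.
worker_estimates = [4 * 1 / 1, 4 * 1 / 1, 4 * 1 / 2]    # 4.0, 4.0, 2.0
print(sum(worker_estimates) / len(worker_estimates))     # 3.333... (average of the quotients)

# Pooling the raw counts first gives the correct estimate for the same input.
total_inside, total_points = 3, 4
print(4 * total_inside / total_points)                   # 3.0 (quotient of the sums)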
The business logic is wrong; it wouldn’t be the first time, and it won’t be the last.
The code is stateful
The Python code keeps running totals here
count_all += 1
and here
count_inside += 1
for the points processed and number inside the circle.
This is local to the Python worker and not saved anywhere. If the Python worker is restarted, its running totals are lost and begin again from zero.
Jet will stop the Python workers while the cluster changes size (Hazelcast nodes join or leave) and restart them with the input load re-partitioned across the new cluster member nodes.
The first approach is not right. Let’s try again.
Input
The input for this attempt is the same as the input for the first attempt.
We can only meaningfully compare the output from the first and second attempts if they are processing the same stream of random points. If they had different streams of random points, the comparison would be weakened.
Python code
The Python code is slightly amended from the first attempt.
from tribool import Tribool

def handle(points):
    results = []
    for point in points:
        xy = point.split(',')
        x = xy[0]
        y = xy[1]
        x_squared = float(x) * float(x)
        y_squared = float(y) * float(y)
        xy_squared = (x_squared + y_squared)
        if xy_squared <= 1 :
            result = Tribool(True)
        else :
            result = Tribool(False)
        # No running totals here: just report whether this point is inside the circle.
        results.append(str(result))
    return results
Now all we output for each input point is “true” or “false” for whether the input point is inside the circle or not. We don’t keep running totals in the Python worker anymore.
Note also, just for fun, the “from tribool import Tribool” line, which imports an optional library.
Pipeline
Finally, we have amended the Jet pipeline so that Jet (in Java) maintains the counts of points inside and outside the circle globally from all Python workers concurrently.
Now a Jet job stage keeps the running totals, and everything comes out in a five-second window to a topic as before.
This is mathematically better but pushes some of the calculation into the pipeline. That’s not a big problem, but it does make it a little harder, from a support perspective, to work out which part of the calculation happens where.
Visualization
Now let’s try this diagram:
Job 2 running across three nodes publishing to a topic
We have added a second job, named with great imagination “Job 2“.
“Job 2” takes the same input source as “Job 1” and again runs in parallel across the Jet nodes. This time, “Job 2” sends its input to different Python workers to calculate Pi, collates on one of the Jet nodes (the right one this time), and publishes to a topic across all nodes.
The diagram now looks a little cluttered. “Job 1” is running across all Jet nodes, running one version of the Python code in Python workers. “Job 2” has been added, also running across all Jet nodes, running a different version of the Python code in different Python workers.
So, in effect, we have two different Python programs processing the same input source and sending output to the same sink.
Output
The pipeline outputs the result every five seconds. The job name has changed and, since the derivation of Pi is different, the output will most likely differ from the first attempt.
***************************************************************
Topic 'pi' : Job 'Pi2Job' : Value '3.1585789532668387'
***************************************************************
Is the second attempt better than the first attempt?
The second attempt is mathematically correct so theoretically is best. But the business logic is in two places, in Python code and Java code, which makes it harder to follow.
Given enough input, the first attempt will produce a reasonable approximation to Pi, so perhaps it is good enough. Simplicity could be a preference over absolute correctness.
What’s crucial to note here is that you can run both attempts concurrently.
Jet runs each job in a sandbox (or in Java terms, a classloader).
If you have a new version of your processing logic, you can stop the old one and start the new one. That’s a simple approach, and in many ways appealing. However, for some period neither is running and if your job is doing something business-critical that’s no good.
With Jet, you can run both jobs concurrently from the same input. You need some way to distinguish the output in order to decide whether the new version is an improvement (here it’s just a publish to a topic with the job variant name). Once you have decided, you can shut down whichever is performing worse.
This is a clear improvement in terms of no loss of service, but you have to note that adding an extra job increases the processing load. If your processing time is critical, you should consider temporarily scaling up the cluster size to accommodate the dual processing of some input.
Jet can run Python workers in a pipeline, streaming or batching data into them, spinning up multiple Python workers automatically and adjusting when nodes join or leave the cluster.
All you need to provide Jet with is the location of a Python script. Jet will push it to all nodes in the cluster and run it concurrently for you.
If you want to try this yourself, the code is here.
The Python code here is just a trivial example, but it can be anything, including Machine Learning of course.
If you have studied temporal parallelism as a way to speed up CPU execution, you have come across instruction pipelines, also known as pipeline processing. In pipeline processing, many instructions are in different stages of execution at the same time. The term “data pipeline” is something of a misnomer: it refers to a high-bandwidth communication channel used to transport data between a source system and a destination. In certain cases the destination is called a sink. Pipelines, by definition, allow a fluid to flow automatically from one end to the other when one end is connected to a source. The flow of data through a communication channel led people to think of it as a pipeline, and the term “data pipeline” emerged. The source system can be an e-commerce or travel website, or a social media platform. The destination of a data pipeline can be a data warehouse, a data lake, a visualization dashboard, or another application such as a web-scraping tool or a recommender system.
With the availability of social media on ubiquitous computing devices, everyone in the world has become a data entry operator. IoT devices have become another source of continuous data. Consider a single comment on a social media site. The entry of this comment could generate data to feed a real-time report counting social media mentions, or a sentiment analysis application that outputs a positive, negative, or neutral result. Though the data comes from the same source in all cases, each of these applications is built on a unique data pipeline that must complete smoothly before the end user sees the result. Hence data pipelines are one-to-many connections, depending on the number of applications consuming the data. Common processing steps in data pipelines include data transformation, augmentation, enrichment, filtering, grouping, aggregating, and running machine learning algorithms against that data.
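As a rough illustration of those steps, here is a minimal Python sketch of filtering, enriching, and aggregating a handful of social media mentions (the records and field names are made up for illustration):

raw_events = [
    {"user": "a", "text": "love this product", "lang": "en"},
    {"user": "b", "text": "not great", "lang": "en"},
    {"user": "c", "text": "bonjour", "lang": "fr"},
]

filtered = [e for e in raw_events if e["lang"] == "en"]         # filtering
enriched = [{**e, "sentiment": "positive" if "love" in e["text"] else "negative"}
            for e in filtered]                                  # transformation / enrichment
mention_count = len(enriched)                                   # aggregation for a mentions report
print(mention_count, enriched)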
Data pipelines have become a necessity for today’s data-driven enterprise to handle big data. We all know that volume, variety, and velocity are the key attributes of big data, and big data pipelines are built to accommodate one or more of these attributes efficiently. Take the volume attribute first: it is handled differently by pipelines processing real-time stream data and by those processing batch data, and it requires that data pipelines be scalable in capacity, since the data volume can vary over time. A big data pipeline must be able to scale to handle significant volumes of data concurrently. The velocity of big data makes it necessary to build real-time streaming data pipelines, so data can be captured and processed in real time to enable quick decisions in solutions like recommender systems. The variety attribute requires that big data pipelines be able to recognize and process data in many different formats: structured, unstructured, and semi-structured.
Data pipelines are useful for businesses relying on large volumes of data arriving from multiple sources. Depending on how the data is used, data pipelines are broadly classified as real-time, batch, and cloud-native. Sometimes the data needs to be processed in real time, for systems in which sub-second decision making is required. Batch mode addresses the volume attribute of big data; these pipelines are used when large volumes of data are to be processed at regular intervals. You can store the batch data in data tanks until it gets processed, and there can be multiple data tanks in a batch-mode data pipeline. Cloud-native data pipelines are designed to work with cloud-based data by creating complex data processing workloads. For example, AWS Data Pipeline is a web service that easily automates and transforms data.
In real-time data pipelines, the data flows as and when it arrives. This type of pipeline addresses the velocity attribute of big data. There can be a difference between the rate at which data arrives and the rate at which it is consumed, so we need to implement queuing and buffering in the data pipeline to absorb the mismatch. A commonly used tool is Apache Kafka, a message-queue-based event streaming platform. Kafka works in publish-subscribe mode and ensures that messages are queued in the order in which they arrive and delivered in the same order with high reliability. Kafka buffers the messages in memory for quick delivery.
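Here is a minimal sketch of publishing and consuming events with the kafka-python client (the broker address, topic name, and message payload are assumptions for illustration):

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("site-comments", b"great checkout experience!")   # publish an event to the queue
producer.flush()

consumer = KafkaConsumer("site-comments",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:    # within a partition, messages arrive in the order they were produced
    print(message.value)
    break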
Another term closely related to real-time data pipelines is stream computing, which means pulling in streams of data in a single flow. Stream computing uses software algorithms that analyze the data in real time as it streams in, to increase speed and accuracy. A simple example of stream computing is graphics processing, implemented using GPUs, for rendering images on your computer screen. Other examples are data from streaming sources such as financial markets or telemetry from connected devices. While stream computing does its processing in real time, ETL tools have traditionally been used for processing workloads in batch. With the evolution of data pipelines, a new breed of streaming ETL tools is emerging for real-time transformation of data.
Lambda architecture is a data-processing architecture that evolved to meet the requirements of big data processing. It is designed to take advantage of both batch and stream-processing methods: it uses batch processing to provide accurate views of historical data while simultaneously using real-time stream processing to provide low-latency views of online data. Lambda architecture describes a system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries. The batch layer pre-computes results from archived data using distributed processing frameworks like Hadoop MapReduce and Apache Spark that can handle very large quantities of data; it aims at perfect accuracy by processing all available data when generating views. The speed layer processes data streams in real time, sacrificing throughput to minimize latency and provide real-time views into the most recent data. The speed layer is responsible for filling the gap caused by the batch layer’s delay in providing views based on the most recent data. Its views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received and can be replaced when the batch layer’s views for the same data become available. The two view outputs are joined, through a T junction in the data pipeline, at the presentation layer, where the reports are generated. In Lambda architecture, real-time data pipelines merge with batch processing so that decisions are based on the latest data.
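The toy sketch below shows the serving-layer idea in a few lines of Python (the keys and counts are illustrative):

batch_view = {"mentions:2020-03-13": 120_000}   # accurate view precomputed by the batch layer
speed_view = {"mentions:2020-03-13": 350}       # fresh increments from the speed layer

def serve(key):
    # A query combines the delayed-but-accurate batch view with the approximate
    # real-time view; the speed view is replaced once the next batch run catches up.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(serve("mentions:2020-03-13"))   # 120350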
The design of a data pipeline architecture requires many considerations based on the usage scenarios. For example, does your pipeline need to handle streaming data? If so, what rate of data do you expect? How much and what types of processing need to happen in the data pipeline? Is the data being generated in the cloud or on-premises, and where does it need to go? Do you plan to build the pipeline with microservices? Are there specific technologies you can leverage for the implementation?
As live streaming continues to represent a huge portion of all internet traffic, leading content providers Phenix, Net Insight, and Mynet Inc. have selected Oracle Cloud Infrastructure to support growing global demand for streaming content.
According to Go-Globe, live streaming is expected to account for 82 percent of all internet traffic this year, and the leading global providers of streaming content all have one thing in common: they need infrastructure to securely store and quickly access huge catalogs of digital content. As the popularity of streaming content continues to grow, providers are looking for technology partners to support their global growth and massive data demands.
“Oracle’s modern, second-generation cloud is built and optimized to help enterprises run their most demanding workloads, including digital content, swiftly and securely,” said Vinay Kumar, vice president of product management at Oracle. “We architected our cloud specifically for low latency and high, consistent performance, making it an ideal platform for live streaming. We are seeing companies around the globe take advantage of our unique architecture and low networking costs to reach their customers in real time.”
Phenix
During commercial breaks at last year’s Oscars, Phenix provided the real-time “audience cam” capturing top celebrities in informal moments chatting and socializing amongst themselves via a stream from the Dolby Theater to a global audience watching online. ABC employed Phenix’s high availability workflows including active-active encoding and multi-path ingest over two independent internet connections for the highest reliability. At the peak, Phenix streamed to 110,000 concurrent viewers at a stream join rate of more than 80,000 per second, totaling 1,400,000 viewers with an end-to-end latency of less than 1⁄2 second.
Phenix sought a cloud infrastructure solution that could deliver uninterrupted video streams tailored to the internet connection speed and quality of millions of end-user devices. Ultimately, Phenix chose Oracle Cloud Infrastructure because of Oracle’s global, enterprise-focused presence and exceptional customer service and support. Phenix needed a partner that would provide leading performance for virtual cloud infrastructure and a high degree of transparency.
“Most streaming providers today don’t deliver an optimal user experience due to rebuffering, latency, and audience drift. We rely on Oracle Cloud Infrastructure for rapid scalability when an event suddenly gains thousands of new viewers. Phenix enables users to watch real-time content synchronously regardless of the device, operating system, or network connection quality,” said Dr. Stefan Birrer, Chief Software Architect and co-founder, Phenix. “The Oracle Cloud Infrastructure platform meets our challenging requirements of performance and scale, has increased the efficiency of workloads up to 40 percent and decreased networking costs by 70 percent. This allows us to pass significant savings on to our customers.”
Net Insight
Emmy Award-winning Net Insight facilitates television streaming for the Olympics, Super Bowl, Oscars, and other major live events, and offers people and resource management for TV streaming projects during these events. The company was eager to offer its platform as a service to its customers, but faced a monumental obstacle with more than 20 years of data stored in customer-owned warehouses.
“We needed a cloud solution that could support our goal of making our platform available as a service, so we could provide our customers with more complete analysis and real insights from all the scheduling data they gather,” said Crister Fritzson, CEO, Net Insight. “The combination of Oracle Autonomous Database and Oracle Analytics Cloud provided the most complete, clean offering on the market. Not only are we able to securely house that 20 years of data, we now have the ability to gain critical business insights from it in real time. In addition to offering a superior solution—no other database self-heals, self-tunes and self-secures—the high level of customer support provided by Oracle played a critical role at every level of the entire project.”
Mynet
Mynet, a game service business that has specialized in online game management for over 60 titles since it was founded, aims to provide gamers with an “exciting space over a longer time.” With a business model based on working with game makers to create a longer life for online titles, Mynet sought an online game operation platform for “Age of Ishtaria,” a beautifully illustrated, fast-paced, action battle role-playing game (RPG).
“A stable environment is essential in online game operations, and providing a space where users can enjoy the game over a long period of time is a huge benefit to our customers,” said Mr. Yuki Horikoshi, general manager of engineering, Mynet Inc. “Oracle Cloud Infrastructure was the best balance between performance and cost for ‘Age of Ishtaria,’ enabling players to gain a better user experience while also delivering 65 percent cost savings compared to previous cloud service. Those savings then allow us to reinvest in new events and campaigns for our customers.”
Oracle’s Gen 2 Cloud
Oracle’s modern, second-generation cloud is built and optimized specifically to help enterprises run their most demanding workloads securely. With unique architecture and capabilities, Oracle Cloud delivers unmatched security, performance, and cost savings. Oracle’s Generation 2 Cloud is the only one built to run autonomous services, including Oracle Autonomous Linux and Oracle Autonomous Database, the industry’s first and only self-driving database. Oracle Cloud offers a comprehensive cloud computing portfolio, from application development and business analytics to data management, integration, security, artificial intelligence (AI), and blockchain.
Cutting-edge companies are turning to artificial intelligence and machine learning to meet the challenges of the new digital business transformation era.
According to Gartner: “Eighty-seven percent of senior business leaders say digitalization is a company priority and 79% of corporate strategists say it is reinventing their business—creating new revenue streams in new ways“.
But so far, digital change has been challenging. The complexity of the tools, architecture, and environment creates barriers to using machine learning. Using SQL-based relational data management to store and perform data exploration of images reduces those barriers and unlocks the benefits of machine learning.
This blog post demonstrates using popular open source tools MariaDB Server, TensorFlow Python library, and Keras neural-network library to simplify the complexity of implementing machine learning. Using these technologies can help you accelerate your time-to-market by efficiently accessing, updating, inserting, manipulating and modifying data.
Machine Learning on Relational Databases
At the center of the digital business transformation enabled by machine learning are technologies such as chatbots, recommendation engines, personalized communications, intelligent advertisement targeting, and image classification.
Image classification has a wide variety of use cases, from law enforcement and the military to retail and self-driving cars. When implemented with machine learning, image classification can provide real-time business intelligence. The objective of image classification is to identify and portray, as a unique gray level (or color), the features occurring in an image. The most common tools for image classification are TensorFlow and Keras.
TensorFlow is a Python library for fast numerical computing created and released by Google. MariaDB Server is an open source relational database with a SQL interface for accessing and managing data. Keras is an open-source neural-network library written in Python.
In this post, you will discover how to test image classification by enabling interoperability between TensorFlow and MariaDB Server. This post uses the Fashion MNIST dataset which contains 70,000 grayscale images in 10 categories. The images show individual articles of clothing at low resolution (28 by 28 pixels).
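As a side note, the raw dataset itself can be inspected directly with the Keras datasets API if you want to verify those numbers independently (this is separate from the database workflow this post focuses on):

import tensorflow as tf

(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data()
print(train_images.shape, test_images.shape)   # (60000, 28, 28) (10000, 28, 28)
print(train_labels[:5])                        # integer class labels in the range 0-9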
Loading and preparing the data into MariaDB Server is outside the scope of this post. The following tables have been created and populated in advance with the Fashion MNIST dataset.
The following libraries are used to perform basic data exploration with MariaDB Server:
The io module provides Python’s main facilities for dealing with various types of I/O.
Matplotlib is a Python 2D plotting library to produce a variety of graphs across platforms.
Pandas offers data structures and operations for manipulating numerical tables and time series.
The pymysql package contains a pure-Python client library to access MariaDB Server.
Let’s start by connecting to the database server through Python:
import io
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import pandas as pd
import pymysql as MariaDB
%matplotlib inline
conn = MariaDB.connect( host = '127.0.0.1'
, port = 3306
, user = 'mdb'
, passwd = 'letmein'
, db = 'ml'
, charset = 'utf8')
cur = conn.cursor()
Once connected to the MariaDB server, the images in the database can be easily accessed and managed. All the images used for training and testing the model are stored in a single table (tf_images). How the image will be used is defined in the image use table (img_use). In this case, the table has only two tuples, training and testing:
sql="SELECT use_name AS 'Image Role'
, use_desc AS 'Description'
FROM img_use"
display( pd.read_sql(sql,conn) )
| Image Role | Description |
|---|---|
| Training | The image is used for training the model |
| Testing | The image is used for testing the model |
Mapping target attributes to image objects in a dataset is called labeling. The label definition varies from application to application, and there is hardly a universal definition of what a “correct” label is for an image. Using a relational database simplifies the labeling process and provides a way of going from coarse to fine grain labels.
In this example, using the “categories” table, an image has only one label (coarse) as shown below:
sql="SELECT class_name AS 'Class Name' FROM categories"
display( pd.read_sql(sql,conn) )
| | Class Name |
|---|---|
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |
The images table contains all the images to be used for training and testing. Each image has a unique identifier, a label, and a flag indicating whether it is used for training or testing the model. The images are stored in their original PNG format and as pre-processed floating point tensors. A simple inner join on this table can be executed to display the image representations (vector or PNG format), their labels, and the intended usage.
sql="SELECT cn.class_name AS 'Class Name'
, iu.use_name AS 'Image Use'
, img_vector AS 'Vector Representation'
, img_blob AS 'Image PNG'
FROM tf_images AS ti
INNER JOIN categories AS cn ON ti.img_label = cn.class_idx
INNER JOIN img_use AS iu ON ti.img_use = iu.use_id
LIMIT 5"
display( pd.read_sql(sql,conn) )
Class Name  | Image Use | Vector Representation                           | Image PNG
Ankle boot  | Training  | b'\x80\x02cnumpy.core.multiarray\n_reconstruct… | b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00…
T-shirt/top | Training  | b'\x80\x02cnumpy.core.multiarray\n_reconstruct… | b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00…
T-shirt/top | Training  | b'\x80\x02cnumpy.core.multiarray\n_reconstruct… | b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00…
Dress       | Training  | b'\x80\x02cnumpy.core.multiarray\n_reconstruct… | b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00…
T-shirt/top | Training  | b'\x80\x02cnumpy.core.multiarray\n_reconstruct… | b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00…
Using SQL statements makes the data exploration process easy. For example, the SQL statement below shows the image distribution by image label.
sql="SELECT class_name AS 'Image Label'
, COUNT(CASE WHEN img_use = 1 THEN img_label END) AS 'Training Images'
, COUNT(CASE WHEN img_use = 2 THEN img_label END) AS 'Testing Images'
FROM tf_images INNER JOIN categories ON class_idx = img_label
GROUP BY class_name"
df = pd.read_sql(sql,conn)
display (df)
ax = df.plot.bar(rot=0)
   Image Label  Training Images  Testing Images
0  Ankle boot              6000            1000
1  Bag                     6000            1000
2  Coat                    6000            1000
3  Dress                   6000            1000
4  Pullover                6000            1000
5  Sandal                  6000            1000
6  Shirt                   6000            1000
7  Sneaker                 6000            1000
8  T-shirt/top             6000            1000
9  Trouser                 6000            1000
There are 6,000 images for each label in the training set and 1,000 images for each label in the testing set. There are 60,000 total images in the training set and 10,000 total images in the testing set.
Individual articles of clothing are stored as low resolution images. Since the database can store those images efficiently as Binary Large OBjects (BLOBs) it is very easy to retrieve an image using SQL, as shown below:
sql="SELECT img_blob
FROM tf_images INNER JOIN img_use ON use_id = img_use
WHERE use_name = 'Testing' and img_idx = 0"
cur.execute(sql)
data = cur.fetchone()
file_like=io.BytesIO(data[0])
img = mpimg.imread(file_like)
plt.imshow(img)
Above: image from the Fashion MNIST dataset
This first part of the blog series has demonstrated how a relational database can be used to store and perform data exploration of images using simple SQL statements. Part 2 will show how to format the data into the data structures needed by TensorFlow, and then how to train the model, perform predictions (i.e., identify images) and store those predictions back into the database for further analysis or usage.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Despite the complex-sounding name, probabilistic data structures are relatively simple data structures that can be very useful for solving streaming-analytics problems at scale.
Probabilistic data structures make heavy use of concepts from information theory (such as hashing) to give you an approximate view of some characteristics of your data. The reason you might accept an approximation instead of a complete and precise description is that, in exchange, these structures save an enormous amount of storage space and deliver great performance compared to traditional data structures.
Streaming analytics
One situation where performance and space savings are extremely important is streaming analytics. Streaming analytics, as opposed to batch analytics, are about providing insight on a dataset of unbounded size that gets constantly streamed into the analytics engine. The unbounded nature of those datasets precludes the usage of some traditional approaches, making probabilistic data structures an invaluable tool in a data engineer’s toolbelt.
Let’s take a look at some practical examples where Redis can help you perform analytics at a huge scale using probabilistic data structures. In the following examples we will see HyperLogLog, which is part of Redis, and other data structures (Bloom filters, TopK, Count-min Sketch) which are available in RedisBloom, a Redis module developed by Redis Labs.
HyperLogLog
HyperLogLog (HLL) counts unique items in a stream, without having to remember the whole history. In fact, an HLL counter takes up to 12KB of memory regardless of how many items you put into it. It has a standard error rate of 0.81%, which is perfectly acceptable for most streaming analytics use cases.
Adding new items
To add new items to an HLL counter you must use PFADD:
PFADD home-uniques "73.23.4.3" "185.23.54.8" "user1@foo.com" "user2@bar.com"
In the above example, we created an HLL counter to count unique views of a page on our website and added four entries to it: two anonymous users identified by IP and two known users identified by their email address.
To the HLL counter, all elements are strings without any specific meaning, so it’s up to you to give them the right structure. As an example, we might want to count unique pageviews per day, in which case we could structure our elements like this:
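The element format from the original example isn't shown in this excerpt; one simple convention is to prefix every entry with the day it was observed, so the same visitor produces a different element on each day (key and values below are illustrative):
PFADD home-uniques "2020-03-01:73.23.4.3" "2020-03-01:user1@foo.com"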
This way, two views from the same user on different days will produce different elements. Keep in mind that the length of each element won’t affect the amount of space consumed by the counter because each entry will get hashed before insertion. The same applies for the data structures presented next.
Counting
To get the count from an HLL counter you must use PFCOUNT:
PFCOUNT home-uniques
This command will return the total number of unique items present in the filter. More interestingly, when called on multiple counters at once, PFCOUNT will perform set union before counting, meaning that any element present in more than one counter will be counted only once.
PFCOUNT blogpost1-uniques blogpost2-uniques
This command will return the unique views counted from the union of the two counters. This is useful because by dividing that number by the sum of the individual unique pageviews you will also get a measure of correlation between the two pages.
HLL counters also support an operation called PFMERGE, which is basically the same operation that PFCOUNT performs when called on multiple counters (i.e., set union).
Bloom filters
Bloom filters answer set membership questions (i.e. “is this element part of the set?”). They also deliver extreme space savings but, unlike HLL, the amount of space they require depends on the number of elements you plan to add to them and the error rate you are willing to accept. To give you a rough idea, a Bloom filter sized for 1 million items (of any size) with a 1% error rate takes up approximately 1MB of storage space.
About the error rate
In HLL, the error rate is manifested in the total count being slightly off compared to the actual number. In Bloom filters, the error probability affects positive answers. In other words, when you ask a Bloom filter if it contains a given element, the answer is either “Probably yes” or “Definitely no.”
Usage examples
Bloom filters can be used to answer such questions as:
Is this URL malicious?
Is this word contained in the document?
Has this URL already been crawled?
Was this entry already present in the stream?
Creating a Bloom filter
To create a Bloom filter with specific settings you must use BF.RESERVE:
BF.RESERVE banned-urls 0.001 100000
This command creates a filter with a 0.1% error rate for 100K elements. The command takes a few more optional arguments about auto-scaling. Auto-scaling is a RedisBloom feature that adds more memory to the filter automatically when the capacity limit is reached. We still have to specify a target capacity because there are performance implications behind the auto-scaling feature, meaning that you want to get the initial estimate right whenever possible and rely on auto-scaling only if necessary.
Adding elements
To add elements to a Bloom filter you must use BF.ADD or BF.MADD (to add multiple elements at once):
BF.MADD crawled-urls google.com facebook.com
Very straightforward. BF.MADD adds several elements in one call, while BF.ADD takes a single element. If you want to skip the BF.RESERVE step, you can either configure the default size for all filters, or use BF.INSERT.
Testing set membership
To test if an element is part of the set, you must use BF.EXISTS or BF.MEXISTS (to test multiple elements at once):
BF.MEXISTS crawled-urls google.com reddit.com
Testing membership is very fast, but this is where the auto scaling functionality can have a negative impact if overused. Every time the filter is extended, it needs to look for the item in more alternative locations. Each check still happens very quickly, but choosing a bad base size might require the filter to scale up enough times to impact the performance of this command.
Deleting elements
Bloom filters don’t support deleting an element once it’s added to a filter, but the good news is that RedisBloom also includes a Cuckoo filter, a drop-in replacement for the Bloom filter that also supports item deletion.
Count-min Sketch
Count-min Sketch (CM sketch) answers item frequency questions. In some ways CM sketch is similar to HyperLogLog as they both count elements, with the crucial difference that CM sketch doesn’t count unique elements, but rather how many copies of each element you added to it.
Examples of questions that CM sketch can answer:
Is this user making too many requests?
How common is this word in the document?
And more generally, is this element a “heavy hitter”?
As in the previous examples, there is some imprecision involved. In the case of CM sketch, the issue is that it always overestimates the counts. The degree of overestimation depends on the options that you specify when creating the counter.
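In RedisBloom, CM sketch is exposed through the CMS.* commands. A minimal sketch with illustrative key and item names: create the sketch from a target error and certainty, increment counts as items arrive, then query an item's estimated frequency.
CMS.INITBYPROB user-requests 0.001 0.01
CMS.INCRBY user-requests user42 1
CMS.QUERY user-requests user42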
TopK
TopK is basically a regular heap attached to a probabilistic frequency counter like Count-min Sketch. The result is that TopK will use the frequency estimates to keep in memory only the “top K” elements, for a configurable value of K.
While CM sketch can only tell you the frequency of a given element, TopK is also able to return the elements themselves, if they’re frequent enough. Of all the data structures described here, this is the only one able to return (some of) the elements that you put in it. As such, the size of those elements matters in terms of space usage.
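RedisBloom exposes this through the TOPK.* commands; a minimal sketch with illustrative names, reserving a top-10 list, adding observed items, and listing the current heavy hitters:
TOPK.RESERVE trending-products 10
TOPK.ADD trending-products product:42 product:99
TOPK.LIST trending-products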
To learn more about streaming analytics with probabilistic data structures, check out RedisBloom, read the documentation, and spin up a Docker container to play with it. If you’re looking for ways to integrate Redis into your streaming analytics pipelines, take a look at Redis Streams.
Hazelcast Jet allows you to distribute stream processing over several cluster nodes. While it comes with several out-of-the-box sources to read from (Hazelcast IMap, JMS, JDBC, etc.), and sinks to write to, there’s no Java 8 streams source. In this post, we are going to create an adapter to bridge this gap.
A Look at the SourceBuilder API
The com.hazelcast.jet.pipeline.Sources class contains the available sources mentioned above. That should be your first stop because nobody wants to reinvent the wheel.
To create a custom source, the entry point is the SourceBuilder class. It contains all the necessary plumbing to create one such source.
The usage is quite straightforward:
SourceBuilder
.batch("name", new CreateFunction())
.fillBufferFn(new FillBufferFunction())
.build();
This deserves some explanation. The batch() method requires two arguments: a name, as well as a placeholder where the state can safely be stored. Remember, Jet is distributed; hence the state cannot be kept in places that Jet doesn’t know. That prevents the usage of fields of “standard” objects.
The fillBufferFn() method requires a single argument, a function that will read data and put said data in a buffer for Jet’s consumption.
Finally, notice the slight difference between the names used in the API and the names of the standard Java 8 functional interfaces, e.g., BiConsumerEx instead of BiConsumer. The latter cannot be used, as regular functional interfaces are not serializable. Most parameters in Jet methods need to be sent over the wire; hence they need to implement Serializable.
It’s easy to move from the batch API to the streaming API. It requires two changes:
To use the stream() method instead of batch()
To add a timestamp to each element before calling build(). Several methods are available.
Implementing the Adapter
With that understanding, let’s implement the adapter.
The create function
The create function is a FunctionEx that accepts a Processor.Context type and returns any type.
Remember that the latter should contain the state. We could implement a cursor over the stream to keep track of what was the last element read. Yet this is the textbook definition of an Iterator!
The first draft looks like the following:
public class Java8StreamSource<T> implements FunctionEx<Processor.Context, Iterator<T>> {
    private final Stream<T> stream;
    public Java8StreamSource(Stream<T> stream) {
        this.stream = stream;
    }
    @Override
    public Iterator<T> applyEx(Processor.Context context) {
        return stream.iterator();
    }
}
Unfortunately, this is not good enough. The main issue with the code is that Java’s Stream is not Serializable! Hence, Jet won’t be able to send it over the wire. Even if it started in embedded mode on a single node with no network hops involved, Jet would throw an exception at runtime to avoid any unwanted surprise in the future.
We have to first collect elements in a serializable collection, such as ArrayList. But this cannot be achieved if the stream is of unbounded size—we need to keep the size within limits. This results in the following improved code:
public class Java8StreamSource<T> implements FunctionEx<Processor.Context, Iterator<T>> {
    private final List<T> elements;
    public Java8StreamSource(Stream<T> stream, int limit) {
        this.elements = stream.limit(limit).collect(Collectors.toList());
    }
    @Override
    public Iterator<T> applyEx(Processor.Context context) {
        return elements.iterator();
    }
}
The fill function
The fill function is a bi-consumer of the state object (the iterator) and the buffer to fill, provided by Jet.
Let’s try that:
public class Java8StreamFiller<T> implements BiConsumerEx<Iterator<T>, SourceBuffer<T>> {
    @Override
    public void acceptEx(Iterator<T> iterator, SourceBuffer<T> buffer) {
        while (iterator.hasNext()) {
            buffer.add(iterator.next());
        }
    }
}
That’s pretty good if the iterator contains a limited number of items. Otherwise, we could potentially overflow the buffer. To prevent overflow, let’s add a limit to the number of elements added in one call:
public class Java8StreamFiller<T> implements BiConsumerEx<Iterator<T>, SourceBuffer<T>> {
    @Override
    public void acceptEx(Iterator<T> iterator, SourceBuffer<T> buffer) {
        for (var i = 0; i < Byte.MAX_VALUE && iterator.hasNext(); i++) {
            buffer.add(iterator.next());
        }
    }
}
Byte.MAX_VALUE (127) has the benefit of being quite low.
Putting it all together
The final code looks like the following:
public static void main(String[] args) {
var stream = Stream.iterate(1, i -> i + 1);
var pipeline = Pipeline.create();
var batch = SourceBuilder
.batch("java-8-stream", new Java8StreamSource<>(stream))
.fillBufferFn(new Java8StreamFiller<>())
.build();
pipeline.drawFrom(batch)
.drainTo(Sinks.logger());
var jet = Jet.newJetInstance(new JetConfig());
try {
jet.newJob(pipeline).join();
} finally {
jet.shutdown();
}
}
Improving the Draft Implementation
While it works “in general,” the above code suffers from a huge limitation: it doesn’t transfer the Stream itself over the wire (because Stream is not Serializable) but a list of the elements that belong to the stream. With this approach, it’s not possible to cope with infinite streams, which is the reason for the limit parameter in the constructor of Java8StreamSource. Surely, we can do better!
Actually, we can. As in many cases, we can wrap the stream in a SupplierEx. This is a Hazelcast-specific Supplier that also extends Serializable.
The updated code looks like this:
public class Java8StreamSource<T> implements FunctionEx<Processor.Context, Iterator<T>> {
    private final SupplierEx<Stream<T>> supplier;
    public Java8StreamSource(SupplierEx<Stream<T>> supplier) {
        this.supplier = supplier;
    }
    @Override
    public Iterator<T> applyEx(Processor.Context context) {
        return supplier.get().iterator();
    }
}
The pipeline just needs to be updated accordingly:
SourceBuilder
.batch("java-8-stream", new Java8StreamSource<>(() -> Stream.iterate(1, i -> i + 1)))
.fillBufferFn(new Java8StreamFiller<>())
.build();
Notice how the stream is wrapped in a lambda on line 2.
Conclusion
In this post, we gave a quick glance at the extensibility of Jet by creating a custom source. The source wraps a Java 8 stream and makes it available for Jet to consume. The above example can easily be adapted to your context so that you can create your own sources (and sinks). Happy integration!
The source code for this post is available on GitHub.
PostgreSQL and MongoDB are two popular open source relational (SQL) and non-relational (NoSQL) databases available today. Both are maintained by very experienced global development teams and are widely used in many industries for administration and analytical purposes. MongoDB is a NoSQL document-oriented database that stores data as key-value pairs expressed in JSON or BSON; it provides high performance and scalability along with data modelling and data management of huge data sets in enterprise applications. PostgreSQL is a SQL database designed to handle a range of workloads in applications supporting many concurrent users; it is a feature-rich database with high extensibility, which allows users to create custom plugins, extensions, data types, and common table expressions to expand existing features.
I have recently been involved in the development of a MongoDB Decoder Plugin for PostgreSQL, which can be paired with a logical replication slot to publish WAL changes to a subscriber in a format that MongoDB can understand. Basically, we would like to enable logical replication between MongoDB (as subscriber) and PostgreSQL (as publisher) in an automated fashion. Since the two databases are very different in nature, physical replication of WAL files is not applicable in this case. Logical replication, as supported by PostgreSQL, is a method of replicating data object changes based on a replication identity (usually a primary key), and it is the ideal choice for this purpose because it is designed to allow sharing object changes between PostgreSQL and multiple other databases. The MongoDB Decoder Plugin plays a very important role, as it is directly responsible for producing a series of WAL changes in a format that MongoDB can understand (i.e., JavaScript and JSON).
In this blog, I would like to share some of my initial research and design approach towards the development of the MongoDB Decoder Plugin.
2. Architecture
Since it is not possible yet to establish a direct logical replication connection between PostgreSQL and MongoDB due to two very different implementations, some kind of software application is ideally required to act as a bridge between PostgreSQL and MongoDB to manage the subscription and publication. As you can see in the image below, the MongoDB Decoder Plugin associated with a logical replication slot and the bridge software application are required to achieve a fully automated replication setup.
Unfortunately, the bridge application does not exist yet, but we do have a plan to develop such an application in the near future. So, for now, we will not be able to have a fully automated logical replication setup. Fortunately, we can utilize the existing pg_recvlogical front-end tool to act as a subscriber of database changes and publish these changes to MongoDB in the form of an output file, as illustrated below.
With this setup, we are able to verify the correctness of the MongoDB Decoder Plugin output against a running MongoDB in a semi-automatic fashion.
3. Plugin Usage
Based on the second architecture drawing above, without the special bridge application, we expect the plugin to be used in a similar way to a normal logical decoding setup. The MongoDB Decoder Plugin is named wal2mongo as of now, and the following examples show the envisioned procedure for making use of the plugin and replicating data changes to a MongoDB instance.
First, we will have to build and install wal2mongo in the contrib source folder and start a PostgreSQL cluster with the following parameters in postgresql.conf. Setting wal_level = logical tells PostgreSQL that replication should be done logically rather than physically (wal_level = replica). Since we are setting up replication between two very different database systems (PostgreSQL vs MongoDB), physical replication is not possible; all table changes will be replicated to MongoDB in the form of logical commands. max_wal_senders = 10 limits the maximum number of WAL sender processes that can be forked to publish changes to subscribers. The default value is 10, which is sufficient for our setup.
wal_level = logical
max_wal_senders = 10
On a psql client session, we create a new logical replication slot and associate it with the MongoDB logical decoding plugin. A replication slot is an important mechanism in logical replication, and this blog from 2ndQuadrant has a really good explanation of its purpose: https://www.2ndquadrant.com/en/blog/postgresql-9-4-slots/
$ SELECT * FROM pg_create_logical_replication_slot('mongo_slot', 'wal2mongo');
where mongo_slot is the name of the new logical replication slot and wal2mongo is the name of the logical decoding plugin that you have previously installed in the contrib folder. We can check the created replication slot with this command:
$ SELECT * FROM pg_replication_slots;
At this point, the PostgreSQL instance will be tracking the changes done to the database. We can verify this by creating a table, inserting or deleting some values and checking the change with the command:
$ SELECT * FROM pg_logical_slot_get_changes('mongo_slot', NULL, NULL);
Alternatively, one can use the pg_recvlogical front-end tool to subscribe to the created replication slot, automatically receive streams of changes in MongoDB format, and output the changes to a file.
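The exact invocation is not shown in this excerpt; assuming the slot created above, the mydb database used later in this post, and mongodb.js as the output file, it would look something like this:
$ pg_recvlogical -d mydb --slot mongo_slot --start -f mongodb.js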
Once initiated, pg_recvlogical continuously streams database changes from the publisher and outputs them in MongoDB format to mongodb.js. It continues streaming until the user manually terminates it or the publisher shuts down. This file can then be loaded into MongoDB using the mongo client tool like this:
$ mongo < mongodb.js
MongoDB shell version v4.2.3
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("39d478df-b8ca-4030-8a05-0e1ebbf6bc44") }
MongoDB server version: 4.2.3
switched to db mydb
WriteResult({ "nInserted" : 1 })
WriteResult({ "nInserted" : 1 })
WriteResult({ "nInserted" : 1 })
bye
4. Terminology Mapping
Both databases use different terminologies to describe how data is stored. Before we can replicate changes to PostgreSQL objects and translate them to their MongoDB equivalents, it is important to gain a clear understanding of the terminologies used in both databases. The table below shows our initial terminology mappings:
PostgreSQL Term | MongoDB Term | MongoDB Description
Database        | Database     | A physical container for collections
Table           | Collection   | A grouping of MongoDB documents; does not enforce a schema
Row             | Document     | A record in a MongoDB collection; can have different fields within a collection
Column          | Field        | A name-value pair in a document
Index           | Index        | A data structure that optimizes queries
Primary Key     | Primary Key  | A record's unique immutable identifier. The _id field holds a document's primary key, which is usually a BSON ObjectID
Transaction     | Transaction  | Multi-document transactions are atomic and available in v4.2
5. Supported Change Operations
Our initial design of the MongoDB Decoder Plugin supports database changes caused by the “INSERT”, “UPDATE” and “DELETE” clauses, with future support for “TRUNCATE” and “DROP”. These are a few of the most common SQL commands used to alter the contents of a database, and they serve as a good starting point. To be able to replicate changes caused by these commands, it is important that the table is created with one or more primary keys. In fact, defining a primary key is required for logical replication to work properly, because it serves as the replication identity so that PostgreSQL can accurately track a table change. For example, if a row is deleted from a table that does not have a primary key defined, the logical replication process will only detect that there has been a delete event, but it will not be able to figure out which row was deleted. This is not what we want. The following are some basic examples of the SQL change commands and their envisioned outputs:
$ BEGIN;
$ INSERT INTO table1(a, b, c) VALUES(1, 'Cary', '2020-02-01');
$ INSERT INTO table1(a, b, c) VALUES(2, 'David', '2020-02-02');
$ INSERT INTO table1(a, b, c) VALUES(3, 'Grant', '2020-02-03');
$ UPDATE table1 SET b='Cary';
$ UPDATE table1 SET b='David' WHERE a = 3;
$ DELETE FROM table1;
$ COMMIT;
The simple SQL commands above can be translated into the following MongoDB commands. This is a simple example to showcase the potential input and output of the plugin, and we will publish more blogs in the near future as development progresses to showcase more advanced cases.
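The plugin's exact output is not reproduced in this excerpt. As a rough sketch, assuming the table lives in the mydb database and column a is the primary key, the MongoDB shell equivalents of the statements above would look something like:
use mydb
db.table1.insertOne({ a: 1, b: "Cary", c: "2020-02-01" });
db.table1.insertOne({ a: 2, b: "David", c: "2020-02-02" });
db.table1.insertOne({ a: 3, b: "Grant", c: "2020-02-03" });
db.table1.updateMany({}, { $set: { b: "Cary" } });
db.table1.updateOne({ a: 3 }, { $set: { b: "David" } });
db.table1.deleteMany({});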
6. Output Modes
A write operation in MongoDB is atomic at the level of a single document, and since MongoDB v4.0, multi-document transaction control is supported to ensure the atomicity of multi-document write operations. For this reason, the MongoDB Decoder Plugin shall support two output modes: normal and transaction.
In normal mode, all PostgreSQL changes are translated to their MongoDB equivalents without considering transactions. In other words, users cannot tell from the output whether the changes were issued by the same transaction or by different ones. The output can be fed directly to MongoDB, which guarantees a certain level of atomicity for operations involving the same document.
Since MongoDB v4.0, there is support for a multi-document transaction mechanism, which acts similarly to transaction control in PostgreSQL. Consider a normal insert operation like this, with transaction ID = 500, within a database named “mydb”, and with cluster_name = “mycluster” configured in postgresql.conf:
$ BEGIN;
$ INSERT INTO table1(a, b, c)
VALUES(1, 'Cary', '2020-02-01');
$ INSERT INTO table1(a, b, c)
VALUES(2, 'Michael', '2020-02-02');
$ INSERT INTO table1(a, b, c)
VALUES(3, 'Grant', '2020-02-03');
$ COMMIT;
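The corresponding transaction-mode output is not included in this excerpt; a rough sketch of what it could look like, using standard MongoDB multi-document transaction syntax and the session naming convention described below (transaction ID 500, cluster name "mycluster"):
var session500mycluster = db.getMongo().startSession();
session500mycluster.startTransaction();
session500mycluster.getDatabase("mydb").table1.insertOne({ a: 1, b: "Cary", c: "2020-02-01" });
session500mycluster.getDatabase("mydb").table1.insertOne({ a: 2, b: "Michael", c: "2020-02-02" });
session500mycluster.getDatabase("mydb").table1.insertOne({ a: 3, b: "Grant", c: "2020-02-03" });
session500mycluster.commitTransaction();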
Please note that the session variable used in the MongoDB output is composed of the word session concatenated with the transaction ID and the cluster name. This guarantees that the variable name stays unique when multiple PostgreSQL databases publish to a single MongoDB instance using the same plugin. The cluster_name is a configurable parameter in postgresql.conf that is used to uniquely identify the PG cluster.
The user has to choose the desired output mode (normal or transaction) depending on the version of the MongoDB instance. MongoDB versions before v4.0 do not support the multi-document transaction mechanism, so the user will have to stick with the normal output mode. MongoDB versions from v4.0 onward support the transaction mechanism, so the user can use either the normal or the transaction output mode. Generally, the transaction output mode is recommended when there are multiple PostgreSQL publishers in the network publishing changes to a single MongoDB instance.
7. Data Translation
PostgreSQL supports far more data types than MongoDB, so some similar data types are treated as one type before publishing to MongoDB. Using the same database name, transaction ID and cluster name as in the previous section, the table below shows some of the popular data types and their MongoDB translations.
MongoDB has gained a lot of popularity in recent years for its ease of development and scaling and is an ideal database for data analytics purposes. Having support for replicating data from multiple PostgreSQL clusters to a single MongoDB instance can bring a lot of value to industries focusing on data analytics and business intelligence. Building a compatible MongoDB Decoder Plugin for PostgreSQL is the first step for us, and we will be sharing more information as development progresses. The wal2mongo project is at the WIP/POC stage and current work can be found here: https://github.com/HighgoSoftware/wal2mongo.
Cary is a Senior Software Developer at HighGo Software Canada with 8 years of industry experience developing innovative software solutions in C/C++ in the field of smart grid & metering prior to joining HighGo. He holds a bachelor's degree in Electrical Engineering from the University of British Columbia (UBC) in Vancouver (2012) and has extensive hands-on experience in technologies such as advanced networking, network & data security, smart metering innovations, deployment management with Docker, the software engineering lifecycle, scalability, authentication, cryptography, PostgreSQL & non-relational databases, web services, firewalls, embedded systems, RTOS, ARM, PKI, Cisco equipment, and functional and architecture design.
As we draw ever closer to RedisConf 2020 Takeaway starting May 12, we want to hear from you about the cool things you use Redis for. The Redis community is one of the most inspiring and creative groups of developers in the world—and we want to showcase the community’s achievements. That’s why we’re excited to announce the Rediscover Redis Competition!
The Rediscover Redis Competition
We’re challenging members of the Redis community to show off how they have rediscovered Redis. Almost every developer knows how to use Redis as a cache to build top-performing applications. But not everyone is leveraging Redis to its fullest potential.
We’re collecting stories from the Redis community to highlight examples of Redis’ versatility and demonstrate how Redis is behind some of our most successful applications. And you can get some pretty sweet prizes from participating—the top three submissions get their projects featured in the keynote at RedisConf Takeaway and a Valve Index VR Kit (a $999 value)!
So what do you have to do? With this year’s Rediscover Redis theme at RedisConf Takeaway, we want you to submit your project explaining how you have “rediscovered” Redis and taken advantage of more of its limitless potential.
While you can submit your project in a variety of forms, we’re encouraging video submissions—with every valid submission we receive, Redis Labs will donate $100 to the Feeding America COVID-19 Response Fund, a national food-raising effort supporting food banks, children out of school, and people out of work during this difficult time.
Each submission should include your name, email, company, a project description, and a link to your project if possible.
Optional demo videos must be linked from a platform like YouTube. There will be no option to upload directly to the site.
Here’s what we’re grading on:
Project must implement Redis for a use case beyond caching—be sure to explain how in the description.
Entries will be judged by their innovative and extensive use of Redis capabilities beyond cache.
Extra consideration given to submissions that describe business value achieved and include videos.
Note: Redis Labs reserves the right to use submitted videos for marketing purposes
Submissions are due by end of day Friday, April 17, 2020. Winners will be notified by April 24, 2020.
Tips and inspiration
Wondering what would make a great video? Check out the winning Redis Day Bangalore Hackathon projects. Redis Geeks were challenged to demonstrate how to use Redis beyond caching in three categories: Event-Driven Architecture; Redis Modules; and Integrations, Plug-ins, Clients, and Frameworks.
One example is Story of Being On-Call, created by Atish Andhare: a RedisTimeSeries-powered Slackbot designed to provide quick observations of your systems on the go, such as when you’re commuting or traveling.
Even if you don’t submit a video, don’t miss RedisConf 2020 Takeaway, starting May 12. RedisConf is a learning conference for developers and cloud professionals from the makers of the world’s most-loved database. It offers the perfect opportunity to dive into the latest innovations and trends in data platforms, share your ideas, learn from valuable experiences of Redis users across the globe, and get hands-on training. Register for free here.
Moving to remote operations can be difficult in any industry, but especially in education, where many stakeholders need to be prepared – students, teachers, families, and administrators. Moving a highly structured, in-person environment online is no simple task, so educational leaders across age and learner groups seek out best practices to support remote learning, teaching, and core operations.
In the event of environmental disruptions, K12 schools, colleges, and universities are using virtual desktops and applications as one solution to help minimize downtime and support staff and students from anywhere, anytime, on any device.
For students and faculty, this means there is no disruption in learning—the format switches to a distance learning model. For administration, this means supporting remote students and faculty and managing mission-critical work streams with a distributed workforce. For IT, this means addressing application and hardware requirements and managing policies and profiles securely and reliably.
Here are three examples of how the cloud can help education quickly scale to support disruptions.
Expanding access to labs through cloud-based solutions
While on-premises computer labs must close when a physical building closes, virtual labs can provide students with instant access to desktop applications streamed through an encrypted, secure browser—regardless of hardware or device. Virtual application streaming can be used for any type of curriculum, from college-level engineering to administrative apps in K12 classrooms. Learn how Cornell University reimagined course delivery with Amazon WorkSpaces.
Cloud-based solutions, like Amazon AppStream 2.0, run a school’s operating system and applications in the cloud in a scalable, managed environment—regardless of location. Virtual applications are built for the most security-sensitive organizations, as data is not stored on users’ computers.
Enabling administrators to access core applications via virtual desktop
Additionally, many institutions are looking for options to maintain mission-critical applications (payroll, employee records, and more) remotely and securely. Many organizations are looking to persistent desktops (like Amazon WorkSpaces) and application streaming to fill the gap that traditional infrastructure has left behind.
Communicating effectively when it matters the most
Mass emails and texts have changed how schools announce major changes from snow days to burst pipes. However, some emergencies will cause schools to struggle to support inbound call volume from concerned students, parents, and others seeking information.
Virtual call center solutions can help expand your capacity while not requiring those answering phones to be physically on-campus. Amazon Connect creates flexibility that allows institutions to rapidly shift operations between buildings or even off campus in minutes. Whether it’s supporting student workers with the help desk phone while they are off campus or administrators unable to physically be at their desk, these solutions can help provide your school options without extensive setup.
Learn more about AWS’s End User Computing (EUC) partners, who support institutions that need to provision, protect, and gain intelligence from end-point devices.
A Pimoroni STS-Pi Robot Kit connected to AWS for remote control and viewing.
A telepresence robot allows you to explore remote environments from the comfort of your home through live-streamed video and remote control. These types of robots can improve the lives of the disabled, the elderly, or those who simply cannot be with their coworkers or loved ones in person. Some are used to explore off-world terrain and others for search and rescue.
This guide walks through building a simple telepresence robot using a Pimoroni STS-PI Raspberry Pi robot kit. A Raspberry Pi is a small low-cost device that runs Linux. Add-on modules for Raspberry Pi are called “hats”. You can substitute this kit with any mobile platform that uses two motors wired to an Adafruit Motor Hat or a Pimoroni Explorer Hat.
The sample serverless application uses AWS Lambda and Amazon API Gateway to create a REST API for driving the robot. A Python application running on the robot uses AWS IoT Core to receive drive commands and authenticate with Amazon Kinesis Video Streams with WebRTC using an IoT Credentials Provider. In the next blog I walk through deploying a web frontend to both view the livestream and control the robot via the API.
Prerequisites
You need the following to complete the project:
A Pimoroni STS-Pi robot kit, Explorer Hat, Raspberry Pi, camera, and battery.
Estimated Cost: $120
There are three major parts to this project. First deploy the serverless backend using the AWS Serverless Application Repository. Then assemble the robot and run an installer on the Raspberry Pi. Finally, configure and run the Python application on the robot to confirm it can be driven through the API and is streaming video.
Deploy the serverless application
In this section, use the Serverless Application Repository to deploy the backend resources for the robot. The resources to deploy are defined using the AWS Serverless Application Model (SAM), an open-source framework for building serverless applications using AWS CloudFormation. To understand more deeply how this application is built, look at the SAM template in the GitHub repository.
The Python application that runs on the robot requires permissions to connect as an IoT Thing and subscribe to messages sent to a specific topic on the AWS IoT Core message broker. The following policy is created in the SAM template:
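The policy document itself is not reproduced in this excerpt. A minimal sketch of the kind of AWS IoT policy the template defines is shown below; the actions are standard AWS IoT actions, and the real template most likely scopes the resources to the robot's thing and topic rather than using wildcards:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["iot:Connect", "iot:Subscribe", "iot:Receive"],
      "Resource": "*"
    }
  ]
}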
To transmit video, the Python application runs the amazon-kinesis-video-streams-webrtc-sdk-c sample in a subprocess. Instead of using separate credentials to authenticate with Kinesis Video Streams, a Role Alias policy is created so that IoT credentials can be used.
This role grants access to connect and transmit video over WebRTC using the Kinesis Video Streams signaling channel deployed by the serverless application.
A deployed API Gateway endpoint, when called with valid JSON, invokes a Lambda function that publishes to an IoT message topic, RobotName/action. The Python application on the robot subscribes to this topic and drives the motors based on any received message that maps to a command.
On the next page, under Application Settings, fill out the parameter, RobotName.
Choose Deploy.
Once complete, choose View CloudFormation Stack.
Select the Outputs tab. Copy the ApiURL and the EndpointURL for use when configuring the robot.
Create and download the AWS IoT device certificate
The robot requires an AWS IoT root CA (fetched by the install script), certificate, and private key to authenticate with AWS IoT Core. The certificate and private key are not created by the serverless application since they can only be downloaded on creation. Create a new certificate and attach the IoT policy and Role Alias policy deployed by the serverless application.
Choose the Thing that corresponds with the name of the robot.
Under Security, choose Create certificate.
Choose Activate.
Download the Private Key and Thing Certificate. Save these securely, as this is the only time you can download this certificate.
Choose Attach Policy.
Two policies are created and must be attached. From the list, select Policy AliasPolicy-
Choose Done.
Flash an operating system to an SD card
The Raspberry Pi single-board Linux computer uses an SD card as the main file system storage. Raspbian Buster Lite is an officially supported Debian Linux operating system that must be flashed to an SD card. Balena.io has created an application called balenaEtcher for the sole purpose of accomplishing this safely.
Insert the SD card into your computer and run balenaEtcher.
Choose the Raspbian image. Choose Flash to burn the image to the SD card.
When flashing is complete, balenaEtcher dismounts the SD card.
Configure Wi-Fi and SSH headless
Typically, a keyboard and monitor are used to configure Wi-Fi or to access the command line on a Raspberry Pi. Since it is on a mobile platform, configure the Raspberry Pi to connect to a Wi-Fi network and enable remote access headless by adding configuration files to the SD card.
Re-insert the SD card to your computer so that it shows as volume boot.
Create a file in the boot volume of the SD card named wpa_supplicant.conf.
Paste in the following contents, substituting your Wi-Fi credentials.
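The configuration block is not included in this excerpt; a typical headless wpa_supplicant.conf looks like the following (replace the country code, SSID, and passphrase with your own values):
country=US
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1

network={
    ssid="YOUR_WIFI_SSID"
    psk="YOUR_WIFI_PASSWORD"
}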
Create an empty file without a file extension in the boot volume named ssh. At boot, the Raspbian operating system looks for this file and enables remote access if it exists. This can be done from a command line:
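The command itself is not shown in this excerpt; on macOS or Linux, with the boot volume mounted (the mount point below is an assumption), it is a single touch:
touch /Volumes/boot/ssh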
Since the installation may take some time, power the Raspberry Pi using a USB 5V power supply connected to a wall plug rather than a battery.
Connect remotely using SSH
Use your computer to gain remote command line access of the Raspberry Pi using SSH. Both devices must be on the same network.
Open a terminal application with SSH installed. It is already built into Linux and macOS; to enable SSH on Windows, follow these instructions.
Enter the following to begin a secure shell session as user pi on the default local hostname raspberrypi, which resolves to the IP address of the device using MDNS:
ssh pi@raspberrypi.local
If prompted to add an SSH key to the list of known hosts, type yes.
When prompted for a password, type raspberry. This is the default password and can be changed using the raspi-config utility.
Upon successful login, you now have shell access to your Raspberry Pi device.
Enable the camera using raspi-config
A built-in utility, raspi-config, provides an easy to use interface for configuring Raspbian. You must enable the camera module, along with I2C, a serial bus used for communicating with the motor driver.
In an open SSH session, type the following to open the raspi-config utility:
sudo raspi-config
Using the arrows, choose Interfacing Options.
Choose Camera. When prompted, choose Yes to enable the camera module. Then choose I2C and enable it as well, since the motor driver communicates over I2C.
While the script installs, proceed to the next section.
Configure the code
The Python application on the robot subscribes to AWS IoT Core to receive messages. It requires the certificate and private key created for the IoT thing to authenticate. These files must be copied to the directory where the Python application is stored on the Raspberry Pi.
It also requires that the IoT Credentials endpoint be added to the file config.json in order to assume the permissions necessary to transmit video to Amazon Kinesis Video Streams.
Open an SSH session on the Raspberry Pi.
Open the certificate.pem file with the nano text editor and paste in the contents of the certificate downloaded earlier.
Provide the following information:
IOT_THINGNAME: The name of your robot, as set in the serverless application.
IOT_CORE_ENDPOINT: This is found under the Settings page in the AWS IoT Core console.
IOT_GET_CREDENTIAL_ENDPOINT: Provided by the serverless application.
ROLE_ALIAS: This is already set to match the Role Alias deployed by the serverless application.
AWS_DEFAULT_REGION: Corresponds to the Region the application is deployed in.
Save the file using CTRL+X and Y.
To start the robot, run the command:
python3 main.py
To stop the script, press CTRL+C.
View the Kinesis video stream
The following steps create a WebRTC connection with the robot to view the live stream.
Choose the channel that corresponds with the name of your robot.
Open the Media Playback card.
After a moment, a WebRTC peer to peer connection is negotiated and live video is displayed.
Sending drive commands
The serverless backend includes an Amazon API Gateway REST endpoint that publishes JSON messages to the Python script on the robot.
The robot expects a message of the form:
{ "action": "<direction>" }
Where direction can be "forward", "backwards", "left", or "right".
While the Python script is running on the robot, open another terminal window.
Run this command to tell the robot to drive forward. Replace <ApiURL> with the endpoint listed under Outputs in the CloudFormation stack for the serverless application.
curl -d '{"action":"forward"}' -H "Content-Type: application/json" -X POST https://<ApiURL>/publish
Conclusion
In this post, I show how to build and program a telepresence robot with remote control and a live video feed in the cloud. I did this by installing a Python application on a Raspberry Pi robot and deploying a serverless application.
The Python application uses AWS IoT credentials to receive remote commands from the cloud and transmit live video using Kinesis Video Streams with WebRTC. The serverless application deploys a REST endpoint using API Gateway and a Lambda function. Any application that can connect to the endpoint can drive the robot.
In part two, I build on this project by deploying a web interface for the robot using AWS Amplify.
A preview of the web frontend built in the next blog.
A Kinesis shard provides 1 MB/s ingress and 2 MB/s egress throughput. You can increase stream throughput by adding more shards. Previously, the UpdateShardCount API could scale up to 500 shards. Today’s announcement enables you to rapidly scale your stream capacity up to 10,000 shards, a 20X increase. As an example, if you had 10 shards delivering 10 MB/s throughput, you could previously scale up to 500 shards to ingest 500 MB/s. Now, you can scale up to 10,000 shards, or 10,000 MB/s, in response to a traffic increase with a single API call or a click in the console. You can then scale down capacity after a reduction in traffic.
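As an illustration, the scaling call can be made with a single AWS CLI command (the stream name is a placeholder):
aws kinesis update-shard-count --stream-name my-stream --target-shard-count 10000 --scaling-type UNIFORM_SCALING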
Updating the shard count is an asynchronous operation. Kinesis Data Streams performs splits or merges on individual shards to update the shard count. You can continue to read and write data to your stream while the scaling operation is in progress. We recommend using KPL version 0.14.0 or higher and upgrading KCL to version 1.9 or above to use this capability. Please refer to the API Reference documentation for more details.
The 10,000 shard increase for the UpdateShardCount API is available in all AWS regions.
Placement management software is a web application for the training and placement department of a college. It helps manage student information related to placements.
The web application can be accessed throughout the organization with a proper login. The placement management system lets students and companies register and share all placement-related information in one portal.
Users can easily access the portal, and data can be retrieved in no time. In many colleges, training and placement officers still have to manage students' profiles and documents for placements manually.
The placement officer collects information from the companies that want to recruit students, updates the students from time to time, and organizes student profiles by stream. The placement officer also clearly communicates the needs and requirements of each company.
The administrator plays an important role in the placement portal, approving student registrations and updates. A search engine can be provided for the administrator to search for anything regarding students and companies.
Earlier, it was difficult to communicate information about placement drives to a large number of students at once, because it had to be done manually. The web application was designed to make this communication easy and efficient.
RECRUITER:
CREATE NEW DRIVE:
A new drive contains the students' information as well as the company's information. A new drive can be created by filling out the students' personal and academic details, such as name, program, stream, and batch. The percentage and backlog criteria are set by the organization by default. Only students who are eligible according to the organization's criteria can attend the drive.
If a student has any backlogs, these should also be mentioned along with the stream name. The date of the drive and the last date of application are filled in by default and cannot be changed. The job designation can be selected, and the description is provided by the organization.
Once all the details are entered, we should verify that they are true and accurate. This process helps the company get complete information about the students.
EDUCATION DETAILS:
Students participating in the drive should fill in their information, such as the name of the college and batch. The general criteria of percentage, backlog, and previous-year backlog are set by default for all students; if any changes are needed, they can be made by clicking Add custom criteria.
Program and stream details include 10th and 12th CGPA and percentages, and the package is specified by default according to the organization's ability.
Next, if the student has completed any specialization (masters), its name should be mentioned. The number of years of bond or agreement the company would like to have with the candidate is also mentioned. The date of the drive, the last date for submitting the application, the designation, the job description, and information about the company are provided by default for all students.
DRIVE DETAILS :
Drive details contain the date of the drive, the institution name, the contact person (placement officer) details, contact details (phone number and email address), and the college address.
The page shows the list of students who have applied for the drive along with the date they applied. The applied students can attend the interview on the mentioned date. If students clear any round of the interview and are shortlisted for the next round, their names appear in the list.
Once the students clear their final round, the final result is declared on the same page, listing the students who have cracked the interview. In this way, students can keep track of all these details.
FINAL RESULTS STEPS:
STEP 1
Once you clear the final round, the offer is given in one of three ways:
By group
By individual
By same to all
Groups are made by the recruiter according to streams, and the package is fixed for each group. The same package applies to all students in one group with the same stream. To know individual packages, proceed to step 2.
STEP 2
Only after completing the first step are students eligible for further steps. Click the button to send the offer letter according to the groups created by stream; the individual package is mentioned here.
STEP 3
Only after completing the first two steps can students move on to the third step and add messages. Finally, the offer is sent to the students.
The next page is a review of all the details filled in the above three steps. This step is a confirmation of your package details and stream name. Before you click CREATE, please check that all details are accurate, because once you click Create it is finalized and no further changes can be made.
STUDENTS (SEARCH & VIEW):
The students search & view page shows the details of all students who have applied for the drive. If we fill in the details in the search box and click Search, the matching student details appear.
The list of students who have applied for the drive is shown with the following details:
Students Roll No,
Phone number,
Batch year.
It is a reference that will be helpful to the recruiter.
CALENDAR OF EVENT :
The calendar is used to keep a record of upcoming events. If we add an event on a particular date and time, the calendar gives an alert or notification at that exact time on the date of the event.
FEEDBACK:
Feedback is important for anyone who wants to improve; it helps bring out a better version of themselves.
In the portal, there are questions that students have to answer to give feedback. Students can also add their comments in the feedback section and submit it.
PROFILE:
A professional profile is important for conveying detailed information about a person. It is a snapshot of skills, accomplishments, knowledge, etc.
An employee profile is created with:
company logo,
name of the employee,
Designation (job description),
Sex,
personal Email of the employee,
Phone number, Company name,
Company address.
All these details will be specified in their profile.
HISTORY:
Recruiter history contains information such as company name, contact person, designation, phone number, and email. All this information is recorded in the recruiter's history, which helps maintain records and retrieve information easily.
STUDENT:
Personal Details:
A profile is important for conveying detailed information about a student. It is a snapshot of skills, accomplishments, knowledge, etc.
The personal details of the student include:
College name,
Roll no of the student,
Program, Stream,
Enrollment year,
Current year percentage,
Personal Email address,
Phone number,
Date of birth,
Nationality,
Current address,
Permanent address.
All these personal details help the company to know about the student.
Educational Details:
Education details contain all the information about the student's education: 10th, 12th, degree, masters, specializations, institution names, pass-out years, board or university, percentages, and backlogs. This is useful to the company for verification.
RESUME:
A resume is a formal document that presents a student's educational details, career details, background, unique skills, etc., and helps attract the interviewer's attention by highlighting additional skills. In the portal, the resume can be uploaded in PDF or document form.
NEW DRIVES:
Pending drives help students know about drives or companies they have not yet applied to. A notification lets students view the company details and apply for the job.
Applied drives help students keep track of drives they have already applied to; further details and information will be shared by the company.
NEW DRIVE DETAILS:
If you apply for a drive, it gives complete information about the job, including:
Job designation,
Location of the job,
Drive date,
Last date for applying,
For which stream students are recruiting,
Criteria of percentage and backlog,
Job description in details.
This helps students learn more about the company and apply for the job.
RESULTS:
Once students successfully apply, they are called for an online interview. If they qualify after the first round, the company will intimate them about the next round. If they are rejected, that will also be intimated.
EXPERIENCE:
Students who get selected can share their personal experience in the portal, giving other students interested in that job better clarity about the company and helping them make further decisions.
LANDING PAGE:
LOGIN:
A login is a set of credentials used to verify a user. The login page collects the username and password, allowing the user to log in to the system.
Post drive:
A post drive is a drive that is currently on campus, although many students prefer not to attend the interviews for such companies for specific reasons. The drive record contains information such as name, company, contact number, company name, batch, program, stream, job location, and number of years of bond. All of this information should be filled in.
Four types of users can log in:
Recruiter
Placement director
Institution admin
Students
The recruiter is responsible for recruiting students by conducting interviews according to the requirements of their organization.
The placement director builds the connection between students and company authorities, collects all the information about students who are interested in a drive, and makes sure the drive is conducted smoothly.
The administrator plays an important role in the placement portal, approving student registrations and profile updates. The portal provides the administrator with a search feature for finding any information about students and companies.
Students are the main participants in a drive; they learn about the drive and take part in the interviews in order to get placed in a good company for a bright career.
NOTICE FOR TPO:
A notice can be created by the Training and Placement Department (TPO) to inform students about placements that are happening and companies that are coming to conduct drives. A notice can be kept private to a particular audience: TPO only, students only, or both TPO and students.
ADD TPO ROLES
ADD COLLEGE
ADD PROGRAM AND STREAMS
TPO ROLES & ADD COLLEGE:
A TPO is created to build a training and placement department in the college, which benefits students in their careers as well as the college’s reputation. The TPO record contains the college, TPO name, phone number, and email ID. Once all the details are filled in, the college name is added to the portal.
ADD PROGRAM AND STREAM:
Add a program name and its stream names, then click Save; the list of programs and streams you entered is saved and displayed below. You can add only one program name with multiple stream names at a time.
PLACEMENTS ANALYTICS:
Placement analytics helps the placement officer analyze the placements conducted at the college and review both the achievements and the drawbacks.
The Institute of Business Forecasting has produced an 80-minute virtual town hall on “Forecasting & Planning During the Chaos of a Global Pandemic.” The on-demand video recording is available now and well worth a look. There is much solid practical guidance from an experienced panel:
Eric Wilson, IBF Director of Thought Leadership (moderator)
Dustin Deal, Director of North American Business Operations, Lenovo
Patrick Bower, Sr. Director of Global Supply Chain Planning & Customer Service, Combe Inc.
Andrew Schneider, Global Demand Manager for Supply Chain, Medtronic
John Hellriegel, Sr. Advisor and Facilitator, IBF
Following are my key takeaways from each panelist:
John Hellriegel:
Macro forecasting is hard enough right now, and micro forecasting (down to product level) even more so.
There are many interventions going on beyond what is normal, such as government stimulus and falling oil prices, all adding to the uncertainty and complexity.
High forecast accuracy is unlikely, so a demand planner should focus on helping the business understand the uncertainties and make appropriate decisions.
Simple models with clear assumptions may be most helpful (e.g., “take a 25% reduction across a category for 3 months,” rather than spend a lot of effort trying to adjust each item).
Dustin Deal:
Production is ramping back up in China, yet there are still logistical delays.
Collect data at the macro and micro levels, including channel inventory and sell-through.
Know where channel inventory is low, and where you’ll need immediate replenishment.
Execute planning (e.g. S&OP) more frequently.
Andrew Schneider:
Not doing typical demand planning right now; instead, focus on “demand sanitary services” (cleaning data, figuring out how much is pre-buy to stock vs. actual consumption, etc.).
Focus on demand control and shaping until things get back to normal.
Utilize the coefficient of variation to identify which products are most impacted by Covid-19 (big increases in CoV). Segment the products according to the impact (see the sketch after this list), and consider a risk-based ABC analysis.
Distinguish the “passive” work of observing and collecting data, versus the “proactive” work of driving demand from shortages to substitute available products.
Assess the quality of the demand signal. POS is great if you have it, but if not, try to figure out what the customer really wanted (compared to what orders, cuts, backorders, etc. look like).
Try to use additional streams of external data — what does it say? — not just your own internal data.
Consider the probability distributions of demand, but don’t commit yourself to the extreme and get your organization in trouble later when things normalize.
Don’t worry so much about accuracy right now. Instead, consider the forecast value added (FVA) of different approaches.
Be the calm voice — be realistic and understand the data before overreacting.
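The sketch below (an illustration, not code from the talk) shows one way to compute the coefficient of variation per product and flag items whose variability has jumped; the 2x threshold and the demand figures are arbitrary assumptions.

```javascript
// Coefficient of variation: standard deviation divided by the mean
function coefficientOfVariation(series) {
  const mean = series.reduce((sum, x) => sum + x, 0) / series.length
  const variance = series.reduce((sum, x) => sum + (x - mean) ** 2, 0) / series.length
  return Math.sqrt(variance) / mean
}

// Hypothetical weekly demand per product, split into pre-Covid and recent history
const products = [
  { sku: 'A100', preCovid: [50, 52, 49, 51], recent: [120, 20, 95, 10] },
  { sku: 'B200', preCovid: [30, 31, 29, 30], recent: [28, 32, 30, 29] }
]

// Segment: a large jump in CoV suggests the product is heavily impacted
const impacted = products.filter(p =>
  coefficientOfVariation(p.recent) > 2 * coefficientOfVariation(p.preCovid)
)
console.log(impacted.map(p => p.sku)) // ['A100']
```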
Patrick Bower:
There is a lot of complexity and ambiguity right now, so not much demand planning can happen.
We’ll know more when March closes; for now we can watch week to week, but we can’t fully replan until we know more.
No brand loyalty right now for scarce product, so brands will have a hard time knowing what is true demand.
Document everything and reconsider the assumptions of existing plans. Get close to the data (especially cuts, backorders, and future returns) and know your customers. Be wary of future cancels.
Suppliers may be unable to deliver in short term, so be aware of lead times.
A big challenge is how to interpret data, handle outliers, adjust history, and adjust models. Better to do these in a couple of months when we have a better understanding. Collect the data but don’t rush to make changes.
Use S&OP to plan for the new normal (which will likely include recession).
Take a breath and don’t overreact. Reach out to experienced colleagues who have been through situations like this before.
Use downstream data that reflects true consumer demand.
Adopt and implement advanced analytics and machine learning algorithms in your demand forecasting and planning.
Implement a short-term demand forecasting and planning process.
Incorporate social media information.
Focus on the granular view and regional geographic areas.
How do you handle the abnormal historical data after everything goes back to normal?
SAS Coronavirus Dashboard
For more insight that may be helpful in your forecasting and planning efforts, some of my colleagues at SAS have created a Novel Coronavirus Report using SAS Visual Analytics that depicts the status, locations, spread and trend analysis of the coronavirus.
Data is updated nightly. The ability to visualize the COVID-19 outbreak can help raise awareness, understand its impact, and ultimately assist in prevention efforts. View the SAS Coronavirus dashboard to see maps built with Esri, coronavirus statistics, and an animated timeline of worldwide spread.
Many businesses operate call centers that record conversations with customers for training or regulatory purposes. These vast collections of audio offer unique opportunities for improving customer service. However, since audio data is mostly unsearchable, it’s usually archived in these systems and never analyzed for insights.
Developing machine learning models that accurately understand and transcribe speech is also a major challenge. These models require large datasets for optimal performance, along with teams of experts to build and maintain the software. This puts speech analytics out of reach for the majority of businesses and organizations. Fortunately, you can use AWS services to handle this difficult problem.
In this blog post, I show how you can use a serverless approach to analyze audio data from your call center. You can clone this application from the GitHub repo and modify it to meet your needs. The solution uses Amazon ML services together with scalable storage and serverless compute. The example application has the following architecture:
For call center analysis, this application is useful to determine the types of general topics that customers are calling about. It can also detect the sentiment of the conversation, so if the call is a compliment or a complaint, you could take additional action. When combined with other metadata such as caller location or time of day, this can yield important insights to help you improve customer experience. For example, you might discover there are common service issues in a geography at a certain time of day.
To set up the example application, visit the GitHub repo and follow the instructions in the README.md file.
How the application works
A key part of the serverless solution is Amazon S3, an object store that scales to meet your storage needs. When new objects are stored, this triggers AWS Lambda functions, which scale to keep pace with S3 usage. The application coordinates activities between the S3 bucket and two managed Machine Learning (ML) services, storing the results in an Amazon DynamoDB table.
The ML services used are:
Amazon Transcribe, which transcribes audio data into JSON output, using a process called automatic speech recognition. This can understand 31 languages and dialects, and identify different speakers in a customer support call.
Amazon Comprehend, which offers sentiment analysis as one of its core features. This service returns an array of scores to estimate the probability that the input text is positive, negative, neutral, or mixed.
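For reference, a detectSentiment response has roughly the following shape (the values shown here are illustrative):

```json
{
  "Sentiment": "NEGATIVE",
  "SentimentScore": {
    "Positive": 0.01,
    "Negative": 0.95,
    "Neutral": 0.03,
    "Mixed": 0.01
  }
}
```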
A downstream process, such as a call recording system, stores audio data in the application’s S3 bucket.
When the MP3 objects are stored, this triggers the Transcribe function. The function creates a new job in the Amazon Transcribe service.
When the transcription process finishes, Transcribe stores the JSON result in the same S3 bucket.
This JSON object triggers the Sentiment function. The Sentiment function requests a sentiment analysis from the Comprehend service.
After receiving the sentiment scores, this function stores the results in a DynamoDB table.
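To make these steps concrete, here is a minimal Node.js sketch of what the two Lambda handlers might look like. The TABLE_NAME environment variable, the timestamp-based job name, and the 'id' attribute are illustrative assumptions; the code in the GitHub repo is the authoritative implementation.

```javascript
// In the deployed application these live in separate functions;
// they are shown together here for brevity.
const AWS = require('aws-sdk')
const transcribe = new AWS.TranscribeService()
const s3 = new AWS.S3()
const comprehend = new AWS.Comprehend()
const ddb = new AWS.DynamoDB.DocumentClient()

// Transcribe handler – invoked when a new .mp3 object lands in the bucket
exports.transcribeHandler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '))

    // Start an asynchronous transcription job; Transcribe writes the JSON
    // result back to the same bucket when the job completes.
    await transcribe.startTranscriptionJob({
      TranscriptionJobName: `transcript-${Date.now()}`,
      LanguageCode: 'en-US',
      MediaFormat: 'mp3',
      Media: { MediaFileUri: `s3://${bucket}/${key}` },
      OutputBucketName: bucket
    }).promise()
  }
}

// Sentiment handler – invoked when the .json transcript object is created
exports.sentimentHandler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '))

    // Read the Transcribe output and pull out the transcript text
    const object = await s3.getObject({ Bucket: bucket, Key: key }).promise()
    const transcript = JSON.parse(object.Body.toString()).results.transcripts[0].transcript

    // Score the sentiment (detectSentiment accepts up to 5,000 bytes per request,
    // so very long transcripts would need to be split)
    const sentiment = await comprehend.detectSentiment({
      Text: transcript,
      LanguageCode: 'en'
    }).promise()

    // Persist the result; 'id' is an assumed partition key name
    await ddb.put({
      TableName: process.env.TABLE_NAME,
      Item: {
        id: key,
        transcript,
        sentiment: sentiment.Sentiment,
        scores: sentiment.SentimentScore
      }
    }).promise()
  }
}
```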
There is only one bucket used in the application. The two Lambda functions are triggered by the same bucket, using different object suffixes. This is configured in the SAM template, shown here:
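The template itself is not reproduced in this post, but the relevant event configuration looks roughly like the sketch below. The function names, handler paths, and runtime are placeholders; check the SAM template in the repo for the real values.

```yaml
TranscribeFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: transcribe/app.handler
    Runtime: nodejs12.x
    Events:
      AudioUpload:
        Type: S3
        Properties:
          Bucket: !Ref AudioBucket          # the single application bucket
          Events: s3:ObjectCreated:*
          Filter:
            S3Key:
              Rules:
                - Name: suffix
                  Value: '.mp3'             # MP3 uploads invoke this function

SentimentFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: sentiment/app.handler
    Runtime: nodejs12.x
    Events:
      TranscriptCreated:
        Type: S3
        Properties:
          Bucket: !Ref AudioBucket
          Events: s3:ObjectCreated:*
          Filter:
            S3Key:
              Rules:
                - Name: suffix
                  Value: '.json'            # Transcribe output invokes this one
```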
To test the application, you need an MP3 audio file containing spoken text. For example, in my testing, I use audio files of a person reading business reviews representing positive, neutral, and negative experiences.
After cloning the GitHub repo, follow the instructions in the README.md file to deploy the application. Note the name of the S3 bucket output in the deployment.
Upload your test MP3 files using this command in a terminal, replacing your-bucket-name with the deployed bucket name:
aws s3 cp . s3://your-bucket-name --recursive
Once executed, your terminal shows the uploaded media files:
Navigate to the Amazon Transcribe console, and choose Transcription jobs in the left-side menu. The MP3 files you uploaded appear here as separate jobs:
Once the Status column shows all pending jobs as Complete, navigate to the DynamoDB console.
Choose Tables from the left-side menu and select the table created by the deployment. Choose the Items tab: Each MP3 file appears as a separate item with a sentiment rating and a probability for each sentiment category. It also includes the transcript of the audio.
Handling multiple languages
One of the most useful aspects of serverless architecture is the ability to add functionality easily. For call centers handling multiple languages, ideally you should translate to a common language for sentiment scoring. With this application, it’s easy to add an extra step to the process to translate the transcription language to a base language:
A new Translate Lambda function is invoked by the S3 JSON suffix filter and creates text output in a common base language. The sentiment scoring function is triggered by new objects with the suffix TXT.
In this modified case, when the MP3 audio file is uploaded to S3, you can append the language identifier as metadata to the object. For example, to upload an MP3 with a French language identifier using the AWS CLI:
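For example, a command along the following lines would attach the identifier as object metadata (the file name is hypothetical, and the metadata key name is an assumption about how the application reads it):
aws s3 cp french-review.mp3 s3://your-bucket-name --metadata language=fr-FR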
The first Lambda function passes the language identifier to the Transcribe service. In the Transcribe console, the language appears in the new job:
After the job finishes, the JSON output is stored in the same S3 bucket. It shows the transcription from the French language audio:
The new Translate Lambda function passes the transcript value into the Amazon Translate service. This converts the French to English and saves the translation as a text file. The sentiment Lambda function now uses the contents of this text file to generate the sentiment scores.
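A minimal sketch of that Translate function might look like the following, assuming automatic source-language detection is acceptable and English is the base language; the actual function in the repo may instead read the language identifier captured at upload time.

```javascript
// Translate handler – invoked by the .json transcript objects in the multi-language variant
const AWS = require('aws-sdk')
const s3 = new AWS.S3()
const translate = new AWS.Translate()

exports.handler = async (event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '))

    // Read the Transcribe JSON output and extract the transcript text
    const object = await s3.getObject({ Bucket: bucket, Key: key }).promise()
    const transcript = JSON.parse(object.Body.toString()).results.transcripts[0].transcript

    // Convert the transcript into the base language
    const result = await translate.translateText({
      Text: transcript,
      SourceLanguageCode: 'auto',   // let the service detect French, Spanish, etc.
      TargetLanguageCode: 'en'
    }).promise()

    // Save the translation as a .txt object; this triggers the sentiment function
    await s3.putObject({
      Bucket: bucket,
      Key: key.replace('.json', '.txt'),
      Body: result.TranslatedText
    }).promise()
  }
}
```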
This approach allows you to accept audio in a wide range of spoken languages but standardize your analytics in one base language.
Developing for extensibility
You might want to take action on phone calls that have a negative sentiment score, or publish scores to other applications in your organization. This architecture makes it simple to extend functionality once DynamoDB saves the sentiment scores. By using DynamoDB Streams, you can invoke a Lambda function each time a record is created or updated in the underlying DynamoDB table:
In this case, the routing function could trigger an email via Amazon SES where the sentiment score is negative. For example, this could email a manager to follow up with the customer. Alternatively, you may choose to publish all scores and results to any downstream application with Amazon EventBridge. By publishing events to the default event bus, you can allow consuming applications to build new functionality without needing any direct integration.
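As an illustration, a routing function on the DynamoDB stream might look like the sketch below. The SENDER_EMAIL and MANAGER_EMAIL environment variables are hypothetical, and the function assumes an event source mapping from the table's stream to the Lambda function.

```javascript
// Routing handler – invoked by the DynamoDB stream on the sentiment table
const AWS = require('aws-sdk')
const ses = new AWS.SES()

exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName !== 'INSERT') continue

    // Convert the DynamoDB-typed image into a plain JavaScript object
    const item = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage)

    // Notify a manager only when the overall sentiment is negative
    if (item.sentiment === 'NEGATIVE') {
      await ses.sendEmail({
        Source: process.env.SENDER_EMAIL,                        // verified SES sender
        Destination: { ToAddresses: [process.env.MANAGER_EMAIL] },
        Message: {
          Subject: { Data: `Negative call detected: ${item.id}` },
          Body: { Text: { Data: item.transcript } }
        }
      }).promise()
    }
  }
}
```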
Deferred execution in Amazon Transcribe
The services used in the example application are all highly scalable and highly available, and can handle significant amounts of data. Amazon Transcribe allows up to 100 concurrent transcription jobs – see the service limits and quotas for more information.
The service also provides a mechanism for deferred execution, which allows you to hold jobs in a queue. When the number of executing jobs falls below the concurrent execution limit, the service takes the next job from this queue. This effectively means you can submit any number of jobs to the Transcribe service, and it manages the queue and processing automatically.
To use this feature, there are two additional attributes used in the startTranscriptionJob method of the AWS.TranscribeService object. When added to the Lambda handler in the Transcribe function, the code looks like this:
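A sketch of those parameters is shown below. The two JobExecutionSettings attributes are the documented fields; the surrounding parameter values mirror the earlier handler sketch, and the DATA_ACCESS_ROLE_ARN environment variable is an assumption.

```javascript
// Inside the Transcribe function handler
const params = {
  TranscriptionJobName: `transcript-${Date.now()}`,
  LanguageCode: 'en-US',
  MediaFormat: 'mp3',
  Media: { MediaFileUri: `s3://${bucket}/${key}` },
  OutputBucketName: bucket,
  JobExecutionSettings: {
    AllowDeferredExecution: true,                        // queue jobs above the concurrency limit
    DataAccessRoleArn: process.env.DATA_ACCESS_ROLE_ARN  // IAM role Transcribe uses to access S3
  }
}
await transcribe.startTranscriptionJob(params).promise()
```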
After setting AllowDeferredExecution to true, you must also provide an IAM role ARN in the DataAccessRoleArn attribute. For more information on how to use this feature, see the Transcribe documentation for job execution settings.
Conclusion
In this blog post, I show how to transcribe the content of audio files and calculate a sentiment score. This can be useful for organizations wanting to analyze saved audio for customer calls, webinars, or team meetings.
This solution uses Amazon ML services to handle the audio and text analysis, and serverless services like S3 and Lambda to manage the storage and business logic. The serverless application here can scale to handle large amounts of production data. You can also easily extend the application to provide new functionality, built specifically for your organization’s use-case.
To learn more about building serverless applications at scale, visit the AWS Serverless website.
Organizations across the globe are using advanced analytics and data science to predict and make decisions. They are finding ways to use their vast and diverse data stores to predict the best place to put their next retail store, what products to recommend to customers, how many employees they need for peak hours of operation, and how long a piece of machinery has until it needs maintenance. Public sector organizations in government, education, nonprofit, and healthcare are looking to use data to advance their missions too.
Expanding and improving government response through analytics
Advanced and predictive analytics are being applied in a range of areas including fraud detection, security, safety, healthcare, and disaster response. In the recent article, “Anticipatory government: Preempting problems through predictive analytics,” AWS Partner Network (APN) Premier Consulting Partner Deloitte highlighted examples of analytics at work in the public sector including improving natural disaster readiness, fighting human trafficking, predicting cyberattacks, and preventing child abuse and fatalities. According to the article, 34% of the chief data officers in the US government use predictive modeling.
Cities across the globe are using predictive analytics to predict and align resources for natural disasters like flooding. Transportation authorities are using advanced analytics to improve traffic flow and detect potential flight risks. Federal tax agencies are using big data and predictive analytics to identify tax evaders and improve compliance. Other government agencies are using predictive modeling to identify children at risk of abuse and mistreatment.
Analytic deployments are ideal for the cloud because they are data intensive. Cloud services can support the growing pools of data more cost-effectively and offer improved business continuity for mission-critical data. Database and data warehouse solutions can be spun up in minutes and can dynamically scale to adjust to data volumes. Data lakes can combine diverse datasets for more complex business insights. And the experimentation and implementation of newer technologies such as artificial intelligence (AI), machine learning (ML), or a serverless computing foundation is easier, faster, and more cost-effective in the cloud.
Check out these examples of AWS customers who have implemented analytics solutions on AWS.
FINRA
To respond to rapidly changing market dynamics, the Financial Industry Regulatory Authority (FINRA) moved 90 percent of its data volumes to AWS, to capture, analyze, and store a daily influx of 37 billion records. FINRA uses AWS to power their big data analytics pipeline that handles 135 billion events per day to help monitor the market, prevent financial fraud, protect investors, and maintain market integrity.
Brain Power
Brain Power, together with AWS Professional Services, built a system to analyze body language and help analyze clinical trial videos of children with autism and/or ADHD. The system can use accessible technologies such as webcams and mobile devices to stream video directly to Amazon Kinesis Video Streams and later to Amazon Rekognition to detect body signals. Raw data is ingested into Amazon Kinesis Data Streams and consumed by AWS Lambda functions to analyze and mathematically compute attention and body motion metrics.
Figure 1: Predictive data science with Amazon SageMaker and a data lake on AWS
Learn more
Analytics workloads can be complex and challenging to deploy. The customers highlighted in this post used AWS Professional Services to help build their analytics solutions. You can request an AWS Professional Services consultation on the AWS Professional Services website.
Additionally, many customers who are looking to migrate analytic solutions to AWS have taken the AWS cloud adoption readiness assessment. This online tool measures your level of cloud readiness across six key perspectives: business, people, process, platform, operations, and security. The tool then provides you with a custom report to use in a professional services engagement, or to help kick-off your migration business plan, executive engagement, and the building of your migration roadmap.