Webtrends is a 20-year-old company, and we have done it all when it comes to data. I like to say Webtrends did big data before there was Big Data.
With the vast number of sources from which to ingest data, Webtrends is in a unique position: we deal with datasets of all shapes and sizes collected on behalf of our clients. We are stewards of data for more than 2,000 clients (generating more than 13 billion transactions per day) and make it available to those clients in many different forms. That's a lot of data to deal with, and we have 20 years' worth of it! So this is where we begin our journey to the center of the data in a Big Data world.
About five years ago, Webtrends decided to build a Big Data solution. We realized that, given the growing demand for ad hoc reporting and other analytical solutions, our proprietary data system could not support it all without significant rework. At the time, we were operating a batch processing system, like many analytics vendors today. We collected data in our collection facilities around the world and transferred it to our processing facilities, where we added value to the data, such as geo augmentation and ID translation. Our systems then reported results to our clients through our APIs and applications.
In our ever-connected world, this process was becoming antiquated. The engineering leadership and I sat down and talked about what it would take to make data more prevalent in our products and at Webtrends. With the advent of Big Data, we realized there were only a few viable options, such as using Hadoop to store our clients' data and leveraging YARN to develop analytics and data processing engines. One new idea was to consolidate all of our data storage systems, but then we realized that event-level detail and visitor data would need to live together. Historically, Webtrends had kept detailed visitor-level data in large, expensive relational databases, while event-level data was stored in large data stores such as NASs or attached storage devices, which equated to millions of files.
“It’s a data operating system and a fundamental data platform that in the next couple of years 100 percent of large companies will adopt.” – Mike Gualtieri, principal analyst at Forrester Research
My point is that we knew how to handle data; we had been doing it for a long time and at scale. But what we needed to do was get data to our processing centers more quickly and then work on storing the data for analysis. With the boom of smartphones, people are consuming data at higher rates and are always connected, publishing data to social platforms such as Twitter and Facebook. The ingestion of streaming data made us realize the only way to compete in the digital world was to provide a streaming solution for processing and reporting on the data collected for our customers. This idea became the foundation of our product Webtrends Streams and of our new data platform.
Living in the Fast Lane
Streams, our aptly named streaming platform, consists of collection, augmentation and data processing. The idea of the data pipeline is that we receive an event, collect it, augment it and process it, all within a sub-second window. We needed a way to consume and ingest data at scale without losing it. With thousands of clients relying on us to power their business systems, the data is gold. Coming from a batch process, we realized we needed redundancy and reliability, given that the Internet is not stable. By leveraging technologies like Apache Kafka and Apache Storm, we built a data pipeline with resiliency and high throughput to our processing centers. On average, Webtrends collects and processes events within 40 milliseconds. Getting the data to the processing center was the first step on our journey to the center of data.
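To make the collect, augment, process stages concrete, here is a minimal pure-Python sketch of that pipeline shape. The event fields, geo lookup table and session store are illustrative stand-ins, not the actual Webtrends implementation (which runs on Kafka and Storm):

```python
# Illustrative sketch of a collect -> augment -> process pipeline.
# The geo table, event shape, and IDs below are hypothetical examples.

GEO_BY_IP_PREFIX = {
    "203.0": {"country": "AU", "city": "Sydney"},
    "198.51": {"country": "US", "city": "Portland"},
}

def collect(raw):
    """Stage 1: accept a raw hit and normalize it into an event dict."""
    return {"ip": raw["ip"], "visitor_id": raw["cookie"], "page": raw["page"]}

def augment(event):
    """Stage 2: enrich the event, e.g. geo augmentation from the IP address."""
    prefix = ".".join(event["ip"].split(".")[:2])
    event["geo"] = GEO_BY_IP_PREFIX.get(prefix, {"country": "??", "city": "??"})
    return event

def process(event, session_store):
    """Stage 3: fold the event into per-visitor session state."""
    session_store.setdefault(event["visitor_id"], []).append(event["page"])
    return event

sessions = {}
hit = {"ip": "203.0.113.9", "cookie": "v-42", "page": "/pricing"}
event = process(augment(collect(hit)), sessions)
```

Each stage hands a plain dict to the next, so stages can be scaled and made redundant independently, which is the property the real pipeline gets from Kafka topics between stages.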
On top of the Streams platform and data pipeline, Webtrends built a series of visualizations and APIs that allow clients to consume this data in near real-time. This enabled action systems such as Webtrends Action Center, with email remarketing, and optimization services such as Webtrends Optimize to make in-the-moment, contextual decisions about the visitor, which leads to higher retention and conversion rates for our clients. Since the Webtrends collection services were separate from processing and applications, we were able to upgrade the collection facilities to the streaming platform while still feeding our legacy batch systems.
This brings us to the next phase of our story. One of the main issues with our legacy systems was our limited ability to do ad hoc analysis. Webtrends needed a way to leverage the recently built streaming system and feed a new data store that enables complete exploration of all event, session and visitor level data. That solution became Webtrends Explore.
Any Way You Want It, That’s the Way You Need It
To provide a true ad hoc experience, Webtrends needed a processing engine that provided full insight into all data collected, in near real-time. This was not an easy task, nor was it one we were afraid to tackle. We looked at several technologies, but nothing stood out as a way to accomplish it. Innovation is one of our strengths, so we were off to slay our data dragon. We needed to put data in a format that would let us slice and pivot on it, and sum, count and perform distinct counts in a very responsive way. We did not want to predefine any data, because we did not necessarily know how our clients would want to use it.
One thing we did realize was that there would be some limit on the amount of data you can explore at any given time. So we went to our clients and settled on 13 months as the data set Explore would provide. This allowed our clients to do year-over-year analysis and look at the data that was most active and relevant to their business. Now that we understood how much data we would be dealing with, we needed a storage and processing solution that would make this data available instantly. After much research, we determined that a bitmap index-based solution would be required to give us the speed and performance we needed. Generating those indexes would require us to invest in a Big Data solution: data arriving via the streaming environment would be persisted and then processed into an index. Next, we determined the storage format and how to make it flexible enough to generate indexes from the data.
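Why a bitmap index? Here is a toy sketch (not the Webtrends storage format) of the idea: each value of a dimension maps to a bitmap in which bit i is set if event i has that value, so slicing, pivoting, counting and distinct counting all reduce to cheap bitwise operations:

```python
# Minimal bitmap-index sketch; the events and dimensions are made up.
events = [
    {"country": "US", "page": "/home"},
    {"country": "US", "page": "/pricing"},
    {"country": "DE", "page": "/home"},
    {"country": "US", "page": "/home"},
]

def build_index(events, dimension):
    """One bitmap per distinct value: bit i set <=> event i has that value."""
    index = {}
    for i, event in enumerate(events):
        value = event[dimension]
        index[value] = index.get(value, 0) | (1 << i)
    return index

country_idx = build_index(events, "country")
page_idx = build_index(events, "page")

# Count: how many US events? Popcount of one bitmap.
us_count = bin(country_idx["US"]).count("1")                      # 3

# Slice and pivot: US events that hit /home -> AND two bitmaps.
us_home_count = bin(country_idx["US"] & page_idx["/home"]).count("1")  # 2

# Distinct count: number of bitmaps kept for a dimension.
distinct_pages = len(page_idx)                                    # 2
```

Because no query shape is baked into the index, any dimension can be combined with any other at query time, which is exactly the "don't predefine the data" requirement above. Production bitmap indexes add compression on top of this idea.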
Hadoop was an obvious choice for Webtrends. We had a little experience using Hadoop with our Heat Map offering, and it would allow the storage of event, session and visitor data all in one place, a key goal we had set out to accomplish three years earlier. Leveraging Apache Kafka in the data pipeline we put in place with Streams, we could consume and ingest the data every hour. A daily indexing process would then process the data and make the generated index available to Webtrends Explore. While still feeding the legacy system, we added another product to our portfolio, one that gives our clients greater insight into their data any way they want it, when they want it, and how they want it.
Once generated, indexes would be distributed across a large Apache Spark cluster; this cluster would aggregate index results and present them back to Webtrends Explore to give clients line of sight into the data. Clients can also include the custom parameters they are interested in seeing in Webtrends Explore. This is the first step of a much longer phase of unlocking the power of Hadoop and Spark.
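The aggregation step works because bitmaps merge cheaply: each partition of the cluster can answer from its own slice of the index, and the partial results combine with a bitwise OR. The pure-Python stand-in below simulates that merge for a distinct-visitor count; the real system runs the equivalent reduction on a Spark cluster, and the small integer visitor IDs here are purely illustrative:

```python
# Simulated cluster-style aggregation of bitmap-index partials.
# Each "partition" holds a bitmap of the visitor IDs it has seen;
# merging partials is a bitwise OR, and the distinct-visitor count
# is the popcount of the merged bitmap.
from functools import reduce

def visitors_to_bitmap(visitor_ids):
    bitmap = 0
    for vid in visitor_ids:
        bitmap |= 1 << vid
    return bitmap

# Three partitions with overlapping visitors (toy integer IDs).
partitions = [
    visitors_to_bitmap([1, 2, 3]),
    visitors_to_bitmap([3, 4]),
    visitors_to_bitmap([2, 5, 6]),
]

merged = reduce(lambda a, b: a | b, partitions, 0)
distinct_visitors = bin(merged).count("1")   # 6: visitors 1 through 6
```

Note that the OR-merge deduplicates visitors seen by multiple partitions for free, which is why distinct counts, normally the expensive query in analytics, stay fast at scale.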
A Sea of Data
“The sea is everything. It covers seven tenths of the terrestrial globe. Its breath is pure and healthy. It is an immense desert, where man is never lonely, for he feels life stirring on all sides. The sea is only the embodiment of a supernatural and wonderful existence. It is nothing but love and emotion; it is the Living Infinite.” – Jules Verne, Twenty Thousand Leagues Under the Sea
At Webtrends, we believe data is everything. We need it to survive in our business; our clients need it to survive in making decisions. Webtrends provides this data as a service, and we have learned how to manage it, scale it and make it useful for clients. Data is often associated with large bodies of water: people refer to data lakes, data ponds or even data marshes, but at Webtrends, we have a sea of data. The data we collect comes from every type of source, and the Internet of Things movement has caused this data to explode. The mobile phone has drastically changed the data landscape; people are now connected 24×7, posting updates to Facebook or Twitter, checking email at night, or reviewing current events in an online news journal. We all have our digital needs, we all have data, and making sense of data is what Webtrends can do for our customers.
Since the sea of data can be turbulent and ever-changing, we needed to leverage companies like Hortonworks and their Hortonworks Data Platform to power our products. With our rate of storage growth and data collection processing, Hadoop was the only platform that would allow us to scale and provide the services we do today. Along with the Apache Spark project and its innovative way of processing data from Hadoop, we can collect, process and expose data in new ways. This focus on data makes Apache Spark and Apache Hadoop an investment area for Webtrends. We employ a Spark committer, and many of our other developers have contributed code to Spark, Hadoop and other open source projects.
Five years ago, when we set out on this journey to find the center of the data, no one could have predicted how much Hadoop would grow. Now, with the ever-growing Internet of Things, the only storage solution for the sea of data is a Hadoop ecosystem. It will power the data today, tomorrow and beyond, and Webtrends will make it available to you.