The company collects data from across channels and platforms – from podcasts to streaming TV and social media. Nielsen uses the data it collects to produce insights for companies and advertisers to help them to cover specific audiences with advertisements to accelerate their growth.
The National Audience API service provides a series of standard reference and ratings data API methods to enable programmatic requests to National Television Ratings. As the client sells the collected data, they need a way to collect it without referencing a certain implementation method. The answer was API as a flexible data view. The client APIs offer syndicated data feeds about consumer trends, media consumption, purchase habits, and more.
Nielsen wanted to reduce the delay between data availability and reporting – the delay at the time could last up to a week.
We offered an initial discovery phase during which our solution architect would visit the customer’s premises, inspect their processes, and work with their team to identify possible solutions.
Our solution architect worked with the customer’s team to inspect their existing report generation process based on a series of ETL that precomputed aggregates and moved data storages. The solution architect proposed generating reports based on the raw data from before the ETL and pre-aggregation. They shortlisted two (Apache Spark and Apache Impala), running over the Hadoop ecosystem.
The team conducted two POC R&D projects time-boxed for two months. We created two sub-teams for these POC projects. In the end, we compared results to select the optimal engine.
Thanks to the information generated by the POCs, we have begun creating an API layer over the raw data to serve as a reporting engine and generate reports in close to real time.
We have achieved our goal of near real-time report generation. We use the native functionality of the selected framework to distribute the load over several nodes and are able to use horizontal scaling and data distribution to reach acceptable performance (~30 seconds per report).