Enterprise Analytics Teams · FOUNDER TRACK RECORD

SparklineData SNAP analytics platform

Self-serve OLAP analytics on Apache Spark sustaining 10,000+ queries per second, with intelligent caching, query optimization, and a usage-based licensing model.

2016 – 2018 · Acquired by Oracle

3-5x query performance improvement

<1s cached query response time

60% reduction in compute costs

The challenge

This is a founder track record project. Before starting Lakeshore Labs, our founder served as Principal Engineer at SparklineData from 2016 to 2018, where he helped build the SNAP analytics platform described here. The work predates Lakeshore and was delivered by the SparklineData team.

The problem SNAP attacked was a familiar one in that era of big data. Enterprises had standardized on Apache Spark and parked their data in HDFS and S3 as Parquet and ORC files, but OLAP-style queries against that data took minutes to hours each. Every dashboard refresh or ad hoc slice triggered a full table scan across the cluster. Teams compensated by building pre-aggregation pipelines: nightly jobs that rolled data up into summary tables for every question someone might ask. Those pipelines were brittle, expensive to maintain, and locked analysts into whatever cuts had been precomputed. Interactive analysis, the whole point of a BI tool, was effectively impossible on large datasets.

What we built

SNAP was an acceleration layer that sat between BI tools and SQL clients on one side and the Spark cluster on the other. Queries arrived over JDBC and Spark SQL exactly as before; nothing about the client side changed. Inside the layer, three components did the work.

Cost-based query optimizer

Incoming queries passed through an optimizer built specifically for multidimensional, OLAP-shaped workloads. It used cost-based rewriting, predicate and aggregation pushdown, and semantic metadata about dimensions, metrics, and hierarchies to turn a naive full scan plan into a much narrower one before Spark ever saw it.
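To make the pushdown idea concrete, here is a minimal sketch of one rewrite rule: moving a filter beneath an aggregation when the filter only touches a group-by column. It is an illustration only, not SNAP's optimizer. The plan classes (`Scan`, `Filter`, `Aggregate`) and the function `push_down_filters` are hypothetical names, and the cost-based and semantic-metadata parts are left out entirely.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical, toy logical plan nodes (not SNAP's or Spark's classes).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    column: str
    value: str
    child: "Plan"


@dataclass(frozen=True)
class Aggregate:
    group_by: Tuple[str, ...]
    metric: str
    child: "Plan"


Plan = Union[Scan, Filter, Aggregate]


def push_down_filters(plan: Plan) -> Plan:
    """Move an equality filter below an aggregate when it only uses a group-by column.

    Filtering on a group-by column before aggregating yields the same groups
    as filtering after, but the aggregate sees far fewer rows.
    """
    if isinstance(plan, Filter):
        child = push_down_filters(plan.child)
        if isinstance(child, Aggregate) and plan.column in child.group_by:
            pushed = push_down_filters(Filter(plan.column, plan.value, child.child))
            return Aggregate(child.group_by, child.metric, pushed)
        return Filter(plan.column, plan.value, child)
    if isinstance(plan, Aggregate):
        return Aggregate(plan.group_by, plan.metric, push_down_filters(plan.child))
    return plan


# Naive plan: aggregate the whole table, then keep one region.
naive = Filter("region", "EMEA", Aggregate(("region",), "sum(revenue)", Scan("sales")))
optimized = push_down_filters(naive)
print(optimized)
```

In the rewritten plan the filter sits directly on the scan, so only matching rows reach the aggregation. A real optimizer would weigh many such rules against estimated costs before choosing a plan.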

Pattern-aware intelligent cache

A caching tier watched query patterns over time and kept the most valuable results and data fragments hot. Repeated dashboard queries and common drill-downs were answered directly from cache in under a second, without touching the cluster at all. At peak the platform sustained more than 10,000 queries per second through this path.
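One simple way to picture "keeping the most valuable results hot" is a frequency-based result cache: count how often each query is seen and only let a result displace a cached one when it has been requested more often. The sketch below assumes that policy for illustration; the source does not say which policy SNAP used. The class `FrequencyCache` and the `run_on_cluster` callback are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict


class FrequencyCache:
    """Toy result cache that favors frequently repeated queries (illustrative only)."""

    def __init__(self, capacity: int, run_on_cluster: Callable[[str], object]):
        self.capacity = capacity
        self.run_on_cluster = run_on_cluster  # hypothetical stand-in for the slow path
        self.seen: Counter = Counter()        # how often each query has been requested
        self.results: Dict[str, object] = {}  # cached results keyed by normalized query

    @staticmethod
    def normalize(sql: str) -> str:
        # Naive normalization: collapse whitespace so trivially different
        # spellings of the same query share one cache entry.
        return " ".join(sql.split())

    def query(self, sql: str) -> object:
        key = self.normalize(sql)
        self.seen[key] += 1
        if key in self.results:
            return self.results[key]          # hot path: no cluster work
        result = self.run_on_cluster(sql)     # cache miss: run the real query
        self._admit(key, result)
        return result

    def _admit(self, key: str, result: object) -> None:
        if len(self.results) < self.capacity:
            self.results[key] = result
            return
        # Evict the least-requested entry, but only for a more popular query.
        coldest = min(self.results, key=lambda k: self.seen[k])
        if self.seen[key] > self.seen[coldest]:
            del self.results[coldest]
            self.results[key] = result
```

With this policy a one-off ad hoc query cannot push a heavily used dashboard query out of the cache, which is the behavior the paragraph above describes. Invalidation when the underlying data changes is deliberately omitted here.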

Columnar OLAP store

On a cache miss, queries ran against a columnar store with OLAP indexing layered over the data lake. Instead of the full table reads of a direct Spark scan, the engine pruned down to the relevant columns and segments and handed Spark a far smaller job. End to end, query performance improved 3 to 5x, and because the cluster did so much less redundant scanning, compute costs dropped by roughly 60 percent.
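Column and segment pruning can be sketched with per-segment min/max statistics: a segment whose value range cannot overlap the query's filter is skipped without being read, and only the requested columns are touched in the segments that remain. This is a minimal illustration under that assumption; the source does not describe SNAP's index format, and `Segment`, `prune`, and `scan` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Segment:
    """Toy columnar segment: per-column values plus min/max statistics."""
    name: str
    stats: Dict[str, Tuple[int, int]]  # column -> (min, max)
    columns: Dict[str, List[int]]      # column -> values, aligned by row position


def prune(segments: List[Segment], column: str, low: int, high: int) -> List[Segment]:
    """Keep only segments whose [min, max] range overlaps [low, high]."""
    return [
        s for s in segments
        if s.stats[column][1] >= low and s.stats[column][0] <= high
    ]


def scan(segments: List[Segment], needed: List[str],
         column: str, low: int, high: int) -> List[Dict[str, int]]:
    """Read only the needed columns from segments that survive pruning."""
    rows = []
    for seg in prune(segments, column, low, high):
        for i, value in enumerate(seg.columns[column]):
            if low <= value <= high:
                rows.append({c: seg.columns[c][i] for c in needed})
    return rows


# Two segments partitioned by day-of-year; the query asks for days 40 to 50.
segments = [
    Segment("jan", {"day": (1, 31)},
            {"day": [1, 15, 31], "revenue": [10, 20, 30]}),
    Segment("feb", {"day": (32, 59)},
            {"day": [32, 45, 59], "revenue": [40, 50, 60]}),
]
print(scan(segments, ["revenue"], "day", 40, 50))
```

Here the January segment is never opened, and within February only the `day` and `revenue` columns are read. The smaller the surviving set, the smaller the job handed to the cluster.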

The architectural bet was to accelerate data in place rather than copy it into a separate warehouse. Warehouse migrations meant duplicate storage, a second ETL surface to keep in sync, and rewriting queries for a new engine. SNAP kept the data lake as the single source of truth and made the existing Spark cluster fast enough for interactive use, so established pipelines and BI integrations kept working unchanged.

How it was delivered

SNAP began as a performance engineering effort and matured into a productized platform. The team profiled real enterprise Spark deployments to find where analytical workloads actually stalled, built the core engine around those bottlenecks, and then invested in the unglamorous work that makes infrastructure adoptable: compatibility with the major Hadoop and Spark distributions, APIs for existing data pipelines, and monitoring and management tooling for production operations.

Commercially, the platform shipped with a usage-based licensing model, which aligned cost with the query volume customers actually pushed through it and lowered the barrier to initial enterprise deployments. SNAP went into production at enterprise analytics teams running it against their existing clusters. The work culminated in SparklineData being acquired by Oracle in 2018.

What shipped

An OLAP acceleration layer for Apache Spark, serving BI tools and SQL clients over JDBC and Spark SQL
A cost-based query optimizer with semantic metadata for dimensions, metrics, and hierarchies
A pattern-aware cache delivering sub-second responses on hot queries at 10,000+ QPS sustained
A columnar OLAP store over HDFS and S3 data lakes, cutting query times 3 to 5x and compute costs by 60 percent
Enterprise packaging: distribution compatibility, pipeline APIs, production tooling, and usage-based licensing
An exit: SparklineData was acquired by Oracle

For Lakeshore clients, the relevance is the pattern: when a query layer is too slow, the answer is often not a platform migration but a well-placed optimizer, cache, and storage format in front of what you already run.

Apache Spark · Scala · Java · Hadoop · OLAP · Columnar Storage
