Society’s ability to generate data at scale is well ahead of its ability to interpret it at the same rate, but a new company is changing that by looking at an old tool in a new way.
Launched in early May at the OSCON conference in Austin, Texas, Pilosa is a new generation of technology that decouples the index from data storage and optimizes it for massive scale by deploying a bitmap index on high-cardinality data, CEO Higinio Maycotte said.
Society is generating new data at a rate much faster than Moore’s Law. That volume makes it harder to interpret data at scale, as data retrieval technology has fallen behind the technology that generates it.
“It’s going to solve a major problem for everyone who works with data sets of one terabyte or more,” Mr. Maycotte said. “Pilosa makes a terabyte of data respond to queries as if it were 10 megabytes.”
Mr. Maycotte looked for enterprise open source solutions and found nothing suitable. The options were cobbled together, expensive, slow, or fragile. So Umbel set out to build its own solution, and it worked better than the team had imagined.
“We realized we solved the problem for ourselves and the rest of the world,” Mr. Maycotte said.
He convinced Umbel’s board to let him spin off the solution, and Pilosa was born.
To illustrate how Pilosa can create value out of enormous volumes of data, the team set out to find the best places in New York to get a cab. Using 1.3 billion records of start and finish times, the team mapped the data into the system, where the bitmap index normalized and converted the data into ones and zeros across 1.3 billion rows and 100,000 columns.
The results were impressive, came fast, and are highly secure, Mr. Maycotte said.
“Pilosa is behind so the security in place protects that data, which is highly abstract in its nature. It is incredibly secure.”
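The taxi mapping described above can be pictured with a small sketch. This is not Pilosa's code or API; it is a minimal illustration of a bitmap index, assuming each ride has been reduced to a pickup hour and a pickup zone. The sample rides, the function names `build_index` and `count_matching`, and the choice of attributes are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rides as (pickup_hour, pickup_zone); not real taxi data.
rides = [
    (8, "Midtown"),
    (8, "SoHo"),
    (17, "Midtown"),
    (8, "Midtown"),
    (23, "Harlem"),
]

def build_index(records):
    """Give each attribute value an integer whose set bits mark matching rides."""
    index = defaultdict(int)
    for ride_id, (hour, zone) in enumerate(records):
        index[("hour", hour)] |= 1 << ride_id
        index[("zone", zone)] |= 1 << ride_id
    return index

def count_matching(index, first_key, *other_keys):
    """Count rides that satisfy every key by AND-ing the keys' bitmaps."""
    bits = index.get(first_key, 0)
    for key in other_keys:
        bits &= index.get(key, 0)
    return bin(bits).count("1")

index = build_index(rides)
morning_midtown = count_matching(index, ("hour", 8), ("zone", "Midtown"))
print(morning_midtown)  # 2
```

Once the records are reduced to bits, a question such as "how many rides started in Midtown at 8 a.m." becomes a bitwise AND followed by a count, which is the kind of operation that stays cheap as the number of records grows.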
Pilosa’s applications are limitless, Mr. Maycotte said. While many fraud models are based on small samples, Pilosa can analyze complete data sets, revealing clearer trends much faster through better interpretation and the use of historical data, which other methods struggle to exploit.
Scientific research involving proteins is a data-intensive area, Mr. Maycotte said. Most existing models can accommodate only a small fraction of the actual proteins in the human body, but scientists can employ Pilosa’s models and capture the entire data set.
“Genomic analyses can be completed in orders of magnitude faster,” Mr. Maycotte said.
Mr. Maycotte used the simple example of determining my favorite shirt colors. Pilosa frames that as a set of yes-or-no questions, assigning a 1 or 0 to my like or dislike of every color. This binary representation is highly compressible. Should someone want to know the favorite shirt colors of thousands or millions of people, that can easily be determined, along with a host of related factors.
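The shirt-color example lends itself to a short sketch. It is an illustration under stated assumptions, not a description of Pilosa's internals: the survey answers are invented, the helper names `color_bitmap` and `run_length_encode` are hypothetical, and run-length encoding is used only as a simple stand-in to show why long runs of identical bits compress well.

```python
# Hypothetical survey of which colors each person likes; all answers are made up.
likes = {
    "ana": {"blue", "green"},
    "ben": {"blue"},
    "cy": {"red", "blue"},
    "dee": {"green"},
}
people = sorted(likes)  # fixes one bit position per person

def color_bitmap(color):
    """Return one 1/0 flag per person for a single color."""
    return [1 if color in likes[p] else 0 for p in people]

def run_length_encode(bits):
    """Collapse consecutive equal bits into [bit, run_length] pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

blue = color_bitmap("blue")
green = color_bitmap("green")

blue_fans = sum(blue)                            # people who like blue
both = sum(b & g for b, g in zip(blue, green))   # people who like blue and green
compressed_blue = run_length_encode(blue)
```

With millions of people the pattern is the same: one row of bits per color, counts by summing, and combined questions by AND-ing rows together.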
“We want to sit on top of some of the largest data sets in the world,” Mr. Maycotte said. “Our pilot projects include moonshot initiatives like cancer research. Joining and asking questions of multiple whole genomes simultaneously is exactly the kind of work Pilosa was built to help accomplish.”