Society’s ability to generate data at scale is well ahead of its ability to interpret it at the same rate, but a new company is changing that by looking at an old tool in a new way.
Launched in early May at the OSCON conference in Austin, Texas, Pilosa is a new generation of technology that decouples the index from data storage and optimizes it for massive scale by deploying a bitmap index on high-cardinality data, CEO Higinio Maycotte said.
Society is generating new data at a rate much faster than Moore’s Law. That volume makes it harder to interpret data at scale, as data retrieval technology has fallen behind the technology that generates it.
“It’s going to solve a major problem for everyone who works with data sets of one terabyte or more,” Mr. Maycotte said. “Pilosa makes a terabyte of data respond to queries as if it were 10 megabytes.”
Mr. Maycotte explained that databases consist of two parts – storage and the index on which queries are run. Instead of residing within data stores, Pilosa sits on top of them. As a bitmap index, it uses less space, so it can run in memory rather than on disk.
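To make the split between storage and index concrete, here is a minimal Python sketch of a bitmap index held in memory beside a separate record store. It is not Pilosa's API or implementation: the names `BitmapIndex`, `set_bit` and `rows_matching`, along with the three sample records, are hypothetical, and ordinary Python integers stand in for bitmaps.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy in-memory bitmap index; each (field, value) pair owns one bitmap."""

    def __init__(self):
        # A Python int serves as an arbitrarily long bitmap: bit i is set
        # when record i has the given value for the given field.
        self.bitmaps = defaultdict(int)

    def set_bit(self, field, value, record_id):
        self.bitmaps[(field, value)] |= 1 << record_id

    def rows_matching(self, *conditions):
        """Return the record IDs that satisfy every (field, value) condition."""
        if not conditions:
            return []
        result = self.bitmaps.get(conditions[0], 0)
        for condition in conditions[1:]:
            result &= self.bitmaps.get(condition, 0)  # intersect bitmaps
        ids, position = [], 0
        while result:
            if result & 1:
                ids.append(position)
            result >>= 1
            position += 1
        return ids


# The data store stays separate; the index only records which IDs match.
store = {
    0: {"city": "Austin", "plan": "free"},
    1: {"city": "Austin", "plan": "paid"},
    2: {"city": "Boston", "plan": "paid"},
}

index = BitmapIndex()
for record_id, record in store.items():
    for field, value in record.items():
        index.set_bit(field, value, record_id)

matches = index.rows_matching(("city", "Austin"), ("plan", "paid"))
print(matches)                      # [1]
print([store[i] for i in matches])  # full records fetched from the store
```

The query touches only the compact bitmaps; the store is consulted afterward, and only for the IDs that matched.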
Pilosa plans on using the power of the crowd to quickly offer solutions. It has introduced nine patents into the open-source community, where the technology develops three to five times as fast as anywhere else, Mr. Maycotte said.
“As open-source software, Pilosa is available today on GitHub. Our first version includes production-tested features, including single and multi-node index support, replication, Algorithm Plugins, Data Importer, and Basic Cluster Management. Customers can collaborate with us or pay a fee to add Pilosa to their stack and to access premium modules that we’ve built to further optimize performance.
“Our focus right now is on building a community around this software. Open-source projects live and die by the people who work on and around them.”
Mr. Maycotte describes himself as a serial entrepreneur. In 2011 he started Umbel, a company that uses data to help sports and entertainment organizations turn fans into revenue-generating customers. He learned that as companies grow, they have to make sense of huge amounts of data. In the United States, 220 million consumers can generate 100 million unique data columns.
He looked for enterprise open-source solutions and found nothing suitable. They were cobbled together, expensive, slow, or fragile. So Umbel set out to build its own solution, and it worked better than the team could have imagined.
“We realized we solved the problem for ourselves and the rest of the world,” Mr. Maycotte said.
He convinced Umbel’s board to let him spin off the solution, and Pilosa was born.
To illustrate how Pilosa can create value out of enormous volumes of data, the team set out to find the best places in New York to get a cab. Using 1.3 billion records of start and finish times, the team mapped the data into the system, where the bitmap index normalized and converted the data into ones and zeros across 1.3 billion rows and 100,000 columns.
The results were impressive, arrived quickly, and are highly secure, Mr. Maycotte said.
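The article does not describe the team's schema, so the following Python sketch only illustrates the general idea of turning ride records into ones and zeros. The column names (`zone=...`, `hour=...`), the helpers `encode` and `pickups`, and the four sample rides are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical ride records: (pickup zone, pickup hour). Not real taxi data.
rides = [
    ("Midtown", 8),
    ("Midtown", 8),
    ("SoHo", 8),
    ("Midtown", 17),
]


def encode(records):
    """Build one bitmap per column; bit i is 1 when ride i falls in it."""
    columns = defaultdict(int)
    for ride_id, (zone, hour) in enumerate(records):
        columns[f"zone={zone}"] |= 1 << ride_id
        columns[f"hour={hour}"] |= 1 << ride_id
    return columns


def pickups(columns, zone, hour):
    """Count rides that started in `zone` during `hour` with a single AND."""
    both = columns.get(f"zone={zone}", 0) & columns.get(f"hour={hour}", 0)
    return bin(both).count("1")


columns = encode(rides)
zones = sorted({zone for zone, _ in rides})
busiest = max(zones, key=lambda z: pickups(columns, z, 8))
print(busiest, pickups(columns, busiest, 8))  # Midtown 2
```

Each ride is a row and each zone or hour is a column, so a question such as "where do most rides start at 8 a.m.?" reduces to bitwise ANDs and bit counts.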
“Pilosa is behind so the security in place protects that data, which is highly abstract in its nature. It is incredibly secure.”
Pilosa’s applications are limitless, Mr. Maycotte said. While many fraud models are based on small samples, Pilosa can analyze complete data sets, revealing clearer trends much faster through better interpretation and the use of historical data that other methods struggle to exploit.
Scientific research involving proteins is a data-intensive area, Mr. Maycotte said. Most existing models can accommodate only a small fraction of the actual proteins in the human body, but scientists can employ Pilosa’s models and capture the entire data set.
“Genomic analyses can be completed in orders of magnitude faster,” Mr. Maycotte said.
Mr. Maycotte used the simple example of determining my favorite shirt colors. Pilosa turns that into a question by assigning a 1 or 0 to my like or dislike of every color. The binary system is highly compressible. Should someone want to know the favorite shirt colors of thousands or millions of people, this can easily be determined, along with a host of related factors.
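The shirt-color example can be sketched in a few lines of Python, under stated assumptions: the people, colors, and answers are invented, and run-length encoding appears only as a simple stand-in to show why long runs of identical bits compress well. The article does not say which compression scheme Pilosa uses.

```python
from itertools import groupby

people = ["Ana", "Ben", "Cy", "Dee"]  # hypothetical respondents
likes = {                             # 1 = likes the color, 0 = does not
    "blue":  [1, 1, 1, 0],
    "red":   [0, 0, 0, 0],
    "green": [0, 1, 1, 0],
}


def count_likes(bits):
    """Number of people whose bit is 1 for a color."""
    return sum(bits)


def both(bits_a, bits_b):
    """Bitwise AND: 1 only where a person likes both colors."""
    return [a & b for a, b in zip(bits_a, bits_b)]


def run_lengths(bits):
    """Run-length encode a bit list as (bit, run length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


favorite = max(likes, key=lambda color: count_likes(likes[color]))
blue_and_green = [
    person
    for person, bit in zip(people, both(likes["blue"], likes["green"]))
    if bit
]

print(favorite)                   # blue
print(blue_and_green)             # ['Ben', 'Cy']
print(run_lengths(likes["red"]))  # [(0, 4)]
```

The all-zero "red" column collapses to a single pair, which hints at why sparse binary columns can take far less space than the raw answers.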
“We want to sit on top of some of the largest data sets in the world,” Mr. Maycotte said. “Our pilot projects include moonshot initiatives like cancer research. Joining and asking questions of multiple whole genomes simultaneously is exactly the kind of work Pilosa was built to help accomplish.”