by Adam Baron.
The term “Big Data” has clearly surpassed buzzword status and is now part of the required lexicon for any topic tangentially related to data. Big Data means many different things to many different people. StarMine became part of the Big Data movement in 2011 during our initial Hadoop foray. As creators of quantitative finance models, we’d like to share our journey into Big Data and discuss the technologies we use.
Quants try to create models that predict security prices or the events that influence them. Before Big Data, quants used a variety of smaller data sources: pricing, fundamentals, estimates, ownership, foreign exchange, macroeconomics, etc., usually in historical time-series form that could easily fit into traditional databases such as MySQL, Oracle, or SQL Server. Thomson Reuters sells that content, which StarMine uses to develop predictive alpha and risk models. Python has been our tool of choice for data manipulation and R has served us well for statistical analysis and regression tasks. We’ve been firm believers in open source, because it’s free and because passionate contributor communities create a lot of interesting libraries.
Our data world got bigger when we decided to predict the probability of default using news, broker research, transcripts of earnings calls, and the text sections of financial filings. Most text mining involves translating documents into numerical vectors, where each word or phrase gets mapped to a vector index. Because Thomson Reuters has hundreds of millions of such documents, this transformation was more than Python running on a single Linux server with 32GB of RAM and 8 cores could handle. In late 2011, we launched an internal multi-tenant Hadoop cluster for use by research teams across the company. We chose Hive as our primary data store because HiveQL was similar to the traditional SQL our quant researchers already knew. HDFS (the Hadoop Distributed File System) is quirky when it comes to handling newlines; it often interprets a new paragraph as an entirely new record, so our XML and raw-text formats didn't mesh well with it. We ultimately settled on JSON as a workable format, since it escapes newlines and has a mostly flat key/value structure conducive to Hive. So we wrote some quick Python to translate our documents into JSON and, with the help of an open-source JSON SerDe (Serializer/Deserializer) for Hive, we were in business.
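A minimal sketch of that translation step, assuming illustrative field names (the actual schema isn't shown here): `json.dumps` escapes embedded newlines, so each document becomes a single HDFS-safe line.

```python
import json

def document_to_json(doc_id, text):
    """Flatten a document into a one-line JSON record for Hive.

    json.dumps escapes embedded newlines as \\n, so a multi-paragraph
    document stays on a single HDFS line. Field names are illustrative.
    """
    record = {"doc_id": doc_id, "body": text}
    return json.dumps(record)

line = document_to_json("filing-001", "Paragraph one.\n\nParagraph two.")
# The serialized record contains no literal newlines, so HDFS
# will not split it into multiple records.
```

A JSON SerDe on the Hive side then maps the top-level keys back to columns when the table is queried.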
Transforming a text document into a numerical vector requires a fair degree of text parsing. With the help of Hive user-defined functions, akin to stored procedures in the old database world, we wrote Java functions that took in text and output n-grams (collections of words) and their counts. Within this Hive UDTF, we integrated other open-source Java libraries for stemming (removing suffixes like -ed and -ing) and removed common stop words (e.g. the, and, or). After cutting our n-gram counts by half, we wrote some fairly straightforward HiveQL to calculate TF-IDF vectors per document. From another Thomson Reuters content set, we had a list of default events, and we created training labels for documents preceding those events by some window of time.
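The weighting idea itself is compact. Here is a small Python sketch of n-gram extraction with stop-word removal followed by TF-IDF, standing in for the actual Java UDTF and HiveQL (stemming omitted; stop-word list abbreviated):

```python
import math
from collections import Counter

STOP_WORDS = {"the", "and", "or"}

def ngrams(text, n=2):
    """Count n-grams in a document after dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def tf_idf(doc_counts):
    """Turn per-document n-gram counts into TF-IDF weight vectors."""
    n_docs = len(doc_counts)
    df = Counter()                      # document frequency per n-gram
    for counts in doc_counts:
        df.update(counts.keys())
    vectors = []
    for counts in doc_counts:
        total = sum(counts.values())
        vectors.append({g: (c / total) * math.log(n_docs / df[g])
                        for g, c in counts.items()})
    return vectors

docs = ["default risk", "default growth"]
vectors = tf_idf([ngrams(d, n=1) for d in docs])
# "default" appears in every document, so its IDF (hence weight) is 0;
# "risk" is distinctive to the first document and gets a positive weight.
```

The distributed version does the same arithmetic, but the counting and the document-frequency join are expressed as HiveQL over the n-gram tables.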
With a numerical vector representing each document and training labels for those vectors, we were ready to develop a machine-learning model. This supervised-learning situation called for classification, meaning we had historical data for which we knew the categories (i.e. predictive of a default vs. not predictive) and wanted to train a model to determine into which category a new document should be classified. Hundreds of millions of text documents as numerical vectors put us beyond the limits of what R could hold in memory on a 32GB RAM server. At the time, Mahout was the only distributed machine-learning package available for Hadoop, and it offered two distributed classification algorithms: Naïve Bayes and Random Forest. Our first round of evaluating model performance for each algorithm looked at accuracy, precision, and recall; because default events were much rarer than non-default events, the latter two metrics mattered most. Ultimately, we found Naïve Bayes delivered solid performance numbers and could scale much better on Mahout.
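A toy calculation (hypothetical numbers, not StarMine's) shows why accuracy alone misleads when defaults are rare, and why precision and recall carry the signal:

```python
def precision_recall_accuracy(y_true, y_pred):
    """Compute precision, recall, and accuracy for binary labels (1 = default)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, accuracy

# 100 documents, only 5 preceding a default; a degenerate model
# that never predicts "default" still looks 95% accurate.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
p, r, acc = precision_recall_accuracy(y_true, y_pred)
# acc is 0.95, yet recall is 0.0: the model missed every default.
```

On an imbalanced problem like this, recall reveals how many true defaults the model catches, and precision how many of its alarms are real.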
In late 2014, our technology team's multi-tenant cluster added Spark, so we dove into its machine-learning library, Spark MLlib. Mahout's popularity among open-source contributors had declined, so it wasn't receiving many new algorithms, while Spark MLlib was introducing a host of exciting algorithms with each new release. By this time, the Text Mining Credit Risk (TMCR) model had already been developed and released for purchase, so we began exploring the algorithms in Spark MLlib to see if they could provide an edge for our next text-mining projects. With SparkSQL we were able to pull from our existing TF-IDF tables in Hive, and with a little Scala we easily created an input format conducive to Spark MLlib. Naïve Bayes worked well, just as it did in Mahout, and we are currently exploring the newer algorithms.
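The reshaping step from Hive rows into an MLlib-friendly input can be sketched in plain Python; the `(label, indices, values)` triple below mirrors the shape of MLlib's sparse labeled vectors, but the function and field names are illustrative, not the team's actual Scala:

```python
def to_sparse_labeled_point(label, tfidf, vocab):
    """Map a {ngram: weight} dict to a (label, indices, values) triple.

    vocab assigns each n-gram a fixed vector index, so every document
    lands in the same feature space. Names here are illustrative.
    """
    pairs = sorted((vocab[g], w) for g, w in tfidf.items() if g in vocab)
    indices = [i for i, _ in pairs]
    values = [w for _, w in pairs]
    return label, indices, values

# Hypothetical vocabulary and one labeled document's TF-IDF weights.
vocab = {"credit risk": 0, "going concern": 1, "strong quarter": 2}
lbl, idx, vals = to_sparse_labeled_point(
    1.0, {"going concern": 0.7, "credit risk": 0.2}, vocab)
# idx == [0, 1] and vals == [0.2, 0.7]: only non-zero features are stored.
```

Keeping the vectors sparse matters at this scale: most documents contain only a tiny fraction of the full n-gram vocabulary.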
Hopefully that gives a sense of how StarMine is using Big Data, beyond the obligatory “We’re using Big Data” proclamation.