At Pellucid Analytics we are building a platform that
automates and simplifies the creation of data-driven chartbooks, so that it takes
minutes instead of hours to get from raw data to powerful visualizations and compelling stories.
One of the industries we are focusing on is investment banking. We help IB advisory
professionals build pitch books, and provide them with the analytical and quantitative support
they need to sell their ideas. Comparable company analysis is central to this business.
Comparable company analysis starts with establishing a peer group of companies of similar size in the same industry and region.
The problem we face is finding a scalable way to establish a peer group for any chosen company.
Approaches That We Tried
Data vendors provide an industry classification
for each company, and it helps a lot in industries like retail (Wal-Mart is a good comparable for Costco)
and energy (Chevron and Exxon Mobil), but it stumbles with many other companies. People tend to compare
Amazon with Google as two major players in the IT business, but data vendors tend to put Amazon in the retail industry, with Wal-Mart and Costco as comparables.
Company Financials and Valuation Multiples
We tried cluster analysis and k-nearest neighbors to group companies based on their
financials (Sales, Revenue) and valuation multiples (EV/EBITDA, P/E). However, the assumption
that similar companies have similar valuation multiples is wrong. People compare
Twitter with Facebook as the two biggest companies in social media, but based on their financials
they don’t have much in common. Facebook’s 2013 revenue was almost $8 billion, while Twitter’s was only $600 million.
We came up with the idea that if companies are often mentioned together in news articles and tweets, it’s probably a sign that people think of them as comparable companies. In this post I’ll show how we built a proof of concept for this idea with Spark, Spark Streaming and Cassandra. For now we use only Twitter live stream data; getting access to high-quality news data is a more complicated problem.
From a tweet that mentions both Facebook and Twitter we can derive two mentions, one for each company: for Facebook the mention is Twitter, and vice versa. If we collect tweets for all companies over some period of time and take the ratio of joint appearances in the same tweet as a measure of “similarity”, we can build comparable company recommendations based on this measure.
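The mention extraction and the similarity ratio can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: it assumes companies are referenced by cashtags such as `$FB` and `$TWTR` (the post does not say how mentions are actually detected), and it reads the "ratio of joint appearance" as the share of tweets mentioning one ticker that also mention the other. The names `MentionSketch`, `tickersIn`, `mentionsOf` and `similarity` are hypothetical.

```scala
object MentionSketch {
  type Ticker = String

  // Hypothetical record: `focus` appeared in the same tweet as `mentionedWith`.
  final case class Mention(focus: Ticker, mentionedWith: Ticker)

  // Assumption: tickers show up as cashtags of 1 to 5 letters, e.g. "$FB".
  private val Cashtag = """\$([A-Za-z]{1,5})\b""".r

  // All distinct tickers found in a tweet's text, upper-cased.
  def tickersIn(text: String): Set[Ticker] =
    Cashtag.findAllMatchIn(text).map(_.group(1).toUpperCase).toSet

  // Every ordered pair of distinct tickers in one tweet becomes a mention,
  // so a tweet with two tickers yields two mentions (one per company).
  def mentionsOf(text: String): Set[Mention] = {
    val tickers = tickersIn(text)
    for {
      a <- tickers
      b <- tickers
      if a != b
    } yield Mention(a, b)
  }

  // Share of tweets mentioning `focus` that also mention `other`.
  // This is one possible reading of the "ratio of joint appearance".
  def similarity(tweets: Seq[String], focus: Ticker, other: Ticker): Double = {
    val withFocus = tweets.map(tickersIn).filter(_.contains(focus))
    if (withFocus.isEmpty) 0.0
    else withFocus.count(_.contains(other)).toDouble / withFocus.size
  }
}
```

For example, `MentionSketch.mentionsOf("Is $FB a better buy than $TWTR?")` yields one mention with Facebook as the focus and one with Twitter as the focus, while a tweet containing a single ticker yields none.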
We use Cassandra to store all mentions, aggregates and final recommendations.
We use the Phantom DSL for Scala to define the schema
and for most Cassandra operations (Spark integration is not yet supported in Phantom).
Ingest Real-Time Twitter Stream
We use the Spark Streaming Twitter integration to subscribe to
real-time Twitter updates, then we extract company mentions and write them to Cassandra. Unfortunately Phantom
doesn’t support Spark yet, so we used the DataStax Spark Cassandra Connector
with custom type mappers to map from Phantom record types into Cassandra tables.
Spark For Aggregation and Recommendation
To come up with comparable company recommendations we use a two-step process.
1. Count mentions for each pair of tickers
After the Mentions table is loaded into Spark as an RDD[Mention], we extract pairs of tickers,
which enables a bunch of aggregate and reduce functions from Spark’s PairRDDFunctions.
With aggregateByKey and the given combine functions we efficiently build a counter map Map[Ticker, Long] for each
ticker, distributed across the cluster. From a single Map[Ticker, Long] we emit multiple aggregates, one for each ticker pair.
2. Sort aggregates and build recommendations
After the aggregates are computed, we sort them globally and then group them by key (Ticker). Once
all aggregates are grouped, we produce a Recommendation for each key in a single distributed traversal.
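The two steps above can be sketched with plain Scala collections standing in for Spark RDDs. This is not the code from the application: `seqOp` and `combOp` play the roles of the two functions that `aggregateByKey(zeroValue)(seqOp, combOp)` expects, a `Seq[Seq[Mention]]` stands in for the partitions of an `RDD[Mention]`, and the field layouts of `Mention`, `Aggregate` and `Recommendation` are assumptions made for illustration. The score used here (count divided by the focus ticker's total mentions) is likewise just one possible choice.

```scala
object RecommendationSketch {
  type Ticker = String

  // Hypothetical record shapes; the real tables may differ.
  final case class Mention(focus: Ticker, mentionedWith: Ticker)
  final case class Aggregate(focus: Ticker, mentionedWith: Ticker, count: Long)
  final case class Recommendation(focus: Ticker, comparables: List[(Ticker, Double)])

  // seqOp: fold one mentioned ticker into a per-key counter map.
  def seqOp(counts: Map[Ticker, Long], other: Ticker): Map[Ticker, Long] =
    counts.updated(other, counts.getOrElse(other, 0L) + 1L)

  // combOp: merge two partial counter maps built on different partitions.
  def combOp(a: Map[Ticker, Long], b: Map[Ticker, Long]): Map[Ticker, Long] =
    b.foldLeft(a) { case (acc, (t, n)) => acc.updated(t, acc.getOrElse(t, 0L) + n) }

  // Step 1a: build Map[Ticker, Long] per focus ticker. Each inner Seq stands in
  // for one partition; partial maps are built with seqOp and merged with combOp.
  def countMentions(partitions: Seq[Seq[Mention]]): Map[Ticker, Map[Ticker, Long]] = {
    val partials = partitions.map { part =>
      part.groupBy(_.focus).map { case (focus, ms) =>
        focus -> ms.map(_.mentionedWith).foldLeft(Map.empty[Ticker, Long])(seqOp)
      }
    }
    partials.foldLeft(Map.empty[Ticker, Map[Ticker, Long]]) { (acc, partial) =>
      partial.foldLeft(acc) { case (m, (focus, counts)) =>
        m.updated(focus, combOp(m.getOrElse(focus, Map.empty[Ticker, Long]), counts))
      }
    }
  }

  // Step 1b: emit one Aggregate per ticker pair from each counter map.
  def emitAggregates(counters: Map[Ticker, Map[Ticker, Long]]): Seq[Aggregate] =
    counters.toSeq.flatMap { case (focus, counts) =>
      counts.toSeq.map { case (other, n) => Aggregate(focus, other, n) }
    }

  // Step 2: sort all aggregates by descending count, group by focus ticker
  // (groupBy keeps the sorted order inside each group), then build one
  // Recommendation per ticker from its top entries.
  def recommend(aggregates: Seq[Aggregate], top: Int): Seq[Recommendation] =
    aggregates
      .sortBy(a => (-a.count, a.mentionedWith))
      .groupBy(_.focus)
      .toSeq
      .sortBy(_._1)
      .map { case (focus, sorted) =>
        val total = sorted.map(_.count).sum.toDouble
        Recommendation(focus, sorted.take(top).map(a => a.mentionedWith -> a.count / total).toList)
      }
}
```

Keeping the counter as a `Map[Ticker, Long]` per key means only small maps, not raw mentions, need to be merged across partitions, which is the usual reason to prefer an `aggregateByKey`-style fold over grouping all values first.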
You can check the comparable company recommendations built from the Twitter stream using this link.
Cassandra and Spark work perfectly together and allow you to build data-driven applications that are easy to scale out and can handle gigabytes or terabytes of data. In this particular case, it’s probably overkill: Twitter doesn’t have enough finance-related activity to produce serious load. However, it’s easy to extend this application and add other streams: Bloomberg News Feed, Thomson Reuters, etc.
The code for this application can be found on GitHub.