Monday, September 23, 2013

Big Data: What Worked?

"BIG DATA" created an explosion of new technologies and hype: NoSQL, Hadoop, cloud computing, highly parallel systems and analytics.

I have worked with big data technologies for several years. It has been a steep learning curve, but it has finally paid off. This post is about the big data technologies I continue to use. I will describe where they can be used in modern web architecture.




A common question is:



WHAT BIG DATA TECHNOLOGIES SHOULD I USE FOR MY WEB STARTUP?



CLASSIC THREE TIER ARCHITECTURE



For a long time software development was dominated by the three-tier / client-server architecture. It is well described and conceptually simple:



* Client

* Server for business logic

* Database



It is straightforward to figure out what computations should go where.



MODERN WEB AND ANALYTICS ARCHITECTURE



The modern web and analytics architecture is not nearly as well established. It is more like an eight-tier architecture with the following components:



* Web client

* Caching

* Stateless web server

* Realtime services

* Database

* Hadoop for log file processing

* Text search

* Ad hoc analytics system



I was hoping that I could do something closer to the 3-tier architecture, but the components have very different features. Kicking off a Hadoop job from a web request could adversely affect your time to first byte.



A problem with the modern web architecture is that a given calculation can be done in many of those components.



PREDICTIVE ANALYTICS USE CASE



It is not at all clear what component predictive analytics should be done in.



First you need to collect user metrics. In which components can you do this?



* Web servers, storing the metric in Redis / the cache

* Web servers, storing the metric in the database

* Realtime services, aggregating the user metric

* Hadoop jobs, run over the log files
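As a concrete illustration of the last option, here is a minimal single-machine sketch of the aggregation a Hadoop job over log files would compute. The tab-separated log format and field names are made up for illustration:

```python
from collections import Counter

def count_user_events(log_lines):
    """Aggregate per-user event counts from 'user_id<TAB>event' log lines.

    A toy stand-in for what a Hadoop job over HDFS log files would do at scale;
    the log format here is invented for the example.
    """
    counts = Counter()
    for line in log_lines:
        user_id, event = line.rstrip("\n").split("\t", 1)
        counts[(user_id, event)] += 1
    return counts

logs = [
    "alice\tpage_view",
    "bob\tpage_view",
    "alice\tpage_view",
    "alice\tpurchase",
]
metrics = count_user_events(logs)
```

The same map-and-count shape is what a word-count-style MapReduce job expresses, just partitioned across many machines.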



User metrics are needed by predictive analytics and machine learning. Here are some scenarios:



* If you are doing simple popularity-based predictive analytics, this can be done in the web server or a realtime service.

* If you use a Bayesian bandit algorithm, you will need a realtime service for that.

* If you recommend based on user similarity or item similarity, you will need Hadoop.
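To make the bandit scenario concrete, here is a minimal Thompson-sampling Bayesian bandit of the kind a realtime service could run. The arm names and reward probabilities are made up; a real service would update the statistics from live user feedback:

```python
import random

class BayesianBandit:
    """Thompson sampling: model each arm's success rate as Beta(wins+1, losses+1),
    draw a sample per arm, and play the arm with the best draw.

    A minimal sketch, not Akka/Storm code; arm names below are invented.
    """
    def __init__(self, arms):
        self.stats = {arm: [1, 1] for arm in arms}  # [alpha, beta] priors

    def choose(self):
        # Sample a plausible success rate for each arm; pick the best draw.
        return max(self.stats, key=lambda a: random.betavariate(*self.stats[a]))

    def update(self, arm, reward):
        self.stats[arm][0 if reward else 1] += 1

random.seed(7)
bandit = BayesianBandit(["layout_a", "layout_b"])
true_rates = {"layout_a": 0.1, "layout_b": 0.3}  # hidden from the bandit
for _ in range(2000):
    arm = bandit.choose()
    bandit.update(arm, random.random() < true_rates[arm])
```

After a couple of thousand simulated requests, the bandit concentrates its pulls on the better-performing arm, which is exactly why it needs a long-lived stateful service rather than a stateless web server.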



NOSQL AND STORAGE



There are a lot of options, with no query standard:



* Cassandra

* CouchDB

* HBase

* Memcached

* MongoDB

* Redis

* SOLR



I will describe my initial excitement about NoSQL, the comeback of SQL databases and my current view on where to use NoSQL and where to use SQL.



MONGODB



MongoDB was my first NoSQL technology. I used it to store structured medical documents.



Creating a normalized SQL database that represents a structured data format is a sizable task, and you easily end up with 20 tables. It is hard to insert a structured document into the database in the right sequence so that foreign key constraints are satisfied. LINQ to SQL helped with this, but it was slow.



I was amazed by MongoDB's simplicity:



* It was trivial to install

* It could insert 1 million documents very fast

* I could use the same Python NLP tools for many different types of documents



I FELT THAT SQL DATABASES WERE SO 20TH CENTURY.

After some use I realized that interacting with MongoDB from Scala was not as easy, even after trying different driver libraries. I also realized that it is a lot harder to query data in MongoDB than in a SQL database, both in syntax and in expressiveness.

Recently, SQL databases have added JSON as a data type, taking away some of MongoDB's advantage.
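As a small illustration of JSON inside a SQL database, here is a sketch using SQLite's JSON1 functions (assuming a SQLite build with JSON1, which modern Python bundles); the medical-document fields are invented:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")

# Store a whole structured document in one row instead of 20 normalized tables.
doc = {"patient": "p-1", "visits": [{"year": 2013, "diagnosis": "flu"}]}
conn.execute("INSERT INTO documents (body) VALUES (?)", (json.dumps(doc),))

# Query into the document with a JSON path expression.
row = conn.execute(
    "SELECT json_extract(body, '$.visits[0].diagnosis') FROM documents"
).fetchone()
```

You keep SQL's query layer while skipping the insert-in-dependency-order dance that foreign keys force on deeply nested documents.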



Today I use SQL databases for curated data, but MongoDB for ad hoc structured document data.



REDIS



Redis is an advanced key-value store that lives mainly in memory but is backed up to disk. Redis is a good fit for caching. It has some specialized operations:



* Simple aging-out of data via key expiration

* Pub/sub messaging

* Atomic increments

* Atomic list appends

* Set operations



Redis also supports sharding well: in the driver you just give a list of Redis servers, and it will send each key to the right server. Redistributing data after adding more sharded servers, however, is cumbersome.
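The client-side routing, and why redistribution hurts, can be sketched in a few lines. This is a simplified stand-in for what a sharding Redis driver does, with made-up server names:

```python
import hashlib

def shard_for(key, servers):
    """Route a key to one server from the list by hashing the key.

    A simplified sketch of client-side sharding; real drivers use
    fancier schemes, and the server names here are invented.
    """
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return servers[digest % len(servers)]

servers = ["redis-1", "redis-2", "redis-3"]
# The same key always routes to the same server.
assert shard_for("user:42", servers) == shard_for("user:42", servers)

# Adding a fourth server changes the mapping for most keys,
# which is why data must be shuffled around after growing the cluster.
grown = servers + ["redis-4"]
moved = sum(
    shard_for("user:%d" % i, servers) != shard_for("user:%d" % i, grown)
    for i in range(1000)
)
```

With naive modulo hashing, the majority of keys land on a different server after the change; consistent hashing reduces, but does not eliminate, that movement.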



I first thought that Redis had an odd array of features, but it fits the niche of realtime caching well.



SOLR



SOLR is the most widely used enterprise text search technology. It is built on top of Lucene.

It has an ecosystem of plugins that do a lot of the things you would want. It is also very useful for natural language processing; you can even use SOLR as a presentation system for your NLP algorithms.



HADOOP



Hadoop is a very complex piece of software for handling amounts of data too large to fit on one computer, and therefore beyond conventional software.



I compared different Hadoop libraries in an earlier post.



Most developers who have used Hadoop complain about it, and I am no exception: I still have problems with Hadoop jobs failing due to errors that are hard to diagnose. Still, I have generally been a lot happier with Hadoop lately. I only use Hadoop for big custom extractions or calculations over log files stored in HDFS, and I do my Hadoop work in Scalding or HIVE.



Hadoop libraries can calculate user recommendations based on user similarity or item similarity.
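The core of user-similarity recommendation fits in a few lines once the data is small. Here is a toy single-machine sketch of the computation a Hadoop job would distribute over all pairs of users; the user names and ratings are invented:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def recommend(target, others, k=1):
    """Recommend items liked by the k users most similar to the target.

    A toy version of user-similarity collaborative filtering; Hadoop is
    needed when the all-pairs similarity matrix no longer fits one machine.
    """
    ranked = sorted(others.items(), key=lambda kv: cosine(target, kv[1]), reverse=True)
    recs = []
    for _, ratings in ranked[:k]:
        recs.extend(item for item in ratings if item not in target)
    return recs

alice = {"matrix": 5, "inception": 4}
others = {
    "bob": {"matrix": 5, "inception": 5, "memento": 4},
    "carol": {"cars": 5},
}
```

The quadratic blow-up is the point: with millions of users, computing similarities for all pairs is exactly the kind of batch job that belongs in Hadoop rather than in a web request.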



Scalding



Hadoop code in Scalding looks a lot like normal Scala code. The scripts I write are often just 10 lines of code and look a lot like my other Scala code. The catch is that you need to be able to write idiomatic functional Scala.



HIVE



HIVE makes it easy to extract and combine data from HDFS. After some setup of a directory with a table structure in HDFS, you just write SQL.



REALTIME SERVICES



Libraries like Akka, Finagle, and Storm are good for long-running stateful computations.

It is hard to write correct, highly parallel code that scales to multiple machines using normal multithreaded programming. For more details see my earlier blog post.



AKKA AND SPRAY



Akka is a simple actor model implementation, taken from the Erlang language. In Akka you have a lot of very lightweight actors that can share a thread pool. They do not block on shared state but communicate by sending immutable messages.



One reason Akka is a good fit for realtime services is that you can have varying degrees of loose coupling, and all services can talk with each other.



It is hard to change from traditional multithreaded programming to the actor model: there are a lot of new actor idioms and design patterns to learn. At first the actor model feels like working with a sack of fleas; you have much less control over the flow because the computation is distributed.
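The core idiom, one mailbox per actor and no shared mutable state, can be sketched without Akka. This is a deliberately heavyweight toy (one thread per actor, where Akka actors share thread pools), and the message protocol is made up:

```python
import queue
import threading

class CounterActor:
    """A minimal actor: one private thread, one mailbox, no locks.

    Only the actor's own thread ever touches self.count; everyone else
    communicates by putting messages in the mailbox. A toy illustration
    of the actor idiom, not Akka itself.
    """
    def __init__(self):
        self.mailbox = queue.Queue()
        self.count = 0
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            msg, reply_to = self.mailbox.get()
            if msg == "increment":
                self.count += 1
            elif msg == "get":
                reply_to.put(self.count)  # reply via the sender's queue
            elif msg == "stop":
                return

    def send(self, msg, reply_to=None):
        self.mailbox.put((msg, reply_to))

actor = CounterActor()
for _ in range(5):
    actor.send("increment")
reply = queue.Queue()
actor.send("get", reply)
total = reply.get(timeout=5)  # FIFO mailbox: all increments processed first
actor.send("stop")
```

Because the mailbox serializes all messages, the counter needs no locks, which is the property that makes actors easier to reason about than shared-state threading.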



Spray makes it easy to put a web or REST interface on your service, which makes it easy to connect the service with the rest of the world. Spray also has the best Scala serialization system I have found.



TO CLOUD OR NOT TO CLOUD



For a while I thought that I would soon be doing all my work on cloud computing services like Amazon's AWS. This did not happen, but virtualization did: when I request a new server, the ops team usually spins up a virtual machine.

A problem with cloud services is that storage is expensive, especially Hadoop-sized storage.

If I were in a startup I would probably consider the cloud.



BIG AND SIMPLE



My first rule of software engineering is: KEEP IT SIMPLE.

This is particularly important in big data since size creates inherent complexity.

I made the mistake of being too ambitious too early and thinking out too many scenarios.



STARTUP QUESTION



Back to the question:



WHAT BIG DATA TECHNOLOGIES SHOULD I USE FOR MY WEB STARTUP?

A common approach is:

Use Ruby on Rails for your web server and Python for your analytics, and hope that a lot of beefy Amazon EC2 servers will scale your application when your product takes off.

It is fast to get started and the cloud will save you. What could possibly go wrong?



The approach I am describing here is more stable and scalable, but you might run out of money before you finish learning all these big data technologies.



BIG DATA NOT JUST HYPE



"BIG DATA" is misused and hyped. Still there is a real problem, we are generating an astounding amount of data and sometimes you have to work with it. You need new technologies to wrangle this data.

Whenever I see a reference to Hadoop in a library I get very uneasy. These complex big data technologies are often used where much simpler technologies would suffice. Make sure you really need them before you start; this could be the difference between your project succeeding and failing.

It has been humbling to learn these technologies, but after much despair I now enjoy working with them.