review : programming collective intelligence - toby segaran

Late last spring, maybe early summer, I picked up a copy of this book. I didn’t really have time to engage with it until a little before the fall semester started, when I included it in a class. The more I worked through the text, the more I realized that this book is a lot of fun. It’s not for the novice or those who want things more fully explained.

Still, I highly recommend it if you want to learn a great deal about mining data that is (mostly) openly accessible on the web, with the understanding that the technical details are often left to the reader, which may mean much investigation outside the text. As I said, it’s fun.

The book uses Python, with assorted third-party libraries, and assumes the reader is either comfortable with the language or capable of learning it as they follow along. I happened to be in the latter group and took this as an opportunity to learn Python.

In the introduction, Segaran points the reader to an introductory Python book and then provides a brief summary of some of the “quirkier” aspects of Python syntax: lists (arrays), dictionaries (hashmaps), blocks defined by indentation, and list comprehensions. The first three are pretty simple to understand, but list comprehensions can be a bit awkward. In general, much of the syntax goes unexplained, which can be problematic for the novice reader.
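To give a feel for why comprehensions take some getting used to, here is a minimal sketch that writes the same filter twice, once as a loop and once as a comprehension. The `ratings` dictionary and the 3.0 threshold are made up for illustration and are not taken from the book.

```python
# Hypothetical data for illustration only (not from the book).
ratings = {"Alice": 4.5, "Bob": 2.0, "Carol": 3.5}

# Loop form: collect the names whose rating is at least 3.0.
fans_loop = []
for name, score in ratings.items():
    if score >= 3.0:
        fans_loop.append(name)

# Comprehension form: the same logic in one expression.
# Read it as "name, for each (name, score) pair, if score >= 3.0".
fans = [name for name, score in ratings.items() if score >= 3.0]

print(fans_loop == fans)
```

The comprehension puts the result expression first and the loop second, which is the reverse of how the loop version reads. That ordering is most of the awkwardness.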

As the book progresses, more sophisticated uses of Python syntax and constructs appear that probably should have been given some treatment in the introduction. Basically, if you are going to discuss lists, dictionaries, and comprehensions, it’s probably worthwhile to mention, say, lambda functions.
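As a rough illustration of the kind of construct I mean, a lambda is just a small unnamed function, often passed as a sort key. The sketch below is my own; the `scores` list is hypothetical and is not an excerpt from the book.

```python
# Hypothetical (name, similarity) pairs for illustration only.
scores = [("Alice", 0.42), ("Bob", 0.91), ("Carol", 0.77)]

# The lambda takes one pair and returns its second element, so the
# list is sorted by similarity, highest first.
ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)

# The equivalent named function, for comparison.
def by_similarity(pair):
    return pair[1]

ranked_named = sorted(scores, key=by_similarity, reverse=True)

print(ranked == ranked_named)
```

A sentence or two like this in the introduction would have saved a novice reader a trip to another reference.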

Beyond the syntax concerns, which could arguably be dismissed by the introduction’s caveat, are the techniques and algorithms introduced. The descriptions of the algorithms and their purposes are generally decent in the sense that you can get a feel for how the algorithms work. Plus, the code provided by the book does help illustrate them. The problem, though, is that the algorithms and code are rather intertwined. This is a mixed blessing: the book does a good job of explaining the explicit example, but it’s not trivial to generalize.

One area in which the book could use a little expansion is how to interpret the results of a particular technique. For example, clustering is presented pretty much at face value without much discussion of how to use the end results. It would also be useful to discuss tuning the algorithms and how to balance the trade-offs between different techniques. For instance, the discussion of decision trees introduces both Gini impurity and entropy, and the second chapter, “Making Recommendations”, uses both Euclidean distance and the Pearson correlation coefficient. In both cases the book’s advice is to try both approaches and examine the results to see which is the better tool for the job.
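To make the trade-offs concrete, here is a minimal sketch of the four measures using their textbook definitions. This is my own code, not the book’s. The `1 / (1 + distance)` conversion from distance to similarity is an assumption on my part, and the two raters and the label list are invented.

```python
from collections import Counter
from math import log2, sqrt

def euclidean_similarity(a, b):
    """Similarity in (0, 1]; 1 / (1 + distance) is an assumed convention."""
    distance = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / sqrt(var_a * var_b)

def gini_impurity(labels):
    """Chance that two random draws (with replacement) have different labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Two hypothetical raters with identical taste; one rates a point higher.
harsh = [1.0, 2.0, 3.0, 4.0]
generous = [2.0, 3.0, 4.0, 5.0]
print(euclidean_similarity(harsh, generous), pearson(harsh, generous))

# A hypothetical evenly mixed set of labels.
mixed = ["yes", "yes", "no", "no"]
print(gini_impurity(mixed), entropy(mixed))
```

The two raters show one difference: Pearson treats them as perfectly correlated because it ignores the constant offset, while the Euclidean measure penalizes it. Gini impurity and entropy are both zero for a pure set and both peak for an even mix, so they tend to rank candidate splits similarly. Even a short comparison like this in the text would help a reader decide which to reach for.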

I do realize that some of these issues are just the reality of having a finite space in which to deliver the content. The book would be immense if all such conversations were had. Also, experience plays a major part in deciding how to best handle a particular analysis problem and that is hard to convey in any book, especially one such as this that is targeting readers who haven’t had much experience with the techniques.

One improvement that could be made is to add references. Obviously Segaran is drawing from a lot of experience, and his knowledge probably evolved by reading other takes on data mining and general statistics. Sharing sources at specific points in the text to direct the reader to more complete explanations and examples would be extremely valuable.

And I have to say that the book’s source code is horrific. Of course, in the interest of fairness, I readily admit that I am hyper-anal about the style of my source code and the elegance of my data structures. Also, I suppose it could be argued that Python is a prototyping tool so worrying about literate programming is not part of the approach. Still, if code has to be read by someone other than the author – as is the case for every reader of the book – clean code is desirable.
I imagine the source code had to fit a standard page width without spanning numerous pages. Such constraints probably favor compact code over readable code.

Not to be all negative, there are a lot of positive things to be said about the book. First and foremost is the content. It’s a great collection of topics. Segaran pours a lot into the book, especially the real-world applications. Each chapter is generally stand-alone and includes explicit examples that access data available on the web. I wasn’t able to access all of the data and APIs (for example, I wasn’t able to get an API key for Hot or Not), but for the most part everything worked well.

Segaran also provides a list of suggested exercises at the end of each chapter. Some are challenging. The nice part about the exercises is that they give a good starting point for reaching a little farther. I also found them useful in the class in which the book was used.

Bottom line: Though there are a few rough points, the book is an absolute blast. It’s one of the most distinctive texts I have encountered in a while, and it really opens a world of possibilities for those so inclined.
