Review: Data Quality - Carlo Batini and Monica Scannapieco

This semester I had the unpleasant experience of being told by the campus bookstore, on the day classes started, that the book I intended to use for my class was no longer in print and that used copies were not available. I had originally geared the course towards design philosophy and built it around Software Design by David Budgen. It is an excellent book, and its unavailability is incredibly frustrating.

As I was unable to quickly find a suitable replacement, I decided to take the course in a slightly different direction and discuss the notion of “quality” in more depth than I had before. To this end, I found Data Quality. After working through the text, I found it broad enough to incorporate into the class and to serve as a starting point for deeper discussions. However, it was also unexpectedly dense, a bit discontinuous and, probably most important, appears to contain significant errors, which is why I could only rate it 3.5.

I still recommend the book as I think it contains good material but it’s important to know what to expect.

The strength of the book lies in its content and, to some degree, the high-level organization of the chapters. I believe it works well as an undergraduate text (senior-level course). While students may find the details challenging, the basics are readily accessible: why data quality is important, how to establish and implement a data quality framework, and how to both handle the integration of different data sources and predict the quality of the integrated result from the quality of those sources.

The book provides plenty of references, which enable deeper discussions of the topics presented. The survey-style approach the book uses provides the opportunity to direct students to follow through by locating and reading the papers and texts in which the original models were defined. In fact, to truly understand some of the topics in detail, further reading is a must. In essence, the book provides a good point of departure.

The survey style also means that there is limited attention to examples (though they are present). I found it useful, as an assignment, for the students to develop their own examples to illustrate the concepts. While this works when integrating the text into a course, for someone reading it independently, the lack of solid examples could prove difficult. I don’t particularly see this as a negative given the point of the book.

In terms of minor weaknesses, two areas detract from the book’s quality: inconsistency among the examples (and, to some extent, the quality of the examples themselves) and the use of unnecessary technical detail. To illustrate the former, consider the quality composition discussion in section 4.2. Part of the difficulty in understanding the topic is that multiple models and techniques are used to illustrate composition in different dimensions. On one hand, this might be unavoidable, since not every model adequately handles all dimensions. However, it was difficult to keep track of which model was being applied in which situation. Perhaps greater clarity when establishing the context would make the discussion easier to follow. But I think focusing attention on one model at a time would simplify the delivery and help the reader see the application.

As for the latter, there are often instances where overly fine details lend themselves more to confusion than to illumination. Section 5.4.3 contains descriptions of several comparison functions. In the case of n-grams, the Jaro Algorithm and Token Frequency-Inverse Document Frequency, the inclusion of explicit formulas with the descriptions is rather confusing, as they are presented with limited context. For a reader who is not well-versed in a given approach, I would argue that this detracts from the point of the discussion. It would be much simpler to just describe each approach, as none seem to be explicitly invoked at a later time (or at least I don’t recall their use), and much more reader-friendly to provide a reference. In fact, in that same section, this is how Smith-Waterman is described, albeit without a reference.
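To give a sense of what such a comparison function does, here is a minimal sketch of an n-gram comparison in Python. It is an illustration only and is not the formula printed in the book: the function names and the Dice-style normalization are assumptions chosen for brevity, and other n-gram measures normalize differently.

```python
from collections import Counter


def ngrams(s, n=2):
    """Return the list of overlapping character n-grams of s (lowercased)."""
    s = s.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_similarity(a, b, n=2):
    """Dice-style similarity in [0, 1] based on shared character n-grams.

    Assumption: this normalization is one common choice, not necessarily
    the one used in the book under review.
    """
    grams_a = Counter(ngrams(a, n))
    grams_b = Counter(ngrams(b, n))
    # Multiset intersection counts each shared n-gram as often as it
    # appears in both strings.
    shared = sum((grams_a & grams_b).values())
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2 * shared / total if total else 0.0
```

Two strings that share many short substrings score close to 1, while strings with no n-grams in common score 0. That is the intuition a prose description could convey without a formula.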

The most difficult aspect of the book is the errors. I use the term “error” in the sense that, after much consideration and further research, I was not able to arrive at the conclusions presented by the authors. I fully understand that the error may be on my part, in which case I would concede that the issue reflects the authors’ lack of clarity on a given point. Errors range from something a little more involved than a simple typo, which a reader could reconcile with a little thought, to errors that lead to fundamental confusion. An example of the former is at the top of page 82, where |r1| should be |ref(r1)| (and similarly for r2). A more fundamental problem is in the description and interpretation of Figure 2.5, where I believe the area Cb should be the area bounded between the line y = c_max and the curve C(t).

As stated earlier, I think the strengths of the book outweigh the negatives, and I would hope that others do not use the weaknesses pointed out here as a reason to avoid it. As a class text, I think it illustrates to students that print can’t be taken at face value and that active reading is always a must. Plus, it affords the opportunity to direct students to correct the errors and/or explain the confusion. (But do note that I am not advocating for this style to be adopted by future authors!)
