This is a guest post by Professor Matthew Jones, from Columbia’s History department, who has been attending the course. I invited him to give his perspective on the course thus far.
Few things loom as large, as both challenge and instigation, in data mining (or machine learning or the data sciences) as the “curse of dimensionality.” In his textbook, Christopher Bishop cautions, “the reader should be warned, however, that not all intuitions developed in spaces of low dimensionality will generalize to spaces of many dimensions” (Bishop, 36). This is a major challenge, for example, for k-nearest neighbors with Euclidean distances, as Jake Hofman mentioned on Wednesday. Curses have long invited creative counter-measures, and struggling to overcome the curse has clearly been enormously generative of algorithms and practical data processes. The data mining and more applied machine learning community, indeed, often gives a technologically determinist account in which their disciplines arise of necessity from the torrent of high-dimensional data.
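One way to see why low-dimensional intuitions about "nearest" neighbors fail is a small simulation. The sketch below is illustrative only: it uses uniform random points rather than any course data, and the helper name `contrast` and the sample size are assumptions made for the example. It compares the farthest and nearest Euclidean distances from a query point as the number of dimensions grows.

```r
# Illustrative only: uniform random points, not data from the course.
set.seed(1)

# Ratio of the farthest to the nearest Euclidean distance from one random
# query point to n random points in the d-dimensional unit cube.
# (`contrast` is a hypothetical helper name chosen for this sketch.)
contrast <- function(d, n = 500) {
  points <- matrix(runif(n * d), nrow = n)
  query <- runif(d)
  # sweep() subtracts the query from every row; rowSums() totals the squares.
  dists <- sqrt(rowSums(sweep(points, 2, query)^2))
  max(dists) / min(dists)
}

dims <- c(2, 10, 100, 1000)
ratios <- sapply(dims, contrast)
print(data.frame(dimensions = dims, farthest_over_nearest = round(ratios, 2)))
```

In two dimensions the nearest point is typically far closer than the farthest one; in a thousand dimensions the two distances tend to be nearly equal, so "nearest" carries much less information.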
Another curse of dimensionality confronts anyone pursuing the data sciences. On the first day, Rachel asked us to score ourselves in a histogram of some six or seven dimensions–a piece of honest appraisal and ideological self-critique along axes we hadn’t necessarily recognized as essential. As a historian, I scored high on domain knowledge (history of science) and on communication (read: pontification), medium in stats, and just plain sad in R programming and visualization techniques. I want to understand something about the data sciences by undertaking some of the training necessary to become a practitioner, and not as an addled poser or lurking fly on the wall. So I’ll be trying all term to fit a few qualitative models from the history and sociology of science with our class and its historical antecedents as data; and as with all good modeling, I’m intent on revising these tools, including, I hope, by devising some quantitative metrics.
Intuitions developed in certain sets of dimensions, say in machine learning or in coding or in our domain-specific knowledge, might well not generalize when working simultaneously in the numerous dimensions of competency Rachel sets forth as comprising the data scientist. As in machine learning, it seems to me that the curse of dimensionality remains an unsolved, but generative, question for big data pedagogy. Introductory probability has canonical subject matter and numerous well-trodden paths; not so the data sciences. The lack of a single pedagogical solution seems to have encouraged a ramification of new approaches to teaching, most stemming from a contributing discipline, such as Andrew Ng’s well-known Coursera series on machine learning, or Andrew Gelman’s insistence on exploratory data analysis.
Values in tension
A simple on-line query quickly reveals disquiet and the occasional denunciation of data science, or of statisticians, or of machine learning types–and it doesn’t appear to be merely the low rumble of turf wars. However much they share algorithms and mathematical formalisms, the cultures around big data appear to have differing epistemological values–and not just names for things. Last week in lecture, a student asked about levels of confidence. In machine learning, we were told, levels of confidence are not a crucial part of the culture, as the discipline is focused more on prediction. Purpose and values go together. An artifact from the Mesozoic era of the data sciences, a canonical 1991 manifesto for “knowledge discovery in databases” (KDD, aka data mining), sets out the contrast of values clearly:
The text explains, “Knowledge discovery in databases, however, raises additional concerns that extend beyond those typically encountered in machine learning. In the real world, databases are often dynamic, incomplete, noisy, and much larger than typical machine learning data sets . . . . These factors render most learning algorithms ineffective in the general case. Not surprisingly, much of the work on discovery in databases focuses on overcoming these complications.” However overdrawn, this bifurcation well illustrates the point–and suggests that different pedagogy and practices will be necessary given differing epistemic values. Some will be in tension with the wonted rigor of a stats course. Which aspects of that rigor to loosen seems an unresolved and ferociously debated problem. Not all aspects, to be sure: a prominent blog quickly picked up an xkcd comic as imparting a classical statistical point essential to data mining 101. Isaiah Berlin famously proclaimed the “conflict of values may be an intrinsic, irremovable element in human life.” So too in the data sciences and nearest neighbors? Which values to focus upon, to inculcate, to teach? And with what sort of homework?
rollman<-rollman[!grep("COMMERICAL", rollman$BUILDING.CLASS.CATEGORY),]: or, a liberal art, not a mechanism
Our first problem set initially raised some eyebrows. It insisted upon using poorly organized, incomplete data, a pain long before applying any fancy or even naive algorithm. Open-ended answers! All a pedagogical device at variance with the cancerous regurgitative model that passes for education here and abroad. More typical data mining tutorials would simply have us load the “iris” data; and voilà, nice clean data, preinstalled in R, to run knn or the like on. Instead we, or at least I, muddle through with pathetic code like that above–and that muddling or wrangling is valorized. Problem sets in any course instantiate beliefs about epistemic values and the practices necessary to gain virtues instantiating them.
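For readers wrangling along, here is a hedged sketch of what the line in this section's title seems to be after, on the assumption that the goal was to drop rows whose building class category mentions "COMMERCIAL". As written, `!grep(...)` negates the row positions that `grep()` returns rather than a per-row TRUE/FALSE, so it would select no rows at all. The toy data frame below is invented for illustration and only borrows the column name from the title; the `SALE.PRICE` column and the spelling "COMMERCIAL" are likewise assumptions.

```r
# Toy stand-in for the problem-set data; these rows are invented for illustration.
rollman <- data.frame(
  BUILDING.CLASS.CATEGORY = c("01 ONE FAMILY HOMES",
                              "COMMERCIAL GARAGES",
                              "13 CONDOS - ELEVATOR APARTMENTS",
                              "COMMERCIAL VACANT LAND"),
  SALE.PRICE = c(950000, 1200000, 780000, 0),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per row, so it can be negated safely.
# grep() returns row positions, and negating those with ! selects nothing.
is_commercial <- grepl("COMMERCIAL", rollman$BUILDING.CLASS.CATEGORY, fixed = TRUE)
residential <- rollman[!is_commercial, ]

print(residential)
```

Keeping the logical vector in its own variable (`is_commercial`, a name chosen for this sketch) also makes it easy to check how many rows the filter catches before anything is thrown away.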
Nasty data isn’t the only oddity about the philosophy of the problem set, with its insistence upon matters of careful presentation to laypeople. Classical rhetoric came immediately to mind: the ancient art, once the lifeblood of undergraduate liberal arts education, now mostly forgotten, of persuading in a way appropriate to a given audience and to a given subject matter–speaking decorously, a skill celebrated in a perennial least favorite book of many a CC student, Cicero’s On Duties. At worst, Cicero was advocating spin. At best, he was advocating precisely the competencies required in transforming beliefs within any technical domain–the law, philosophy, or the data sciences–into a persuasive form appropriate to an audience. The Romans needed it, so too the data scientist–a productive curse of high dimension to be sure.
Christopher Bishop tempers his warning about the curse of dimensionality with a note of hope: “Although the curse of dimensionality certainly raises important issues for pattern recognition applications,” he explains, “it does not prevent us from finding effective techniques applicable to high-dimensional spaces” (37). Now on to find effective local techniques for our individual curses of dimensionality.
(c) 2012 Matthew L. Jones