What shall fill the void of the author?

posted by Peter Brantley

This past week I attended a talk by Professor Paul Duguid of the UC Berkeley I-School. Prof. Duguid teaches on the topic of information quality and has recently begun research on the history and development of trademarks and branding. As with his previous work, his talk raised issues that call into question the current popular embrace of open web-based systems, and his commentary is well worth sharing.

Duguid focused on several of the negative effects that user-writable web-based systems have, often without conscious design, on authoritativeness for both scholars and the public. Further, many prominent public content systems often used in scholarly communication and practice, such as online book search repositories, introduce error and bias by incorporating material from a supply that is undifferentiated in the fundamental characteristics of fidelity, relevance, and scholarly value.

Prof. Duguid holds that faith in the virtues of transparency for data integrity is, as commonly expressed, nearly a form of “whiggishness.” An untempered belief in Linus’ Law (“Given enough eyeballs, all bugs are shallow”) is an overly naïve assessment of the impact of openness on fidelity. In short, it is vital to balance the “wisdom of crowds” with remembrance of the economist’s Gresham’s law, usually stated as “Bad money drives out the good.”

Paul described the core issue using a software-based allusion: “open information systems suffer from a missing compiler problem.” In other words, there is no external validating check on the veracity or utility of information generated by many current web-based systems, of the kind a compiler performs for certain types of program code. Although the failings have diverse causes, most such web systems embody inherent structural faults through which errors fundamental to the system’s expected utility can surface.

Paul drew examples from three widely used systems.

  • Gracenote
  • Project Gutenberg
  • Wikipedia

1. Gracenote is a user-generated database of information on music albums. However, there is very little control of the metadata structure submitted for works. In mild cases, this can be annoying. For example, examining the two albums produced by the contemporary flamenco group “Son de la Frontera,” one notes that different descriptions are presented in the “artist” field, and the music genre (“Latin”) is questionable.

At its worst, however, metadata inconsistency generates unusable results when retrieved as control input into common playback systems such as Apple’s iTunes. For example, classical operas often have metadata that jumbles fields across different productions such as name, artist, title, author, and order of play. While it may be annoying to find songs out of sequence in a popular work, an inconsistent or faulty characterization of a highly structured, more strictly linear artistic production such as opera can wreck the enjoyment and even the utility of the work.
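To make the idea of a missing validating check concrete, here is a minimal sketch in Python of the sort of test such a database could run on submitted metadata. It is an illustration only: the record keys (`album`, `artist`, `track`) and the function name are hypothetical and do not reflect Gracenote's actual schema, and a real check would need to allow for legitimate cases such as compilations with several artists.

```python
from collections import defaultdict


def find_inconsistent_albums(tracks):
    """Return {album: [problems]} for albums with inconsistent metadata.

    `tracks` is a list of dicts with hypothetical keys:
    album (str), artist (str), track (int).
    """
    # Group the submitted track records by album title.
    by_album = defaultdict(list)
    for t in tracks:
        by_album[t["album"]].append(t)

    problems = {}
    for album, items in by_album.items():
        issues = []
        # Compare artist names after trimming whitespace and ignoring case.
        artists = {t["artist"].strip().lower() for t in items}
        if len(artists) > 1:
            issues.append("artist field varies")
        # Track numbers should run 1..n with no gaps or duplicates.
        numbers = sorted(t["track"] for t in items)
        if numbers != list(range(1, len(items) + 1)):
            issues.append("track order is not 1..n")
        if issues:
            problems[album] = issues
    return problems
```

A check like this catches only mechanical inconsistency; it cannot tell whether “Latin” is the right genre for a flamenco album, which is closer to the kind of judgment Duguid argues is missing.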

2. Project Gutenberg (PG) is a collection of digitized books in the public domain. While it is a tremendously useful resource, it incorporates significant faults. The errors that surface here are notably different in form: they are more structural in nature than the pseudo-random, user-contributed errors found in Gracenote. Indeed, the problems that Prof. Duguid finds in PG are common, to varying degrees, across all high-volume, heterogeneous collections of digitized books, including Google Book Search, the Open Content Alliance, and Microsoft Live Search Books. Mass digitization is cost-effective as a product offering only to the extent that it proceeds at maximal speed across an undifferentiated body of material that meets the relevant established criteria. PG, for example, uses public domain status as a critical gating function; Google Book Search may have gating functions controlled by public domain status, or by other conditions, typically bound by copyright, that vary by source.

Opportunistic content selection results in inadvertent but potentially highly misleading errors. Available or chosen editions may be bowdlerized from the original, or may otherwise not be authoritative, and are presented without any notice that other editions might exist, let alone be more appropriate. Duguid cites the example of the novel Pan, written by the Norwegian author Knut Hamsun at the close of the nineteenth century. The original work is explicit in its sexual innuendo, but both PG and Google Books present editions that were highly sanitized by editors concerned with the morality and mores of the time in which they prepared their published editions, thereby altering the tone and interpretation of the novel. The edition presented by PG, at least, is finely and accurately prepared, with high fidelity to its source; unfortunately, as with Google Books, it is a profoundly misleading copy, misrepresentative of the author’s intent. Because most readers will not even know that alternative editions exist, they will assume such an altered edition to be an appropriate one.

Duguid surfaces a second form of error, one discovered among works by Trollope, using Google Book Search. Advertisements inserted in the back matter of various editions of Trollope’s books are scanned and indexed on a par with the text of the novels themselves, with no algorithmic intervention halting the indexing procedure at the terminus of the artistic creation. Since these advertisements are common to many books published in the same period by this publisher, the text of the advertisements ranks high in Google’s “popular phrases” even though it has nothing to do with the novels in question, except as a cultural artifact. In this case, the universalistic assumption of the “novel” as a product largely free of commercial content is hazardous in its implications for online search and retrieval.
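As a rough illustration of what an algorithmic intervention might look like, the sketch below flags word sequences that recur across many different titles, on the assumption that text shared verbatim by unrelated books is more likely publisher back matter than part of any one novel. This is not a description of how Google Book Search works; the function name, the n-gram length, and the `min_share` threshold are all hypothetical choices made for the example.

```python
from collections import Counter


def shared_phrases(books, n=4, min_share=0.5):
    """Return word n-grams that occur in at least `min_share` of the books.

    `books` maps a title to its full scanned text (hypothetical input).
    Phrases repeated across many titles are candidates for exclusion
    from per-book phrase rankings.
    """
    doc_freq = Counter()
    for text in books.values():
        words = text.lower().split()
        # Use a set so each book counts a given phrase at most once.
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_freq.update(grams)
    threshold = min_share * len(books)
    return {" ".join(g) for g, count in doc_freq.items() if count >= threshold}
```

A filter of this kind would also catch phrases that are simply common in the language, so in practice it could only suggest candidates for a human or a further check to review.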

3. Wikipedia. There are many examples of persistent errors on Wikipedia. Duguid highlights problems with the entry for the novelist Daniel Defoe. Noting many errors in the entry describing the author, Prof. Duguid recounts how he attempted to correct the entry, only to be treated as a vandal. Prof. Duguid also provides documentation suggesting that the Defoe entry has generally worsened in its historical accuracy over time, rather than improving. He also cites a litany of relatively minor mistakes in the article on the writer James Joyce. None of the corrections he enumerated have been accepted.

Perhaps more insidious, user-submitted information that is correct at the time and place it is entered can be stranded on Wikipedia and left to wither into inadvertent falsity. As an example, data describing the Irish cabinet that appears in the context of other entries (as opposed to the root Wikipedia entry on the Irish cabinet) rapidly becomes obsolete as governments change. There is no easy way to surface such discrepancies across the volume of material. This type of error produces strands of growing and often unobserved information decay over time.

Taken together, these different types of errors in online content systems show that the majority of problems are inadvertent, the exception being the persistent and nagging vandalism of Wikipedia articles. Duguid observes that introduced errors such as those in Wikipedia are sometimes frivolous, albeit sometimes in line with Russell’s omission of Wittgenstein from his History of Western Philosophy (“a scandal, but a willful not a bureaucratic one,” as noted by Frederic Raphael).

Errors such as these are modest in number, and the capacity to produce them is generally rare; more profound problems surface from fundamental tenets in the design of the systems themselves. To paraphrase Chaucer, the dominant claim of open content systems has been that experience, not authority, is the quality that should be sought.

Duguid observes that the appeal to correcting information by exposing it to open inspection is a recurring theme through history. It was an explicit, designed aim of several prominent scholarly efforts, including, for example, the transmittal of scientific drafts and treatises by the Royal Society. People have often attempted to solicit and share comment, advice, and information in whatever ways a newly prevalent communication technology has made available. Indeed, one might describe the initial exploitation of new communication technologies as almost invariably falling into an embrace of popular engagement; new communication technology presents a revolution of a sort.

Engagements with openness almost inevitably decay into governing regimes that impose more stringent information order and control. The consequences of early openness in almost all historical forums instigate a slow march toward authority, moderation, and selection as information noise, graffiti, and what we now call spam increase. Openness becomes gradually overwhelmed by insidious exploitations, not usually malevolent ones, designed for purposes orthogonal to the original premise of the information communication systems themselves.

Although not called out by Prof. Duguid, his examples of mass-digitized book systems also illustrate a different form of entropy. Here, the architectures of information delivery make implicit errors increasingly transparent over time and as their use grows. These contradictions can only be corrected through an assiduous assignment of labor and inspection. The conundrum facing both providers and users of these information platforms is at what point such intervention is considered desirable in light of its cost.

Prof. Duguid closes by observing that expertise cannot speak for itself; it cannot establish its own truth through proclamation. This is arguably one reason why human societies have established and sustained institutions. Paul cites Foucault’s 1969 essay, “What is an author?”, which speculates about what happens with the disappearance of authors. In other words, what enters to fill the structural gap in the cycle of information construction, use, and criticism once the author departs? Creativity is not a sea that fills voids in its essence with an onrush of wave and foam; the creation itself must be defined.

In sum, Duguid feels that the issue of open vs. closed systems is often not the right question. We must ask instead, “where does open work, and why?” or “where doesn’t it work, and why not?”

Finally, most importantly, we must ask, “What do we need to do in order to make it work?”


Social Science Research Council - 810 Seventh Avenue - New York, NY 10019 - USA | P: 212.377.2700 | F: 212.377.2727 | E: info@ssrc.org