- Introduction
- Text Representation
- Text Analysis Techniques and Tools
- Electronic Text Access
- TAPoR and the Knowledge Society
- Researchers involved with TAPoR
For most people an electronic text (e-text) is a digitally formatted document, such as a Web page or e-mail message, that is written, retrieved and read. For a humanities researcher, however, an e-text is much more than something to be retrieved: it is evidence for careful analysis. For centuries, humanities scholars have studied the meaning, form and use of language. Electronic texts are the evidence humanists use for research, and the computer provides a new tool for which we are developing techniques for studying texts. These techniques in turn are generating new insights into the vast body of textual evidence we use to understand ourselves, our culture, and our future. Text analysis can be used to study patterns of language use, to prepare dictionaries from full-text databases that help us use our languages, to find themes in literature, to compare genres of works, to compare the style of authors, to identify authors of disputed texts, and to study our history through its documents. There are three ways in which computing humanists use electronic texts in scholarly work.
Text Representation. First, we learn through the careful preparation of electronic editions by choosing what to represent and what multimedia objects (like the images of a manuscript or audio clips of an oral interview) need to be incorporated into an edition. Ours is the age that is moving our textual heritage into digital form, which is why research in this area is of vital importance.
Text Analysis Techniques and Tools. Second, we build and adapt computer software to implement research techniques with which we can ask interesting questions of electronic texts. As a critical mass of electronic texts becomes available, we are developing new techniques for analyzing and comparing them.
E-Text Access. Third, we develop scholarly environments based on electronic texts for research teams and the larger community to study issues through the aggregation of relevant tools, texts, and media. To answer many questions a single text or tool does not suffice, nor do we do research alone. Instead we prepare environments for collaborative research.
When texts are digitized and represented electronically we are preparing new interpretations, and the choices made by the researcher have a long-term effect on the use of the resource. TAPoR will support projects building multimedia-rich e-texts for analysis, projects creating texts from multimedia assets, and projects creating new e-texts from existing e-texts.
At Victoria, Michael Best's Internet Shakespeare Editions, an international research project, aims to make scholarly, fully annotated texts of Shakespeare's plays available in a form native to the Internet. Best and Siemens (Malaspina) produce new, fully refereed editions of Shakespeare against a background of quarto and folio transcriptions, scholarly papers, biographical materials, and performance records, drawn both from historical collections and archives and from current productions of participating companies, festivals, and theatre departments. Similarly, at McMaster, Christie Carson is working on performance texts of Shakespeare published by Cambridge UP that combine an archive of material about the textual history and performance history of the play with a range of texts that have been used in performance. Electronic editions like these explore innovative uses of multimedia extensions that are possible only in digital form, with hypermedia. TAPoR will provide Best, Carson and associated researchers with access to a suite of tools with which to manage assets, prepare multimedia-rich editions and analyze these editions.
The Bertrand Russell Research Centre (Griffin) at McMaster is publishing The Collected Papers of Bertrand Russell. Russell (1872-1970) is one of the twentieth century's major philosophers, and in a parallel intellectual life as social critic, political thinker, peace activist and humanist, he addressed many issues of vital import to the history of the twentieth century. Since its launch in 1980 BRRC has published fourteen of an anticipated thirty-three volumes of Russell's writings. Over the years the project has used different typesetting schemes to prepare its volumes, and it now needs tools that can bridge these schemes in order to create consistent electronic texts for editorial and research aids such as indexes and concordances. BRRC is also preparing to digitize the approximately 40,000 letters in the Russell Archive. The digitization and transcription of these letters represent an enormous task for which scanning infrastructure and servers are needed.
At Victoria, Alberta, and New Brunswick, researchers are producing electronic texts that are derived from multimedia records. Three researchers in Linguistics at Victoria (Hukari, Carlson, and Saxon) are working on digitizing collections of audio-taped interviews with Coast Salish, Dogrib, and Spokane band members, many of whom are now dead. This irreplaceable oral-text collection is central to the understanding of west-coast aboriginal culture and history and will be a major resource for retrieval and analysis given appropriate infrastructure to support it. It can only be accessed through a text portal that can handle digitized audio assets alongside texts and can handle the intellectual property constraints that arise when working in this field.
Digitizing texts provides a fascinating opportunity to generate new texts from old texts. Researchers at Toronto, Alberta, and New Brunswick are working on different models for text generation.
From 1996 to 1999, Ian Lancashire at Toronto published freely on the Web the Early Modern English Dictionaries Database (EMEDD), over 200,000 word-entries from 16 dictionaries and hard-word glossaries from 1530 to 1657. The EMEDD is being expanded now into the Lexicon of Early Modern English (LEME). It assimilates language data from hundreds of printed and manuscript sources from 1480 to 1700. From this database will be generated a dictionary-like interface to what Shakespeare's contemporaries said about their own language. TAPoR would provide the tools with which to implement this model.
The Dictionary of Old English (DOE) (Healey), one of the most important projects currently underway in English historical linguistics, is based on a comprehensive examination of the surviving evidence. The DOE has assembled a text database of at least one copy of every extant English text between 600 AD and 1150 AD. This 40 megabyte text database (approximately six times the collected works of Shakespeare) is marked up following the 1994 Guidelines issued by the Text Encoding Initiative. The project needs tools that can perform sophisticated searches on a database of this size. For example, the spelling of a word like "world" is inconsistent through texts of this period. "World" can be spelled "woruld" or "weorold" or "weruld". As is apparent, the ability to do regular expression searches would greatly enhance the research capability of the DOE Corpus. Tools like TACT that can do sophisticated analysis cannot handle corpora of this size. The DOE researchers need tools that can use the inserted markup, can work with large text databases, and are designed to survive for the duration of a project that will continue for at least another decade. Further, the published dictionary itself is being prepared as a text database with markup indicating the different fields of each dictionary entry. DOE needs tools that can search the highly structured information in their electronic dictionary and can connect this new database with the corpus of texts from which it draws its quotations.
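The spelling-variant problem described above can be illustrated with a short regular expression sketch. This is only an illustration and not a DOE tool: the pattern, the function name `find_variants`, and the sample string are assumptions made for this example, and the pattern is deliberately permissive, so it would also accept forms such as "werld" that are not among the spellings quoted above.

```python
import re
from collections import Counter

# Hypothetical pattern covering the spellings quoted in the text
# ("woruld", "weorold", "weruld") plus modern "world".
# w + (o | eo | e) + r + optional (u | o) + ld
WORLD_VARIANTS = re.compile(r"\bw(?:o|eo|e)r(?:u|o)?ld\b", re.IGNORECASE)


def find_variants(text):
    """Return every whole-word match of the variant pattern, in order."""
    return [match.group(0) for match in WORLD_VARIANTS.finditer(text)]


def count_variants(text):
    """Count matches per spelling, ignoring case."""
    return Counter(form.lower() for form in find_variants(text))


if __name__ == "__main__":
    # Invented sample string, not a quotation from the DOE corpus.
    sample = "woruld weorold weruld world word"
    print(find_variants(sample))
    print(count_variants(sample))
```

A single pattern like this lets one query retrieve all the listed spellings at once. Inflected forms are not covered, since the pattern requires a word boundary directly after "ld"; a real search over a marked-up 40 megabyte corpus would also need indexing and markup awareness that this sketch does not attempt.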
Healey, of the Dictionary of Old English, and Lancashire, of the Lexicon of Early Modern English, with Brian Merrilees and Russon Wooldridge, of the Department of French, have a common research agenda as they are conducting lexicographical research using electronic texts. Together they develop historical and early dictionaries of the English and French languages. With access to sophisticated text-analysis tools they will create innovative electronic resources for mapping the English and French languages, and, as mentioned above, develop technologies for making texts generate their own lexicons from a huge text library of the period.
Burk and Fisher at New Brunswick are working on problems of automatically generating useful metadata (information about a text that can be used to find it in a larger corpus of texts, or information that allows one to extract relevant components) from well-encoded texts. Miall (Alberta) and Anthony J. Harding, two well-known Romanticism scholars, are working on generating indexes for Coleridge from e-texts. Clements (Alberta, Project Director) and Brown (Guelph) are members of a major SSHRC-funded collaboration, the Orlando Project, an electronic textbase that will offer a literary history of British women in the form of biographical, historical, and literary-critical material made searchable and retrievable by appropriate access tools now being developed by the project. Orlando's impending delivery of its first version in 2003 would benefit greatly from a portal that not only makes the materials available to researchers but also allows the textbase to continue to expand with contributions from external scholars. All of these projects are adapting or need to adapt text analysis tools to their unique content so as to generate new research and new research tools.
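As a rough illustration of generating metadata automatically from a well-encoded text, the sketch below reads a few fields from a header using Python's standard `xml.etree.ElementTree` module. It is not the method used by any of the projects named above. The sample document is invented, namespace-free, and only loosely modelled on a TEI-style header; the function name `extract_metadata` and the choice of fields are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Invented, simplified TEI-like sample (no namespaces); an assumption for this sketch.
SAMPLE = """<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample Poem</title>
        <author>Example Author</author>
      </titleStmt>
      <publicationStmt>
        <date>1800</date>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>First paragraph.</p><p>Second paragraph.</p></body></text>
</TEI>"""


def extract_metadata(xml_string):
    """Build a small metadata record from the header and body of the document."""
    root = ET.fromstring(xml_string)

    def text_of(path):
        # Return the stripped text of the first matching element, or None.
        node = root.find(path)
        return node.text.strip() if node is not None and node.text else None

    return {
        "title": text_of("./teiHeader/fileDesc/titleStmt/title"),
        "author": text_of("./teiHeader/fileDesc/titleStmt/author"),
        "date": text_of("./teiHeader/fileDesc/publicationStmt/date"),
        # A derived field: how many paragraphs the body contains.
        "paragraphs": len(root.findall("./text/body/p")),
    }


if __name__ == "__main__":
    print(extract_metadata(SAMPLE))
```

The point of the sketch is that consistent encoding makes both descriptive fields (title, author, date) and derived ones (a paragraph count) recoverable without manual cataloguing. Real encoded texts would typically need namespace handling and much richer field mappings.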
Research in the humanities is undergoing a radical transformation as computers allow us to create new types of analysis tools that can help ask new questions and generate results not possible from print resources. A number of the projects associated with TAPoR are creating innovative electronic texts that need new tools to take advantage of their features. For this reason a significant research thrust of TAPoR is in the area of research techniques and tools.
The Performance in Victorian Hamilton project (Hall and Rockwell) is gathering a database of information about performances in Hamilton from 1846 to 1896 that includes the full text of ads, reviews, and letters in a rich database that will allow researchers to closely follow entertainment in a Canadian city in the Victorian age. This, and projects like Orlando, are enabling new forms of research through hybrid configurations of database and text analysis tools. The TAPoR infrastructure will benefit these and other projects by further developing, disseminating, and generalizing such tools.
At McMaster, researchers in French and Multimedia (Rockwell, Sevigny, Jeay, Mactavish) are developing multimedia research works that combine streaming audio and video with textual documentation into hypertexts suitable for new forms of scholarship. Jeay and Rockwell, for example, are building an SSHRC-funded hypertext of French medieval poetry where there is a significant use of lists as a rhetorical trope. These poems are being combined into a hypertext that allows researchers to compare lists of foods, clothing, curses, etc., and that will be combined with multimedia reenactments of the oral poetry. Likewise, at Victoria, researchers are working with oral records that need appropriate hybrid tools that can analyze multimedia-enriched texts. TAPoR will provide the infrastructure to use tools to study such multimedia hypertexts and to create a scholarly environment for others to study them.
Integral to TAPoR is providing an environment that brings together researchers in the humanities with computer scientists and information scientists to develop new techniques for text retrieval, text discovery, and text representation. For this reason an important component is an interaction lab at Toronto (Toms) where empirical research into the use of text environments can be conducted. TAPoR presents a unique opportunity to study how humanities scholars access and use e-text and to explore and understand the best approaches and techniques for presenting e-texts. The infrastructure enables a national test bed, while locally the tools are provided for systematic data collection and analysis, neither of which can currently be done. Elaine Toms and others (Andrew Clement, Wendy Duff and David Mojeska) at the Faculty of Information Studies will address these issues.
At McMaster University, the Humanities Computing Centre develops text-analysis tools such as TACTWeb. Rockwell has also developed visualization tools like SIMweb that, supported by TAPoR infrastructure, can be scaled out for general use. TAPoR infrastructure will allow Nickerson (New Brunswick) to investigate adaptations of spatial data structures supporting fast search of large-scale, hierarchical text and image databases used for text analysis. Nickerson and graduate students are currently investigating how to combine the best known text search data structures with the best known spatial data structures for large-scale integrated range search (text plus geographical range). Having the distributed TAPoR portals will allow them to test such an integrated search data structure in a distributed environment.
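To make the idea of an integrated text-plus-geographical range search concrete, here is a deliberately naive baseline. It is not the data structure under investigation by Nickerson's group, which the text does not describe. The class `TextSpatialIndex`, its methods, and the sample records and coordinates are all hypothetical: an inverted index narrows the candidates by term, and a simple linear bounding-box check stands in for a real spatial data structure.

```python
from collections import defaultdict


class TextSpatialIndex:
    """Naive baseline: inverted index for terms plus a linear bounding-box filter."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of document ids
        self.locations = {}                # document id -> (longitude, latitude)

    def add(self, doc_id, text, lon, lat):
        """Index a document's terms and record its point location."""
        self.locations[doc_id] = (lon, lat)
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term, min_lon, min_lat, max_lon, max_lat):
        """Return sorted ids of documents containing `term` inside the box."""
        hits = []
        for doc_id in self.postings.get(term.lower(), ()):
            lon, lat = self.locations[doc_id]
            if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
                hits.append(doc_id)
        return sorted(hits)


if __name__ == "__main__":
    index = TextSpatialIndex()
    # Invented records with approximate, illustrative coordinates.
    index.add("a", "theatre review", -79.87, 43.25)
    index.add("b", "theatre ad", -123.37, 48.43)
    index.add("c", "concert review", -79.90, 43.30)
    print(index.search("theatre", -80.0, 43.0, -79.0, 44.0))
```

The cost of this baseline grows with the number of documents that contain the term, regardless of how few fall inside the box. That is the kind of inefficiency a combined text and spatial structure would aim to remove.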
Research in the humanities, as in other disciplines, is often done by networked teams that need access to research environments where they can collaborate and document their research. TAPoR brings together large collaborative research projects that need the infrastructure to support their interaction, the infrastructure to allow them to bid on research contracts, and the infrastructure to allow them to study access on a national scale.
At the Université de Montréal, LexUM, led by Daniel Poulin, is studying the use of technology to better the circulation of legal information. Since it set up the first Web site for Law in Canada in 1994, LexUM has been at the forefront of using technology to open access to the law through innovative research resources. The major achievements of LexUM involve massive on-line publishing of case law (Supreme Court, Federal Court, Tax Court and many others), of legislation, and of multi-lingual international law collections. LexUM is also involved in research into standards development for the Canadian judiciary (Citation and Preparation Standard). With 10 years of experience, LexUM is one of very few teams in the world specialized in this field. Team members carry out research contracts and subsidized research projects that will benefit from stable and well-supported infrastructure. In this context, TAPoR will provide a way for LexUM to exchange and share expertise with other Canadian specialists in automated electronic text tagging and metadata extraction and representation.
At McMaster, Coleman set up and became the first Director of the Institute on Globalization and the Human Condition in 1998. He has built a membership of some 30 scholars from the Faculties of Humanities and Social Sciences, worked with these scholars to build a research agenda for the Institute, solicited the interest in this agenda of an additional 20 scholars from across Canada, and begun to build relationships with globalization centers outside Canada. The Institute has been invited by SSHRC to submit a full MCRI proposal in this area. The proposed research of this team would be coordinated and aggregated through an e-text and multimedia portal (developed by Rockwell and Mactavish) which will deliver an Electronic Encyclopedia of Globalization. To do this the Institute, like the Orlando project, will benefit from access to a collaboration environment where abstracts, research summaries, bibliographic references, maps, time lines, and other information can be input over the network and then reviewed and delivered back to the research community.
The infrastructure afforded by TAPoR will allow the Electronic Text Centre at New Brunswick to further its research into advanced metadata architectures. Currently, New Brunswick serves the Canadian e-learning repository community and Canada's SchoolNet by providing metadata systems adapted to the specific needs of public education. The research challenges we have faced in this area have been many: how to address relationships among electronic objects, how to express adequate levels of granularity in indexing, how to manage copyright information, and how to create a system tailored to end user needs that also meets sophisticated indexing demands. Implementing a metadata architecture for TAPoR covers many of the same issues but poses one serious research question that is not yet addressed: how best to facilitate resource discovery of advanced research materials in the
TAPoR brings together a significant network of researchers with an established record of internationally recognized work in the field. TAPoR is also being designed for the novice computer researcher. The greatest impediment holding back the sophisticated use of computer research resources and tools is their inaccessibility in the research community. When people can't experiment with computer tools they don't understand what others are doing. For the humanities to embrace in an appropriate fashion the power of computer-assisted techniques there needs to be a graceful entry portal for scholars at large. There has to be a place researchers and students can turn to for a virtual workbench set up and maintained by scholars in their field. Thus, TAPoR is not just a portal for those committed to such research; it is a portal for the expertise of Canada's first generation of computing humanists to be shared with the next generation of scholars who, whatever else happens in the humanities, will be using electronic texts more and more.
The knowledge society now depends in critical ways on electronic texts because human readable text is still the major form in which we handle information. New ways to create, represent, manage, and retrieve electronic text in its various forms would represent a major advance for both the commercial and academic communities. In particular, innovative methods for managing and analyzing text on the web are now urgently needed as we are threatened with a rising tide of information. TAPoR will achieve this end by drawing on advanced research in history, information studies, linguistics and literatures, law, metadata, and multimedia to better our ability to draw knowledge from information.
TAPoR will provide infrastructure to develop new techniques for the analysis of meaning in the "Semantic Web", which Tim Berners-Lee, the director of the WWW Consortium and the inventor of the Web, identified as the next evolution of the Web. Words, the primary building blocks of human thought, are also the most basic unit for the Web and scholarly text analysis. Unlike the precision of chemical elements and the genomic code, words are ambiguous, their boundaries uncertain, their contexts crucial for determining sense. Web search engines illustrate the challenges of representing human language and deriving meaning from dense, meaning-laden language. How can we unpack meanings to enable fast retrieval of complex texts and to customize that retrieved text into a form that is relevant and useful to its audience, and do so accurately, efficiently and elegantly? This is a challenge of humanities research, one that addresses how culture is conveyed, how personal and professional communication is enabled, how law is interpreted and how basic literacy is assisted or impeded. Imagine a vertical portal devoted to text analysis for the scholarly community with access to the tools, exemplary texts, best-practice solutions, documentation, interactive training and a network of peers. Such a portal would benefit the experienced researchers working in isolation, the next generation of textual scholars who need to be prepared for serious work with digital materials, and the larger community struggling to make knowledge out of a sea of information that is still mostly textual.
- Michael Best, University of Victoria (Department of English)
- Susan Brown, University of Guelph (Humanities)
- Joanne Buckley, McMaster (Humanities)
- Alan Burk, University of New Brunswick (Electronic Text)
- Terry Butler, University of Alberta (Teaching and Learning Centre)
- Patricia Clements, University of Alberta (Humanities)
- William Coleman, McMaster University (Political Science)
- Susan Fisher, University of New Brunswick (Electronic Text)
- Nicholas Griffin, McMaster University (Philosophy)
- Frederick Hall, McMaster (Humanities)
- Antonette Healey, University of Toronto (Centre for Medieval Studies)
- Madeleine Jeay, McMaster (French)
- Ian Lancashire, University of Toronto (English)
- Andrew Mactavish, McMaster (Humanities)
- Brian Merrilees, University of Toronto (Humanities)
- David Miall, University of Alberta (English)
- Bradford Nickerson, University of New Brunswick (Faculty of Computer Science)
- Daniel Poulin, Université de Montréal (Droit)
- Geoffrey Rockwell - project leader, McMaster University (School of the Arts)
- Ray Siemens, Malaspina University College (Humanities)
- Stéfan Sinclair, University of Alberta (Humanities Computing)
- Elaine Toms, University of Toronto (Faculty of Information Studies)
- Russon Wooldridge, University of Toronto (Humanities)
