Rationale for OpenText.org

What is the rationale for undertaking OpenText.org as a project? In a nutshell, OpenText.org is a web-based initiative to provide an annotated corpus of Greek texts and tools for their analysis. The long-term goal of the project is to construct a representative corpus of Hellenistic Greek (including the entire New Testament and selected Hellenistic writings of the same period) to facilitate linguistic and literary research on the New Testament documents. For most of our users, this succinct answer needs further expansion and clarification, which we provide by answering the following two questions.

What Is a Corpus?

First, what is a corpus, and is there a distinction between a text and a corpus? A text is basically a written discourse that is considered a unit. For instance, the epistle to the Romans is a text, and so are the Gospel of John and the third epistle of John. A corpus, however, is not simply a collection of texts. A corpus “seeks to represent a language or some part of a language.” Thus, a corpus consists of an intentional grouping of particular texts, selected according to specific criteria. For example, one could collect a corpus of the sayings of Jesus or of the letters attributed to Peter. There is no reason why all the New Testament texts, or any combination of them, cannot be considered a corpus. As used here, a corpus is “a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration.” To make it machine-readable, the Greek text of the New Testament is encoded in Extensible Markup Language (XML), which is designed to label data clearly and to enable its users to access, manipulate, and repackage that data with ease.
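
To make the idea of a machine-readable, labeled text more concrete, here is a minimal sketch in Python. The element and attribute names (`text`, `clause`, `w`, `lemma`, `pos`) and the annotation values are assumptions made up for this illustration; they are not the actual OpenText.org markup. The sample encodes the opening words of 3 John.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag names, attribute names, and annotation values
# are illustrative assumptions, not the OpenText.org annotation scheme.
SAMPLE_XML = """<text id="3John">
  <clause id="c1">
    <w lemma="ὁ" pos="article">ὁ</w>
    <w lemma="πρεσβύτερος" pos="adjective">πρεσβύτερος</w>
    <w lemma="Γάϊος" pos="noun">Γαΐῳ</w>
    <w lemma="ὁ" pos="article">τῷ</w>
    <w lemma="ἀγαπητός" pos="adjective">ἀγαπητῷ</w>
  </clause>
</text>"""


def read_words(xml_string):
    """Return one record per annotated word, keeping its text and clause ids."""
    root = ET.fromstring(xml_string)
    words = []
    for clause in root.iter("clause"):
        for w in clause.iter("w"):
            words.append({
                "text_id": root.get("id"),
                "clause_id": clause.get("id"),
                "form": w.text,
                "lemma": w.get("lemma"),
                "pos": w.get("pos"),
            })
    return words


words = read_words(SAMPLE_XML)

# Because every word carries explicit labels, it can be accessed and
# repackaged by label, for example by pulling out all the articles.
articles = [w["form"] for w in words if w["pos"] == "article"]
print(articles)
```

Because the labels are explicit in the markup, the same file could be queried by lemma, by part of speech, or by clause without any change to the text itself.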

What Are the Characteristics of Corpus-Based Analysis?

The essential characteristics of corpus-based analysis are as follows:

  • it is empirical, analyzing the actual patterns of use in natural texts;
  • it utilizes a large and principled collection of natural texts, known as a “corpus,” as the basis for analysis;
  • it makes extensive use of computers for analysis, using both automatic and interactive techniques;
  • it depends on both quantitative and qualitative analytical techniques.

These characteristics lead to at least four advantages in using a corpus-based approach in studies of language use. First, computers can handle large amounts of language and keep track of many contextual factors simultaneously. Second, actual language usage in the corpus (not just a theoretical construct) is the object of analysis. Third, computer-assisted analysis makes it possible to account for the extent to which a pattern is found and for the contextual factors that influence variability. Fourth, the data-handling capability of computers makes it possible to ask many research questions that were previously unfeasible.
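
As a rough illustration of the quantitative side of such analysis, the following Python sketch measures how often a pattern occurs in each text and tallies one contextual factor (the part of speech of the following word). The token records, the text names `text_A` and `text_B`, and the function names are invented for this example; they are toy data, not counts from any real corpus.

```python
from collections import Counter

# Toy records of (text_id, lemma, part_of_speech), invented for illustration.
TOKENS = [
    ("text_A", "καί", "conjunction"),
    ("text_A", "λέγω", "verb"),
    ("text_A", "καί", "conjunction"),
    ("text_A", "λόγος", "noun"),
    ("text_B", "λόγος", "noun"),
    ("text_B", "καί", "conjunction"),
    ("text_B", "λέγω", "verb"),
    ("text_B", "δέ", "conjunction"),
    ("text_B", "λόγος", "noun"),
]


def relative_frequency(tokens, lemma):
    """Share of each text's tokens that are the given lemma."""
    totals = Counter(text_id for text_id, _, _ in tokens)
    hits = Counter(text_id for text_id, lem, _ in tokens if lem == lemma)
    return {text_id: hits[text_id] / totals[text_id] for text_id in totals}


def following_pos(tokens, lemma):
    """Count the part of speech of the token right after each occurrence
    of the lemma, ignoring pairs that cross a text boundary."""
    counts = Counter()
    for (text_1, lem, _), (text_2, _, next_pos) in zip(tokens, tokens[1:]):
        if lem == lemma and text_1 == text_2:
            counts[next_pos] += 1
    return counts


TARGET = TOKENS[0][1]  # the lemma καί from the toy data
freq = relative_frequency(TOKENS, TARGET)
context = following_pos(TOKENS, TARGET)
print(freq)
print(context)
```

The first function addresses the extent to which a pattern is found; the second tracks a contextual factor that may influence variability. On a real corpus the same two questions would be asked over far more tokens and many more factors at once, which is where computer assistance becomes essential.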