Character Level (Diplomatic) Papyrus Encoding

Version 0.1 (01/02/2001)

OpenText.org Proposal February 2001

Editors:
Matthew Brook O'Donnell
Stanley E. Porter
Jeffrey T. Reed

Copyright (c) OpenText.org 2001-2004

Abstract

There is a series of levels of annotation involved in the encoding of papyrus documents, from character level details through to editorial decisions regarding variant interpretations and readings. These levels are associated with different text editions: diplomatic (character or base level), reconstructed (word divisions, accents and expansions) and reading (variant readings and interpretations). This document outlines the base diplomatic level of this annotation, which takes place at the character level, and describes the XML elements and attributes used for it.

Status of this document

This document is the initial proposal for the character level (diplomatic) papyrus annotation scheme. It is currently under review and comments are requested. Please post comments to the OpenText.org forum.

Table of contents

1. Introduction

2. Definitions

3. Features analyzed for character level encoding
3.1. Encoding of character data
3.2. Spacing, lacunae and edge marking
3.3. Relationship between character level encoding and standard Leiden conventions

4. Elements and attributes for character level annotation
4.1. <papyrus> element
4.2. <recto> element
4.3. <verso> element
4.4. <line> element
4.5. <c> element
4.6. <lacuna> element
4.7. <space> element
4.8. <edge> element

5. Examples

6. Use of character level papyrus annotation scheme and its components

7. Document Type Definition


1. Introduction

a. The base level of annotation for papyrus texts in the OpenText.org encoding model takes place at the character level. The encoding that takes place at this level parallels the creation of a strict diplomatic text for a printed edition. Each character within a text is marked individually and annotated according to its status and visibility. The annotation of word divisions or the reconstruction of missing or unclear characters does not take place until a higher level of annotation.

2. Definitions

  • [line] A line consists of a series of characters that constitute a flow of text. Frequently characters flow from left to right in horizontal lines. However, due to script style and other physical characteristics of the papyrus this may not always be the case.
  • [character] A character consists of a single orthographic unit, such as a letter or independent marking, together with its associated decoration. Certain styles of script may join a number of characters or make use of ligatures, making the boundaries of a character more difficult to determine. A character may be only partially visible or completely missing (a character-sized lacuna).
  • [decoration] Markings attached to or associated with a particular character (e.g. a stroke above or below a character, accent and breathing marks) are classified as decoration. Such markings are not treated as independent markings, but included in the annotation of the character with which they are associated.
  • [indepenmark] An independent marking is an orthographic unit that is not a letter that functions independently of surrounding characters. Examples might include punctuation marking or structural markings and musical-rhetorical notation. Such markings are annotated as characters.
  • [edge] The beginning or end of a particular line of characters in a papyrus may be written at what is now the edge of the papyrus due to the physical shape of the papyrus or some kind of damage. Such instances, where there do not appear to be missing characters, are classified as edges.
  • [lacuna] A lacuna is a missing section of a papyrus. Where possible its size should be marked in terms of the approximate number of characters the missing section could have contained. Gaps that appear to be only one character in size can instead be marked as characters with missing status and no visibility.
  • [space] A space is a section of papyrus that is a noticeable gap between characters. To be classified as a space the papyrus must be present in this gap. A space should be measured in terms of the approximate number of characters it could contain.
  • [visibility] A character's visibility is how clearly it can be seen as an identifiable letter or character on the papyrus. Values are: (1) clear (default), (2) unclear (e.g. only part of character is visible), (3) illegible (some ink is visible but shape and nature of character cannot be determined), and (4) none (the character is missing and cannot be seen at all).
  • [status] A character's status relates to its intended presence on the papyrus. Values are: (1) present (default), (2) missing (a hole or missing section where the character would have been), (3) deleted (erasure of the character can be detected), and (4) inserted (the character has been inserted above or between characters or written over a deleted character).
  • [join] A join between two characters occurs when a stroke or amount of ink links the two characters together. In the case of ligatures, where two letters are written very close together or almost on top of one another, each letter is marked as a character with a join between them indicated.

3. Features analyzed for character level encoding

    a. The focus of this version of the character level specification is upon the encoding of characters in a papyrus. Future versions will add more detailed attributes for the <papyrus>, <recto> and <verso> elements that provide the framework for the character encoding. Details to include concern the physical properties of the papyrus, e.g. size, color, quality etc., and additional notes for features such as the style of the hand. As of this version of the specification the <papyrus>, <recto> and <verso> elements are simply container elements for the line and character elements.

    b. Each line in the manuscript is marked with a unique identifier and contains all character, spacing and lacuna elements for that line.

    3.1. Encoding of character data

    a. The boundaries and nature of a character are ascertained on the basis of the criteria outlined in the definition of a character (see Definitions). Regardless of the visibility or presence of a character, if there is a reasonable degree of certainty regarding the original presence of a character at a particular position in the manuscript it is assigned an individual <c> element.

    b. A character's visibility relates to whether it can be clearly seen. Characters whose presence can be easily or clearly seen should be marked as having clear visibility. Where a character is faded, smudged or partially missing, but enough of the character remains to give a reasonable idea of its value (though some uncertainty remains), it should be marked as having unclear visibility. This is parallel to the use of a sublinear dot under a letter in the Leiden conventions (see below). When so little can be seen of a character that it is difficult to make a decision on the value of the character (a sublinear dot with no letter in the Leiden convention), the character should be marked as having illegible visibility. Finally, where a character is missing (has missing status), it should be marked with visibility none.

    c. If the letter of the character can be ascertained with a reasonable degree of certainty, that is, it is classified as being clear or unclear but still legible, then the value of the character should be contained within the <c> element. There are two options for this encoding: (1) the standard beta-code values for Greek letters, with the exception of 'c' for χ (chi) and 'x' for ξ (xi), the reverse of beta-code, or (2) the standard Greek Unicode code points (U+0391 to U+03A9 [uppercase] and U+03B1 to U+03C9 [lowercase]).
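As a rough illustration of how option (1) relates to option (2), the sketch below maps the lowercase beta-code letters onto the Unicode lowercase range. The table name `BETA_TO_UNICODE` and the function `to_unicode` are hypothetical and not part of the specification. It assumes that 'x' stands for xi and 'c' for chi (the reverse of standard beta-code, where C is xi and X is chi), and it always maps 's' to medial sigma, since word boundaries are not known at the character level.

```python
# Lowercase Greek letters U+03B1..U+03C9, skipping final sigma (U+03C2).
GREEK = [chr(cp) for cp in range(0x03B1, 0x03CA) if cp != 0x03C2]

# Beta-code letters in Greek alphabetical order.
# Assumption: 'x' = xi and 'c' = chi, the reverse of standard beta-code.
BETA = "abgdezhqiklmnxoprstufcyw"

BETA_TO_UNICODE = dict(zip(BETA, GREEK))


def to_unicode(letter: str) -> str:
    """Return the Unicode Greek letter for one beta-code letter.

    An empty string (an illegible or missing character) is returned unchanged.
    """
    if letter == "":
        return ""
    return BETA_TO_UNICODE[letter]


# The letters of "alexandrian", as they appear in the examples of this document.
print("".join(to_unicode(ch) for ch in "alexandrian"))
```

Because the conversion is a simple one-to-one table, either encoding can be stored in the <c> element and converted to the other when needed.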

    d. Each character must be assigned a unique identifier (id) within the document, allowing reference from higher levels of annotation.

    e. Every character is required to have two attributes aside from the id attribute. These are the status and visibility attributes (see Definitions).

    f. A character's status indicates whether its intended presence can be detected in the manuscript, but not necessarily whether or not it is legible or can be interpreted as a particular letter. If the presence of the character is discernible, even if only a short stroke remains, it should be marked as having present status. If, however, it seems that the character has been subsequently rubbed out or deleted intentionally, it should be marked as having deleted status. Characters that seem to have been written over deleted characters or inserted in an interlinear position should be marked as having inserted status. Finally, where a character space or hole exists in the manuscript, the character should be marked as having missing status.

    g. Both the status and visibility attributes have default values that will be applied to a character unless they are otherwise specified on a <c>. The default values are: status="present" and visibility="clear".

    h. Other optional features marked for a character are the indication of the connection between two characters (with the join attribute) and the recording of markings associated with a character (with the decoration attribute).
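The requirements in paragraphs d to h can be checked mechanically. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; the function name `check_characters` and the wording of the messages are hypothetical. It applies the default values, then reports duplicate identifiers, unknown status or visibility values, and join attributes that point at no character in the same line.

```python
import xml.etree.ElementTree as ET

STATUS = {"present", "missing", "inserted", "deleted"}
VISIBILITY = {"clear", "unclear", "illegible", "none"}


def check_characters(line_xml: str) -> list[str]:
    """Return a list of problems found in the <c> elements of one <line>."""
    line = ET.fromstring(line_xml)
    chars = line.findall("c")
    problems = []
    seen = set()
    for c in chars:
        cid = c.get("id")
        if cid is None:
            problems.append("<c> without id")
        elif cid in seen:
            problems.append(f"duplicate id {cid}")
        else:
            seen.add(cid)
        # The defaults apply when the attribute is absent.
        status = c.get("status", "present")
        visibility = c.get("visibility", "clear")
        if status not in STATUS:
            problems.append(f"{cid}: unknown status {status!r}")
        if visibility not in VISIBILITY:
            problems.append(f"{cid}: unknown visibility {visibility!r}")
    # This sketch only looks for join targets within the same line.
    for c in chars:
        target = c.get("join")
        if target is not None and target not in seen:
            problems.append(f"{c.get('id')}: join target {target} not found")
    return problems


sample = """<line id="l6">
  <c id="c150">n</c>
  <c id="c151" status="insert">d</c>
  <c id="c152" join="c153">e</c>
</line>"""
print(check_characters(sample))
```

A validating XML parser with the DTD would catch the same attribute errors; the sketch is only meant to make the rules concrete.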

    3.2. Spacing, lacunae and edge marking

    a. Although it is common for no space divisions to occur between words in a papyrus manuscript, there are frequently spaces between characters (often in the A to B greeting formulae in private letters, for instance) or indentation. These spaces are marked with the <space> element. An optional attribute size can be specified to indicate the approximate size of the space in terms of characters. This attribute defaults to one character if left unspecified.

    b. Papyrus manuscripts are commonly fragmentary or contain missing sections. As noted above, where a hole or missing section occurs within a line of characters and the number of missing characters can be estimated with relative certainty, the missing characters should be marked with <c status="missing" visibility="none"></c> elements, one <c> element for each missing character. However, it is not always possible to estimate the number of missing characters, or it may be thought unlikely that the whole of the missing section contained characters. In such instances the missing section can be marked as a lacuna with the <lacuna> element. An optional attribute size can be specified to indicate the approximate size of the lacuna, either as an approximate number of characters or as a standard measurement, e.g. 2cm. This attribute defaults to one character if left unspecified.

    c. The left or right edges of the papyrus are assumed to lie just beyond the beginning and end of each line of characters. However, it is sometimes necessary to indicate that an edge occurs right at the beginning or end of a line, perhaps with half of the first or last character visible. The difference between an edge here and a lacuna, described in the previous paragraph, is that the missing section creating the jagged edge is not thought to contain a whole character.
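The size rules of this section can be sketched as a small width estimate for a line. This is an illustration only, assuming Python's standard `xml.etree.ElementTree` module; the function name `estimated_width` is hypothetical. Each <c> counts as one position, <space> and <lacuna> count as their size (one character if unspecified), and <edge> adds nothing. When a size is given as a physical measurement such as 2cm, no character count can be derived and the sketch returns None.

```python
import xml.etree.ElementTree as ET


def estimated_width(line_xml: str):
    """Approximate width of a line in characters, or None if it cannot be counted."""
    width = 0
    for child in ET.fromstring(line_xml):
        if child.tag == "c":
            # Every <c>, visible or missing, occupies one character position.
            width += 1
        elif child.tag in ("space", "lacuna"):
            size = child.get("size", "1")  # defaults to one character
            if not size.isdigit():
                return None  # e.g. "2cm": a measurement, not a character count
            width += int(size)
        # <edge/> marks a boundary and contributes no width.
    return width


sample = (
    '<line id="l2"><c id="c11">g</c><c id="c12">o</c><c id="c13">s</c>'
    '<space size="2"/><c id="c14">k</c><c id="c15">a</c></line>'
)
print(estimated_width(sample))  # 7
```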

    3.3. Relationship between character level encoding and standard Leiden conventions

    a. The Leiden conventions for transcription are widely accepted by papyrologists and broadly applied in most printed editions. They are well suited to a single edition of a text in the printed medium as they allow encoding of physical characteristics (unclear and illegible characters, insertions and deletions) as well as editorial expansions, insertions, corrections and interpretations.

    Leiden convention: character can be clearly identified
    OpenText.org character level equivalent:
    <c id="c1" status="present" visibility="clear">a</c>

    Leiden convention: character with sublinear dot, used to indicate that the character is unclear or imperfect in some way
    OpenText.org character level equivalent:
    <c id="c1" status="present" visibility="unclear">a</c>
    Comment: The convention uses the sublinear dot to cover a number of character deficiencies. The use of the status and visibility attributes provides a more delicate classification. Future developments of the character level scheme may require further attributes to specify the nature of the uncertainty, e.g. crossbar present, small ink blob in right hand corner, and so on.

    Leiden convention: sublinear dot alone, used to indicate that the character is illegible
    OpenText.org character level equivalent:
    <c id="c1" status="present" visibility="illegible"></c>
    Comment: Each illegible character (as far as they can be determined) is assigned a <c> element, allowing higher levels of annotation to provide reconstruction; printed editions must do this through critical footnotes.

    Leiden convention: missing character(s) indicated with square brackets; where a series of characters is missing, the number is indicated in brackets with sublinear dots or the number itself (e.g. [3])
    OpenText.org character level equivalent:
    <c id="c1" status="missing" visibility="none"></c>
    <c id="c2" status="missing" visibility="none"></c>
    <c id="c3" status="missing" visibility="none"></c>
    Comment: Each missing character (as far as they can be determined) is assigned a <c> element, allowing higher levels of annotation to provide reconstruction; printed editions must do this through critical footnotes.

    Leiden convention: ']' at the beginning of a line indicates an unknown number of missing letters (can also be used to indicate the left edge of the papyrus); '[' at the end of a line indicates an unknown number of missing characters that extend to the edge of the papyrus (can also be used to indicate the right edge of the papyrus); matching brackets indicate a lacuna of unknown or unspecified character length
    OpenText.org character level equivalent:
    <lacuna size="3"/>
    <lacuna size="2cm"/>
    <edge/>
    Comment: The annotation scheme provides separate elements for a lacuna that may extend to the edges of the papyrus and for marking the edge where letters do not appear to be missing. The <lacuna> element can specify the approximate size of the missing section in characters or physical dimensions.

    Leiden convention: letters in parentheses indicate the expansion or resolution of an abbreviation in the manuscript
    OpenText.org character level equivalent:
    <c id="c1">e</c>
    <c id="c2">u</c>
    <c id="c3">c</c>
    EXPANSION NOT MARKED AT CHARACTER LEVEL
    Comment: At the character level only the characters of the abbreviation are marked; expansion is indicated at a higher level of annotation.

    Leiden convention: lost letter that has been restored by the editor on the basis of context or parallel
    OpenText.org character level equivalent:
    <c id="c1" status="missing" visibility="none"></c>
    Comment: The convention conflates character level annotation (the character is missing) with higher level reconstruction (inserting the likely missing character). Character level encoding notes that the character is missing and cannot be seen, with no content within the <c> element. Higher levels of annotation will associate the reconstructed letter with the character by reference to its id.

    Leiden convention: letters deleted in the papyrus; sublinear dots indicate that the erased character(s) are illegible; the number of deleted characters is indicated by the number of dots
    OpenText.org character level equivalent:
    <c id="c5" status="deleted" visibility="illegible"></c>
    <c id="c6" status="deleted" visibility="illegible"></c>
    <c id="c7" status="deleted" visibility="illegible"></c>
    Comment: Each deleted character (as far as they can be determined) is assigned a <c> element with its own identifier. This allows reference to be made to the character from higher levels of annotation.

    Leiden convention: letters deleted in the papyrus; letters between double brackets indicate that the erased character(s) are legible to some degree
    OpenText.org character level equivalent:
    <c id="c51" status="deleted" visibility="clear">a</c>
    <c id="c52" status="deleted" visibility="unclear">b</c>
    <c id="c53" status="deleted" visibility="unclear">g</c>
    Comment: Each deleted character is assigned a <c> element with its own identifier. This allows reference to be made to the character from higher levels of annotation.

    Leiden convention: letters between ticks have been inserted into the manuscript, either on top of deleted letters or above/below the line
    OpenText.org character level equivalent:
    <c id="c51" status="inserted" visibility="clear">d</c>
    <c id="c52" status="inserted" visibility="clear">e</c>
    Comment: Each inserted character is assigned a <c> element with its own identifier and status="inserted" and visibility="clear".
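The correspondences above can also be read in the other direction. The following is a rough sketch, assuming Python's standard `xml.etree.ElementTree` module, of how the <c> elements of a line might be rendered in a Leiden-like notation. The names `glyph` and `to_leiden` are hypothetical, and several rendering choices are assumptions: a full stop stands in for a sublinear dot without a letter, and inserted characters are wrapped in a backslash and slash in place of the insertion ticks. Only <c> elements are handled; <lacuna>, <space> and <edge> are ignored.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

DOT_BELOW = "\u0323"  # combining dot below, placed after the letter it marks


def glyph(c):
    """Render one <c> element according to its visibility."""
    visibility = c.get("visibility", "clear")
    letter = c.text or ""
    if visibility == "clear":
        return letter
    if visibility == "unclear":
        return letter + DOT_BELOW
    return "."  # illegible or none: a dot stands in for the unreadable character


def to_leiden(line_xml):
    """Render the <c> elements of a line, grouping consecutive runs by status."""
    chars = ET.fromstring(line_xml).findall("c")
    out = []
    for status, run in groupby(chars, key=lambda c: c.get("status", "present")):
        text = "".join(glyph(c) for c in run)
        if status == "missing":
            out.append("[" + text + "]")
        elif status == "deleted":
            out.append("\u27e6" + text + "\u27e7")  # double square brackets
        elif status == "inserted":
            out.append("\\" + text + "/")  # assumed stand-in for insertion ticks
        else:
            out.append(text)
    return "".join(out)


sample = (
    '<line id="l6"><c id="c150">n</c>'
    '<c id="c150.1" visibility="illegible" status="deleted"/>'
    '<c id="c150.2" visibility="illegible" status="deleted"/>'
    '<c id="c151" status="inserted">d</c>'
    '<c id="c152" status="inserted">e</c>'
    '<c id="c153">e</c></line>'
)
print(to_leiden(sample))
```

Going from Leiden notation back to character level markup is not generally possible without loss, since a single sublinear dot covers several distinct combinations of status and visibility.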

    b. Other conventions used in the creation of printed editions are discussed in the reconstructed and reading level annotation schemes.

    4. Elements and attributes for character level annotation

    4.1. <papyrus> element


    syntax:<papyrus>...</papyrus>
    function:outside container of papyrus
    use:in-line
    contains:exactly one <recto> element and an optional <verso> element
    attributes:NONE SPECIFIED AS OF THIS VERSION

    4.2. <recto> element


    syntax:<recto>...</recto>
    function:defines recto (front) side of papyrus
    use:in-line
    contains:any number of <line> elements
    attributes:NONE SPECIFIED AS OF THIS VERSION

    4.3. <verso> element


    syntax:<verso>...</verso>
    function:defines verso (reverse) side of papyrus
    use:in-line
    contains:any number of <line> elements
    attributes:NONE SPECIFIED AS OF THIS VERSION

    4.4. <line> element


    syntax:<line id="...">...</line>
    function:marks the boundaries of a line of papyrus
    use:in-line or out-of-line
    contains:any number and combination of <c>, <space>, <lacuna> and <edge> elements
    attributes:
    attribute | description and values | status
    id | unique identifier of line, e.g. l5 | REQUIRED
    Example:
    <line id="l2">
    <c id="c11">g</c>
    <c id="c12">o</c>
    <c id="c13">s</c>
    <space size="2"/>
    <c id="c14">k</c>
    <c id="c15">a</c>
    </line>
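A line such as the one in this example could be generated from a plain transcription. The sketch below is one possible approach, assuming Python's standard `xml.etree.ElementTree` module; the function name `build_line`, its parameters, and the convention that a run of blanks in the input stands for a <space> are all hypothetical. Every character receives the default status and visibility, so any damaged or unclear characters would still need to be marked by hand.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def build_line(line_id, text, first_id):
    """Build a <line> element from letters; runs of blanks become <space size="n"/>."""
    line = ET.Element("line", id=line_id)
    next_id = first_id
    for is_blank, run in groupby(text, key=str.isspace):
        run = list(run)
        if is_blank:
            ET.SubElement(line, "space", size=str(len(run)))
        else:
            for letter in run:
                # Identifiers are numbered consecutively: c11, c12, ...
                c = ET.SubElement(line, "c", id=f"c{next_id}")
                c.text = letter
                next_id += 1
    return line


line = build_line("l2", "gos  ka", 11)
print(ET.tostring(line, encoding="unicode"))
```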

    4.5. <c> element


    syntax:<c id="..." visibility="..." status="...">...</c>
    function:marks the boundaries of a character and its status and visibility
    use:in-line
    contains:a single character
    attributes:
    attribute | description and values | status
    id | unique identifier of character, e.g. c1 | REQUIRED
    status | indicates the physical status of the character; allowable values are: present (default), missing, inserted, deleted | REQUIRED (default supplied)
    visibility | indicates the visibility of the character in the papyrus; allowable values are: clear (default), unclear, illegible, none | REQUIRED (default supplied)
    join | indicates that the character is joined by some means (e.g. a ligature) to the following character, which is indicated by its unique identifier | OPTIONAL
    decoration | indicates various kinds of decoration associated with the character, such as lines above or below; values specified as a whitespace separated list, including: line-below, line-above | OPTIONAL
    Examples:
    <c id="c1">g</c>
    <c id="c11" visibility="unclear">a</c>
    <c id="c15" visibility="illegible"></c>
    <c id="c21" decoration="line-above">i</c>
    <c id="c22" decoration="line-above">c</c>
    <c id="c45" join="c46">g</c>
    <c id="c52" status="deleted" visibility="unclear">e</c>
    <c id="c53" status="inserted">i</c>
    <c id="c70" status="missing" visibility="none"></c>

    4.6. <lacuna> element


    syntax:<lacuna [size="..."]/>
    function:marks a missing section in a line of characters
    use:in-line
    contains:EMPTY
    attributes:
    attribute | description and values | status
    size | approximate size of missing section in characters | OPTIONAL
    Example:
    <lacuna/>
    <lacuna size="3"/>

    4.7. <space> element


    syntax:<space [size="..."]/>
    function:marks space between characters
    use:in-line
    contains:EMPTY
    attributes:
    attribute | description and values | status
    size | approximate size of space in characters | OPTIONAL
    Example:
    <space size="2"/>
    <space/>

    4.8. <edge> element


    syntax:<edge/>
    function:marks an edge of papyrus in a line of characters
    use:in-line
    contains:EMPTY
    attributes:NONE
    Example:
    <line id="l3">
    <edge/>
    <c>l</c>
    <c>o</c>
    ...
    <c>h</c>
    <edge/>
    </line>

    5. Examples

    a. The following examples are lines from P.Oxy. 119:

    line 4
    <line id="l4">
    <lacuna/>
    <c id="c84" visibility="unclear">t</c>
    <c id="c85">e</c>
    <c id="c86">s</c>
    <c id="c87">o</c>
    <c id="c88">u</c>
    <c id="c89" join="c90">e</c>
    <c id="c90">i</c>
    <c id="c91">s</c>
    <c id="c92">a</c>
    <c id="c93">l</c>
    <c id="c94">e</c>
    <c id="c95">x</c>
    <c id="c96">a</c>
    <c id="c97">n</c>
    <c id="c98" visibility="illegible"/>
    <c id="c99">r</c>
    <c id="c100">i</c>
    <c id="c101">a</c>
    <c id="c102">n</c>
    <c id="c103">o</c>
    <c id="c104">u</c>
    <c id="c105" join="c106">m</c>
    <c id="c106">h</c>
    <c id="c107">g</c>
    <c id="c108">r</c>
    <c id="c109">a</c>
    <c id="c110">y</c>
    <c id="c111">w</c>
    <c id="c112">s</c>
    <c id="c113" join="c114">e</c>
    <c id="c114">e</c>
    </line>

    line 5
    <line id="l5">
    <c id="c115">p</c>
    <c id="c116" visibility="unclear">i</c>
    <c id="c117">s</c>
    <c id="c118">t</c>
    <c id="c119">o</c>
    <c id="c120">l</c>
    <c id="c121">h</c>
    <c id="c122">n</c>
    <c id="c123">o</c>
    <c id="c124">u</c>
    <c id="c125">t</c>
    <c id="c126" visibility="illegible"/>
    <c id="c127">l</c>
    <c id="c128">a</c>
    <c id="c129" visibility="unclear">l</c>
    <c id="c130">w</c>
    <c id="c131">s</c>
    <c id="c132">e</c>
    <c id="c133">o</c>
    <c id="c134" visibility="unclear">u</c>
    <c id="c135">t</c>
    <c id="c136">e</c>
    <c id="c137" visibility="unclear">u</c>
    <c id="c138">i</c>
    <c id="c139">g</c>
    <c id="c140">e</c>
    <c id="c141">n</c>
    <c id="c142">w</c>
    <c id="c143">s</c>
    <c id="c144">e</c>
    </line>

    line 6
    <line id="l6">
    <c id="c145">e</c>
    <c id="c146">i</c>
    <c id="c147">t</c>
    <c id="c148">a</c>
    <c id="c149" visibility="unclear">a</c>
    <c id="c150">n</c>
    <c id="c150.1" visibility="illegible" status="deleted"/>
    <c id="c150.2" visibility="illegible" status="deleted"/>
    <c id="c151" status="inserted">d</c>
    <c id="c152" status="inserted">e</c>
    <c id="c153">e</c>
    <c id="c154">l</c>
    <c id="c155">q</c>
    <c id="c156">h</c>
    <c id="c157">s</c>
    <c id="c158">e</c>
    <c id="c159">i</c>
    <c id="c160">s</c>
    <c id="c161">a</c>
    <c id="c162">l</c>
    <c id="c163">e</c>
    <c id="c164">x</c>
    <c id="c165">a</c>
    <c id="c166">n</c>
    <c id="c167">d</c>
    <c id="c168">r</c>
    <c id="c169">i</c>
    <c id="c170">a</c>
    <c id="c171">n</c>
    <c id="c172">o</c>
    <c id="c173">u</c>
    </line>
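The join attributes in these examples can be resolved into pairs of linked characters. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; the function name `joined_pairs` is hypothetical, and the sample is a short excerpt of line 4 above.

```python
import xml.etree.ElementTree as ET


def joined_pairs(line_xml):
    """Return (letter, joined letter) pairs for every <c> that carries a join attribute."""
    line = ET.fromstring(line_xml)
    by_id = {c.get("id"): c for c in line.iter("c")}
    pairs = []
    for c in line.iter("c"):
        join = c.get("join")
        if join is None:
            continue
        target = by_id.get(join)
        if target is not None:
            # Empty elements (illegible or missing characters) have no text.
            pairs.append((c.text or "", target.text or ""))
    return pairs


excerpt = """<line id="l4">
  <c id="c88">u</c>
  <c id="c89" join="c90">e</c>
  <c id="c90">i</c>
  <c id="c91">s</c>
</line>"""
print(joined_pairs(excerpt))  # [('e', 'i')]
```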

    6. Use of character level papyrus annotation scheme and its components

    a. The character level papyrus annotation schema provides the foundational level of annotation in the OpenText.org papyrus encoding model. It is recommended that annotators utilize all of the elements and relationships specified in the document and where possible use <c> elements with unique identifiers for missing sections. This allows for the maximum amount of data to be recorded at the base level, making it accessible to the higher levels of annotation.

    7. Document Type Definition

    a. The current version of the character level papyrus DTD, reproduced below, can be found at http://www.opentext.org/dtds/papyrus/character01.dtd

    b. It should be included in XML documents using the following syntax:

    <!DOCTYPE papyrus SYSTEM "http://www.opentext.org/dtds/papyrus/character01.dtd">

    
    
    
    <!--
    
    	Character Level Papyrus Annotation
    
    	Version 0.1
    
    	http://www.OpenText.org 2000(c)
    
    	01/02/2001
    -->
    
    <!ELEMENT papyrus (recto,verso?)>
    
    <!ELEMENT recto (line+)>
    
    <!ELEMENT verso (line+)>
    
    <!ELEMENT line ( (c|space|lacuna|edge)+)>
    	<!ATTLIST line id ID #REQUIRED>
    
    <!ELEMENT c (#PCDATA)>
    	<!ATTLIST c id ID #REQUIRED>
    	<!ATTLIST c status (present|missing|inserted|deleted) "present">
    	<!ATTLIST c visibility (clear|unclear|illegible|none) "clear">
    	<!ATTLIST c join IDREF #IMPLIED>
    	<!ATTLIST c decoration NMTOKENS #IMPLIED>
    	
    <!ELEMENT space EMPTY>
    	<!ATTLIST space size CDATA "1">
    
    
    <!ELEMENT lacuna EMPTY>
    	<!ATTLIST lacuna size CDATA "1">
    
    <!ELEMENT edge EMPTY>