A Fast, Practical Method For Checking the Accentual Structure and Integrity Of Tiberian-Pointed Biblical Texts

Richard L. Goerwitz III


Computers and Cantillation

Over the last twenty years, work on the Tiberian accentual system has fallen increasingly under the spell of the computer. In the late 1970s, for example, G. E. Weil utilized a machine-readable biblical text to create an accentual concordance for parts of the Hebrew Bible (Weil 1978). More recently, James D. Price has tried to formalize the Tiberian accentual system as a context-free "grammar" - one that he could implement as a simple computer program (Price 1990). Right now there is a dissertation in progress at the University of Texas (Austin) based, in part, on an automated parsing system capable of diagramming and manipulating the accentual structure of the Tiberian text (Churchyard, forthcoming).

Arcane as these systems might seem, they have many practical ramifications. G. E. Weil, for example, used his concordances to further the work he did on the Biblia Hebraica Stuttgartensia masora. James Price's accentual "grammar" provided him with a ready means of developing and checking his non-Wickesian theories of how the Tiberian accents work. Henry Churchyard now is using his parsing tools to settle the debate, initiated by E. J. Revell (1976), over whether pausally vowelled forms reflect an older, simpler reading system than the one expressed in the accents, or whether the vowels and accents merely represent separate facets of a single underlying linguistic system.

Though in one sense very practical, such systems directly benefit only a small circle of scholars. This is certainly not due to any inherent limitations of the technology. There are many areas in which these same methods might be used to solve problems affecting a much wider community. Most obvious of these areas is the production of standard biblical editions. Such editions import more errors into the accentuation than into any other facet of the text. Even Biblia Hebraica Stuttgartensia - a work known for its accuracy - contains hundreds of accentual errors. While many of these errors reflect problems with the original manuscript (Leningrad B19a), the rest are errors pure and simple. With the new computer technologies, it is possible to eliminate nearly all of them.

The purpose of this paper is to sketch out, both in theory and in actual implementation, the design of a system that does just this - i.e., a system that acts like a kind of Masoretic "spell-checker," facilitating removal of accentual errors from modern editions of the Hebrew/Aramaic Bible. This system will enable publishers to attain a level of accuracy not dreamt of since the time of the Masoretes, and to reduce dramatically the time and resources consumed by the usual editing and proofreading cycles.

Context-Free Grammars

In his 1990 book, The Syntax of Masoretic Accents in the Hebrew Bible, James D. Price claims that the accents can be parsed using a simple context-free grammar. Although his book does much to systematize and elucidate the structure of the Tiberian accentuation system, it never actually gets around to offering a full and truly explicit grammar. Why? Because the goal, at least as Price conceives it, is unattainable.

Take, for example, the accent revia. Revia divides tifcha, zaqef, and segolta clauses, and can be repeated as necessary. Revia, however, cannot follow itself too closely. If fewer than three words intervene between one revia and the next, the latter revia is replaced by pashta, unless this pashta would land within two words of the next tevir or zarqa, in which case tevir or zarqa is used. This complex series of replacement rules requires prosodic information simply not available to an accent-only grammar. It also requires complex forward and backward scanning. Such a system lies well beyond the theoretical reach of a simple, self-contained, context-free accentual grammar.
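To make the difficulty concrete, here is a minimal sketch in C of just the first half of the revia rule. It is not taken from the Accents source. The function name, the one-accent-name-per-word input, and the use of an empty string for words we do not care about are all assumptions made for illustration, and the sketch ignores clause boundaries and the further tevir/zarqa condition. The point is only that the check depends on counting words, which is information a grammar over accent names alone does not carry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Accents.  words[i] holds the name of
 * the accent on word i, or "" for a word whose accent is irrelevant here.
 * Returns the index of the first revia that has fewer than three words
 * between it and the preceding revia, or -1 if there is none.
 */
long first_close_revia(const char *const words[], size_t nwords)
{
    long last_revia = -1;       /* index of the previous revia, if any */

    assert(words != NULL || nwords == 0);

    for (size_t i = 0; i < nwords; i++) {
        if (strcmp(words[i], "revia") != 0)
            continue;
        if (last_revia >= 0) {
            /* Number of words strictly between the two revias. */
            long between = (long)i - last_revia - 1;
            if (between < 3)
                return (long)i;
        }
        last_revia = (long)i;
    }
    return -1;
}
```

A correctly pointed text should never trip this check, since the second revia would already have been replaced. Deciding what it should have been replaced with requires looking ahead to the next tevir or zarqa, which is the forward scanning mentioned above.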

Computational Tractability

The problems with Price's system, it turns out, are not just theoretical, but practical as well. Even if he could have created and incorporated sufficient prosodic information and context sensitivity into his grammar, he would have found that there exist no efficient, reliable methods for programming a computer to process grammars of this sort. For all practical purposes, the most powerful grammars that computers can deal with effectively are those processable using a deterministic "pushdown" automaton known as an LR parser (see Aho-Sethi-Ullman 1986). LR-parsable grammars (a fairly limited subset of context-free grammars) can be handled by computers rapidly and with timely, very efficient error recovery. Grammars that fall outside this range, however, typically require elaborate systems and complex, often inefficient, error handling procedures - that is, if, in fact, they can be implemented at all (Tofte 1990; note also Tomita 1985).

Given its extremely complex and multi-tiered nature, the Tiberian Hebrew cantillation system would seem, at first glance, to defy automated analysis. Consider for a moment, however, the goal. All we want is a system that recognizes errors in the accentuation of modern biblical editions. To achieve this goal, we do not need to construct a full and theoretically complete parsing system. We can, instead, settle for a simpler, less accurate parser - one that brings us back into the realm of computationally tractable grammars, and allows us immediate access to a wide assortment of well-developed software tools and methods. In a theoretical sense, such a move is "cheating." In practical terms, however, we are making precisely those concessions that enable us to develop a working system.

"Cheating"

The places where "cheating" is most critical - i.e., where we have little choice but to misrepresent the grammar to obtain a working system - come mostly in the area of relative position restrictions (e.g. segolta cannot follow zaqef or atnach) and in the area of replacement rules (e.g. revia -> pashta/zarqa/tevir). Some of these phenomena might be describable within the LR framework, if we were willing to rewrite our grammar in verbose and unnatural ways. For example, one could conceivably handle segolta's distribution by creating one specialized atnach, and two specialized silluq, clauses, i.e., an atnach clause whose first major divider was segolta, and a silluq clause whose first major divider was either segolta or an atnach clause with a segolta. Such rules only complicate the grammar, and render it unnatural. And this all for nought, since they still cannot formalize processes that involve syllable structure, word count, or intuitive judgements like when to use segolta or shalshelet in place of, say, atnach, zaqef, or zaqef gadol. We have little choice, then, but to "cheat" - that is, if we want a practical, automated system, and want it right now.

Though pressed into "cheating" by the limitations of existing technologies, we need not abandon any hope of dealing with verses containing accents like segolta and shalshelet. All we need to do is to extend our grammar so that it accepts all valid constructs, and, in addition, just a few invalid ones as well. For example, to account for the distribution of pashta, we simply ignore the revia -> pashta conversion rule, analyzing all pashtas in the same way. Similarly, to formalize segolta's zaqef-like positions of occurrence, we simply admit segolta as a distinct clause type, roughly on a par with zaqef, and ignore the fact that it happens not to occur in several positions where zaqef does.

Such concessions, it turns out, do not cause many difficulties within a working system. The reason for this is that the vast, vast majority of errors introduced by the editing and typesetting processes consist of omissions, mindless mis-keyings, and exchanges of similar-looking accents, like azla -> pashta. Only rarely are authors or typesetters creative enough to introduce a mistake that happens to correspond to one of the concessions or "cheats" allowed into the grammar. See below on Known Bugs for a few examples of where this does happen. The fact that such oversights do occur does not impinge on the overall accuracy and utility of the system, which depends more on whether it consistently detects the more common classes of errors.

The Base Text

In order for automated accentual error detection to work, we need more than a grammar and a set of machine-readable texts to operate on. We also need our texts in a form that facilitates location and identification of the various accents. A good example of such a setup is the Biblia Hebraica Stuttgartensia version distributed by the Center for Computer Analysis of Texts. This version - originally developed by the University of Michigan under grants from the Packard Humanities Institute and the University of Michigan Computing Center - utilizes a series of two-digit codes to represent the Tiberian accents. For example, the CCAT BHS uses 73 for tifcha, 80 for zaqef, and 92 for atnach. There are a few ambiguous forms, such as 75 (which could be silluq or metheg), but all of these are fairly easy to resolve. The beauty of a simple, clean system like what we find in the CCAT texts is that it is trivial to manipulate electronically. Such a system therefore provides an ideal basis for automated error correction.
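As a small illustration of how little machinery such a scheme demands, the following C sketch maps the codes quoted above to accent names. The table is deliberately incomplete: it lists only the four codes mentioned in this section, and the struct and function names are hypothetical rather than drawn from the Accents source. The ambiguous code 75 is returned as a combined label, since resolving it requires context that a bare lookup does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry per two-digit code.  Only codes quoted in the text are
 * listed here; the real CCAT inventory is larger. */
struct accent_code {
    const char *code;
    const char *name;
};

static const struct accent_code accent_table[] = {
    { "73", "tifcha" },
    { "80", "zaqef" },
    { "92", "atnach" },
    { "75", "silluq or metheg" },   /* ambiguous without context */
};

/* Return the accent name for a two-digit code, or NULL if the code is
 * not in the (partial) table above. */
const char *accent_name(const char *code)
{
    size_t n = sizeof accent_table / sizeof accent_table[0];

    assert(code != NULL);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(accent_table[i].code, code) == 0)
            return accent_table[i].name;
    }
    return NULL;
}
```

A typesetting-oriented coding would not support even this much, because visually identical accents would share a single code.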

Consider, for a moment, the antithesis of the CCAT texts: A proprietary coding system designed to work with a specific brand of typesetting software. Such a system would provide no motivation for distinguishing between, say, mahpak and yetiv, or azla and pashta. Why? Because typesetting codes exist solely to tell typesetting software how, what, and when to print. And, because mahpak/yetiv and azla/pashta look the same, there is no need to use different codes to represent them. It might be added that proprietary coding systems have a limited lifetime - essentially the same as that of the typesetting software they are used with. If they were easily converted to other forms, such schemes might still be useful, but the sad fact is that they usually are not easily convertible. Making matters worse, they are not designed as stand-alone information repositories. They also tend to incorporate material that, from an information processing standpoint, only gets in the way.

By way of contrast, schemes like the one used for the CCAT texts serve as efficient information repositories. They do not contain any superfluous information. They convert readily into other formats. And they can be readily accessed, maintained, and corrected. Such schemes, therefore, are what we should be using as the basis for our electronic and printed editions. They can be developed and maintained in a platform-neutral state and then converted, as needed, into this or that (or perhaps several) typesetters' native formats. The wide range of linguistic and scholarly word processing software that can import and export the CCAT biblical texts offers testimony to the validity of this approach.

The Implementation

With a set of texts coded according to some clean, general scheme in hand, it now becomes possible to pass these texts through an automated correction system. The system described here consists of two basic modules: 1) a lexical analyzer, and 2) a parser. The first module, the lexical analyzer, translates the accentual codes present in the texts into a form that the parser can utilize. The parser then translates them into a simple tree (see, e.g., the one given below). The parser itself is highly general and portable. The lexical analyzer, however, must be tailored to the particular coding scheme used. Right now, only one lexical analyzer module is available. This one is geared for the CCAT edition of BHS. I would be willing to write new modules if publishers and archives would make their materials available to me. Those with competent in-house programmers are welcome to compose new modules for themselves, although I would hope that they would forward the results back to me so I could look them over, and make them available, where appropriate, to others.

The most portable and widely implemented programming language these days is ANSI C. The most widely used parser generation and lexical analysis tools are YACC and Lex - all standard components of stock Unix systems, and implemented on many other platforms as well. Because of their portability, C, YACC, and Lex have been used as the basis for my accentual parser - which I simply call Accents.

The YACC portion of the parser consists of a series of rules having the general form:

  silluq-clause   : silluq-phrase
                  | tifcha-clause silluq-clause
                  | tevir-clause silluq-clause
                  | zaqef-clause silluq-clause
                  | atnach-clause silluq-clause

  silluq-phrase   : silluq
                  | mereka silluq
                  etc.

The above rules may be read as:

A silluq-clause consists of either a silluq-phrase, a tifcha-clause then a silluq-clause, a tevir-clause then a silluq-clause, a zaqef-clause then a silluq-clause, or an atnach-clause followed by a silluq-clause. A silluq-phrase consists either of silluq or mereka then silluq, etc.

Those who know something about formal grammars will note immediately that these rules are right recursive, and that they produce mostly-binary parse trees, e.g.:


           silluq-clause
                 /\
                /  \
               /    \
         zaqef-clause\
                      \
                       \
                  silluq-clause
                        /\
                       /  \
                      /    \
               tifcha-clause\
                             \
                              \
                         silluq-clause                       
                               |
                               |
                               |
                         silluq-phrase
                               /\
                              /  \
                             /    \
                         mereka  silluq

The actual rules found in the parser source code are more complex than the ones given above, and take into account, for example, that the tifcha clause above must follow the zaqef clause. Still, the general principle is the same. The grammar is simply the part of the system that knows what accentual phrases are made up of what accentual sub-phrases, and this down to the least conjunctive accents (which are merely listed along with the disjunctives that follow them).
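For readers more at home in C than in YACC notation, the simplified silluq-clause and silluq-phrase rules quoted earlier can be mimicked with a hand-written recursive recognizer. This is only a sketch under stated assumptions: it uses recursive descent rather than the LR tables that YACC generates, it treats each tifcha, tevir, zaqef, and atnach clause as a single already-recognized token, and the enum and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token set: sub-clauses are assumed to be pre-reduced. */
enum tok {
    TIFCHA_CLAUSE, TEVIR_CLAUSE, ZAQEF_CLAUSE, ATNACH_CLAUSE,
    MEREKA, SILLUQ, END
};

/* silluq-phrase : silluq | mereka silluq
 * Returns the number of tokens consumed, or 0 on failure. */
static size_t silluq_phrase(const enum tok *t)
{
    if (t[0] == SILLUQ)
        return 1;
    if (t[0] == MEREKA && t[1] == SILLUQ)
        return 2;
    return 0;
}

/* silluq-clause : silluq-phrase | <sub-clause> silluq-clause
 * The recursion is on the right, mirroring the right-recursive rules. */
static size_t silluq_clause(const enum tok *t)
{
    switch (t[0]) {
    case TIFCHA_CLAUSE:
    case TEVIR_CLAUSE:
    case ZAQEF_CLAUSE:
    case ATNACH_CLAUSE: {
        size_t rest = silluq_clause(t + 1);
        return rest ? rest + 1 : 0;
    }
    default:
        return silluq_phrase(t);
    }
}

/* A verse is accepted when a single silluq-clause covers every token
 * up to the END marker.  The array must be terminated by END. */
int verse_ok(const enum tok *t)
{
    size_t n;

    assert(t != NULL);
    n = silluq_clause(t);
    return n > 0 && t[n] == END;
}
```

Given the token sequence zaqef-clause, tifcha-clause, mereka, silluq, the calls nest in the same shape as the tree drawn above. Like the simplified rules, this sketch does not enforce ordering restrictions among the sub-clauses.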

The Lex portion of the parser mainly just lists accentual codes, pairing them with new names that the parser will recognize. The following code, for example, handles atnach, segolta, and shalshelet:

  92         { yylval.leaf = "atnach"    ; return ATNACH;    }
  01         { yylval.leaf = "segolta"   ; return SEGOLTA;   }
  65{TEXT}05 { yylval.leaf = "shalshelet"; return SHALSHELET;}

Basically, the left-hand column contains the two-digit patterns to look for. The remainder of each line contains C code telling Lex what to do when it finds the pattern on the left.

In a few cases the lexical analyzer becomes more elaborate than what we see above, as, for example, when deciding whether a given munach + paseq combination is simply that, or whether it is actually the disjunctive accent legarmeh. The lexical analyzer also handles so-called Betacode (used by the CCAT texts to mark books, chapters, and verses). Considerable effort was spent in trying to keep this part of the program out of the parser itself, so as to isolate features peculiar to the CCAT texts within the lexical analyzer. The results were something less than elegant, but better than the alternative, which was to sacrifice the parser's overall portability.

Running the Accents Program

As noted above, Accents is written in (ANSI) C, Lex, and YACC. It comes as a source-code distribution, which means that the user must compile it into executable form. Doing this is not difficult for an experienced C programmer, and on Unix systems is likely to require nothing more than a quick look at the makefile and a make. This process, though, will likely mystify users accustomed to running only prepackaged programs. If you find yourself mystified, just contact your local system administrator and ask him or her to perform the installation. If this is somehow impractical, you are welcome to drop me a line.

Once set up, Accents simply reads the standard input and sends a list of verses it has processed to the standard output. If your system does not support file redirection, then you probably will not be able to run Accents. Under Unix, you would typically type:

  accents -p < name-of-your-CCAT-BHS-file

where name-of-your-CCAT-BHS-file is the name of the file where your CCAT BHS text resides, and where -p is a command-line switch that tells Accents to print trees for the verses it parses.

The trees Accents outputs are not nearly so elaborate as the one depicted above. Rather it uses the simple indented notation shown below (Gen 1:1). The digits at the left-hand side of each line indicate the degree of nesting. Literal accent names (tifcha, munach, atnach, etc.) are listed at the innermost clausal levels, with no preceding digit:

  0 silluq_clause
    1 atnach_clause
      2 tifcha_phrase
        tifcha 
      2 atnach_phrase
        munach atnach 
    1 silluq_clause
      2 tifcha_phrase
          mereka tifcha 
      2 silluq_phrase
        mereka silluq
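The indented notation is simple enough to reproduce in a few lines of C. The sketch below is not the printing code from Accents. It assumes a hypothetical binary node type and infers the layout from the sample above: two spaces of indentation per level, the nesting digit in front of clause and phrase names, and no digit in front of the literal accent names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical tree node.  A node with no children is treated as a
 * leaf whose label is the list of literal accent names. */
struct node {
    const char *label;
    const struct node *left;
    const struct node *right;
};

/* Format a single line into buf.  Clause and phrase lines carry their
 * depth as a prefix; leaf lines are indented only.  Returns the value
 * snprintf returns (the length the full line needs). */
int format_line(char *buf, size_t size, int depth,
                const char *label, int is_leaf)
{
    assert(buf != NULL && label != NULL && depth >= 0);

    if (is_leaf)
        return snprintf(buf, size, "%*s%s", 2 * depth, "", label);
    return snprintf(buf, size, "%*s%d %s", 2 * depth, "", depth, label);
}

/* Print a whole tree, one node per line, children after their parent. */
void print_tree(FILE *out, const struct node *n, int depth)
{
    char line[256];
    int is_leaf;

    if (n == NULL)
        return;
    is_leaf = (n->left == NULL && n->right == NULL);
    format_line(line, sizeof line, depth, n->label, is_leaf);
    fprintf(out, "%s\n", line);
    print_tree(out, n->left, depth + 1);
    print_tree(out, n->right, depth + 1);
}
```

Calling print_tree(stdout, root, 0) on a tree built to match Gen 1:1 would produce lines in the same style as the listing above, apart from trailing spaces.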

When invoked with the -p command-line option, Accents reports errors as part of the accentual parse trees it produces. For example in Exod 28:1, BHS has a zaqef where a revia appears to be required. Accents tries to parse the verse as best it can, in this case simply skipping over the erroneous zaqef clause. It then skips to the next tevir, and reports an erroneous tevir phrase, as we see below:

  Exod 28:1
  0 silluq_clause
    1 atnach_clause
      2 tifcha_clause
        3 tevir_clause
          4 pazer_phrase
            pazer 
          4 tevir_clause
            5 geresh_phrase
              munach telishaqetanna azla geresh 
            5 tevir_phrase
              ERROR 
        3 tifcha_phrase
          mereka tifcha 
      2 atnach_phrase
        atnach 
      etc....

If no -p command-line switch is given, Accents merely lists parsing errors as it encounters them, and gives book chapter:verse references for each verse it has parsed, e.g.:

  Gen 1:1
  Gen 1:2
  Gen 1:3
  Gen 1:4
  Gen 1:5
  ...
  Exod 4:9
  syntax error
  Exod 4:10
  Exod 4:11
    ...

Notice of an error precedes reference to the verse that caused the error - at least when running in this mode. If you supply -e on the command line, error messages are suppressed, and only those verses that contain errors are listed on the screen. In other words, if you type accents -e < name-of-your-CCAT-BHS-file, then instead of the above output you will see no references for any of the errorless verses, but only -

  ...
  Exod 4:10
  ...

Note that the -e option will also work with the -p option. In this case, Accents will only display trees that contain errors.
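Because an error notice precedes the verse it belongs to, anyone post-processing the default (no -e) output has to pair each notice with the line that follows it. The C sketch below shows one way to do that. It is an illustration only, not how the -e option is implemented, and it assumes that the notice reads exactly "syntax error" and that each verse reference occupies a line of its own, as in the sample output above. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical post-processing helper.  Scans lines of default Accents
 * output and stores, in out[], pointers to the verse references that
 * directly follow an error notice.  At most max_out entries are stored;
 * the return value is the total number of such verses found.
 */
size_t error_verses(const char *const lines[], size_t nlines,
                    const char *out[], size_t max_out)
{
    size_t found = 0;
    int pending = 0;    /* nonzero once an error notice has been seen */

    assert(lines != NULL || nlines == 0);

    for (size_t i = 0; i < nlines; i++) {
        if (strcmp(lines[i], "syntax error") == 0) {
            pending = 1;
        } else if (pending) {
            /* First non-notice line after an error: the culprit. */
            if (found < max_out)
                out[found] = lines[i];
            found++;
            pending = 0;
        }
    }
    return found;
}
```

Applied to the four lines Exod 4:9, syntax error, Exod 4:10, Exod 4:11, this yields the single reference Exod 4:10, matching what the -e listing shows.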

Occasionally the accentuation of a verse is just too bizarre for Accents to handle. In Exodus 20, for example, the doubly accented Ten Commandments throw Accents into some rather amusing contortions. Just ignore them. Ditto for the second version in Deuteronomy 5. In most instances, doubly accented verses are not that much of a problem. Accents, for instance, handles Gen 35:22 quite elegantly, reporting a single error-series after tifcha. Doubly accented verses are one area in which the Icon-based implementation is really much better than the C-based one. If you are concerned about this sort of thing, and have personnel on hand who know something about Icon, then, by all means, use that version of the program.

Known Bugs

As noted above ("Cheating"), there are a few accentual errors that Accents will not always catch. These are listed in the comments to the YACC parser code, as found in the Accents source code distribution. Salient examples include 1) pashta for what should be geresh and 2) some cases of yetiv for mahpak, and the reverse. These are very easy errors to make, but fortunately (although Accents will not always catch them) they are not the sorts of errors that cause problems in the printed text!

I am sure that other bugs/features :-) will surface as people use this program. By and large, though, Accents seems to function well enough at this stage for public distribution. If anyone is feeling ambitious, I invite him or her to use the "secret" -d option that Accents accepts. This option causes profuse debugging messages to be displayed. Under Unix, for example, one might type

  accents -d < text 2>&1 | more

You can weed out only those error messages that seem particularly interesting via egrep or an awk/sed filter. For example, if you want a list of superfluous or unrecognized accents, type

  accents -d < text 2>&1 | egrep 'Unrecog' | more

Then wait a while. With the -d option, Accents runs considerably more slowly than it does without. (It runs fastest when invoked with just the -e option.)

Finale

In conclusion, let me just say that I would be happy to help out in any way I can when users encounter bugs, or if users run into texts coded in a way Accents does not understand. Just drop me a note, or even send a letter to my home address. Accents is part of an open research project I am involved in. I believe very strongly that all academic research should be made freely redistributable, and that all academically derived software should be available as a source-code distribution that bears no restrictions on copying or modification - at least none other than to guarantee that it is not used by people who do not share these values. For specifics, see the file COPYING, included as part of the source-code distribution for Accents.

Report any problems you find or suggestions you have to me, Richard Goerwitz, at Richard@Goerwitz.com

Bibliography

Aho, Alfred V.; Sethi, Ravi; and Ullman, Jeffrey D. Compilers: Principles, Techniques, and Tools. Reading, Mass.: Addison-Wesley Pub. Co., 1986.

Bible. The Leningrad Codex. Torah, Nevi'im u-Khetuvim. Three volumes. Jerusalem: Makor, 1971.

Bible. Biblia Hebraica Stuttgartensia, quae antea cooperantibus A. Alt, O. Eissfeldt, P. Kahle ediderat R. Kittel. Editio funditus renovata adjuvantibus H. Bardtke [et al.] ... cooperantibus H. P. Rüger et J. Ziegler ediderunt K. Elliger et W. Rudolph. Textum Masoreticum curavit H. P. Rüger, Masoram elaboravit G. E. Weil. Stuttgart: Deutsche Bibelstiftung, (pref.) 1977.

Churchyard, Henry. "Topics in Tiberian Hebrew Metrical Phonology and Prosodics." Forthcoming University of Texas Austin dissertation. 1995?

Griswold, Ralph E. and Griswold, Madge T. The Icon Programming Language. Prentice Hall Software Series. 2nd ed. Englewood Cliffs, N.J.: Prentice Hall, 1990.

Hopcroft, John E. and Ullman, Jeffrey D. Introduction to Automata Theory, Languages, and Computation. Reading, Massachusetts: Addison-Wesley Pub. Co., 1979.

Kernighan, Brian W. and Ritchie, Dennis M. The C Programming Language. 2nd ed. Englewood Cliffs, N.J.: Prentice Hall, 1988.

Price, James D. The Syntax of Masoretic Accents in the Hebrew Bible. Studies in the Bible and early Christianity 27. Lewiston: Edwin Mellen Press, 1990.

Revell, E. J. "Biblical Punctuation and Chant in the Second Temple Period." Journal for the Study of Judaism 7:2 (1976), 181-198.

Sedgewick, Robert. Algorithms in C. Addison-Wesley Series in Computer Science. Reading, Mass.: Addison-Wesley Pub. Co., 1990.

Tofte, Mads. Compiler Generators--What They Can Do, What They Might Do, And What They Will Probably Never Do. Berlin and New York: Springer-Verlag, 1990.

Tomita, Masaru. Efficient Parsing for Natural Language: A Fast Algorithm For Practical Systems. The Kluwer International Series in Engineering and Computer Science; SECS 8. Natural language processing and machine translation. Boston: Kluwer Academic Publishers, 1985.

Weil, Gerard E., Riviere, P., and Serfaty, M. Concordance de la cantilation du Pentateuque et des Cinq Megillot. Documentation de la Bible 1. Paris: Editions du C.N.R.S., 1978.

Wickes, William. Two treatises on the accentuation of the Old Testament: Taame emet on Psalms, Proverbs, and Job; Taame kaf-alef sefarim on the twenty-one prose books. Prolegomenon, by Aron Dotan. Library of Biblical Studies Series. New York: Ktav Pub. House, 1970. Reprint under one cover of the 1881 and 1887 treatises.

Yeivin, Israel. Introduction to the Tiberian Masorah. Translated and edited by E. J. Revell. Masoretic Studies 5. Missoula, Mont.: Published by Scholars Press for the Society of Biblical Literature and the International Organization for Masoretic Studies, 1980.


Richard L. Goerwitz

Richard@Goerwitz.COM