Evaluating four readability formulas for Afrikaans

For almost a hundred years now, readability formulas have been used to measure how difficult it is to comprehend a given text. To date, four readability formulas have been developed for Afrikaans. Two such formulas were published by Van Rooyen (1986), one formula by McDermid Heyns (2007) and one formula by McKellar (2008). In our quantitative study the validity of these four formulas was tested. We selected 10 texts written in Afrikaans – five articles from a popular magazine and five documents used in government communications. All characteristics included in the four readability formulas were first measured for each text. We then developed five different cloze tests for each text to assess actual text comprehension. Thereafter, 149 Afrikaans-speaking participants with varying levels of education each completed a set of two of the resulting 50 cloze tests. On comparing the data on text characteristics to the cloze test scores from the participants, the accuracy of the predictions from the four existing formulas for Afrikaans could be determined. Both Van Rooyen formulas produced readability scores that were not significantly correlated with actual comprehension scores as measured with the cloze tests. For the McKellar formula, however, this correlation was significant and for the McDermid Heyns formula the correlation with the cloze test scores almost reached significance. From the outcomes of each of these last two formulas, about 40% of the variance in cloze scores could be predicted. Readability predictions based only on the average number of characters per word, however, performed considerably better: about 65% of the variance in the cloze scores could be predicted just from the average number of characters per word.


Introduction
For almost a hundred years now, 1 readability formulas have been used to measure objectively, or rather predict, how difficult it is to comprehend a given text. 2 Without doubt, the most famous readability formula was developed in 1948 by Rudolph Flesch. His formula predicts reading ease based on two text characteristics: average word length in number of syllables and average sentence length in number of words. 3 In the past century, readability formulas such as the Flesch formula have gained great popularity. As DuBay (2004:2) notes, by the 1980s more than 200 different readability formulas already existed. The field has exploded since computer technology made it possible to apply readability formulas in an automated way (Benjamin 2012:63). Most word processors now have one or two built-in formulas that can inform users about the difficulty of a text they are working on, and websites exist that allow visitors to enter a text and immediately receive, often at no cost, readability scores according to a number of different formulas. 4 Originally, most of the work on readability measurement was done with English-language material. After these early years, however, more and more language-specific formulas were developed (Klare 1974-1975), and with good reason: texts in different languages that deal with similar topics do not necessarily share the same characteristics, and the effects of these characteristics on readability may also be quite different. Rabin (1988:66-76) presents different formulas for Chinese, Danish, Dutch, French, German, Hebrew, Hindi, Korean, Russian, Spanish, Swedish and Vietnamese (see also Klare 1974-1975). Although Rabin (1988) does not mention the existence of readability formulas for Afrikaans, 5 two years before her publication two such formulas had already been presented by Van Rooyen (1986).
In the current article, the Van Rooyen formulas are discussed in some detail, in order to assess the strengths and weaknesses of the development process and the usability of the resulting formulas. We then turn to the only other two readability formulas for Afrikaans that we could identify. Both formulas were constructed more than 20 years after the pioneering work of Van Rooyen: one was developed by McDermid Heyns (2007), the other by McKellar (2008). Next, we present a study into the validity of the four existing readability formulas for Afrikaans. Should one or more of these formulas prove to produce accurate readability predictions, they would provide a good base for creating automated readability formulas that may effectively help writers to produce more comprehensible texts in Afrikaans.
1 According to Fry (2002:286), the first readability formula was published in 1923, when Lively and Pressey published their "method of measuring vocabulary burden of textbooks".
2 See Flesch (1948:221).
3 The exact Flesch formula reads as follows (see Flesch 1948:223-225; DuBay 2004:21-22): RE = 206.835 − (1.015 × ASL) − (84.6 × AWL), where RE (Reading Ease) is the predicted readability score, ASL is the average sentence length (the number of words divided by the number of sentences), and AWL is the average number of syllables per word (the number of syllables divided by the number of words).
4 For English, see, for instance, two online resources: an automatic readability checker (Readability formulas n.d.) and Juicy Studio's Readability test resource (Juicy Studio 2017). For Dutch, see the review in Jansen and Boersma (2013), as well as, for instance, the online Leesniveau Tool (Accessibility 2017).
5 In the same volume in which the article by Rabin appeared, Klare (1988:22) mentions the existence of formulas for Finnish and Afrikaans. However, no references are provided there other than to Rabin's overview, in which formulas for both languages are missing.
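The computation in footnote 3 is simple to carry out in code. The sketch below is purely illustrative and is not part of any study discussed here; in particular, the vowel-group syllable counter is a crude heuristic that is only a rough assumption for English and would require language-specific rules for a language such as Afrikaans.

```python
import re

def flesch_reading_ease(text, count_syllables):
    """Flesch Reading Ease: RE = 206.835 - 1.015*ASL - 84.6*AWL.

    `count_syllables` is a caller-supplied function, because syllable
    counting is language-specific.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / len(sentences)                 # words per sentence
    awl = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return 206.835 - 1.015 * asl - 84.6 * awl

def naive_syllables(word):
    # Rough heuristic (our assumption): one syllable per vowel group.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))
```

On a toy sentence of three monosyllabic words, ASL = 3 and AWL = 1, so the sketch yields 206.835 − 3.045 − 84.6 = 119.19, near the top of the Flesch scale.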
Our research question is the following: How accurately do the four existing readability formulas for Afrikaans predict actual text comprehension?

The Van Rooyen formulas
In a study that, according to the author, was the first of its kind in Afrikaans, Rien van Rooyen presented two readability formulas she had developed based on two different types of comprehension tests (Van Rooyen 1986:59). In her article, published in Afrikaans, Van Rooyen discusses the various steps required to arrive at a formula based on empirical data, and shows how she took these steps herself. Van Rooyen concludes that despite the differences in the linguistic variables used in the two formulas, the outcomes are highly comparable (1986:65). The article ends with detailed instructions on how to apply the formulas, including definitions of the linguistic characteristics that have to be counted: the numbers of syllables, sentences, concrete nouns, words, infinitive constructions, letters, words with three or more syllables, and pronouns.
Although we appreciate the work by Van Rooyen, some critical remarks should be made here, one being that it is hard to determine exactly which data her comprehension findings were based on (Van Rooyen 1986:60). Be that as it may, all participants in this study were learners aged about 14 to 18 years, and all texts were intended to be read by this target group. As Van Rooyen herself points out, the formulas may, by implication, be regarded as valid only for texts intended for learners of about 14 to 18 years (1986:65).
Next, a practical problem in the application of RE-CL should be mentioned. One of the variables in this formula is the number of pronouns per 100 words. According to Van Rooyen's definition of pronouns (translated here into English), "all pronouns should be included: personal pronouns such as ek (I), jy (you), ons (we), dit (it), and hulle (they); possessive pronouns such as watter (what/which), waarmee (with which) and waarop (on which); indicative pronouns such as daarmee (with that); and other pronouns such as iemand (someone), party (some), enigeen (anyone), daar (there) and iets (something)" (1986:67). From a grammatical point of view, part of this definition is confusing and incorrect. A minor problem is that watter (what/which) is not a possessive but an interrogative pronoun. More important, however, is that waarmee (with which), waarop (on which) and daarmee (with that) are not pronouns but pronominal adverbs. This makes it difficult for users of RE-CL to decide which words in a given text should be counted as pronouns, and consequently what figure should be entered in the formula for "number of pronouns per 100 words".
Finally, Van Rooyen does not provide any information about the relationship between the outcomes of her two comprehension tests. This leaves it unclear to what extent a text that is considered difficult according to the results from the multiple-choice tests would also qualify as difficult according to the results from the cloze tests, and vice versa. Furthermore, Van Rooyen justified her claim about the comparability of the outcomes of the two formulas (1986:65) with the finding that the linguistic variables in both formulas explain about 40% of the variance in the comprehension scores they predict. This similarity, however, is not relevant for the possible resemblance between the predictions from the two formulas. How comparable these predictions are can only transpire from their mutual correlation. The reader is not informed, however, about the relationship between the outcomes of the two Van Rooyen formulas.
For the most recent version, see Skryfgoed (2017).
11 We assume that NB is meant to be calculated as the number of pairs of brackets per 100 words, and NNS as the number of numbers and symbols per 100 words.
readability judges was .91; for the McDermid Heyns formula a correlation coefficient of .90 was found. The existence of the Van Rooyen formulas is not mentioned in this article.

The McDermid Heyns formula
Comparing the development of the four formulas discussed above, it is striking that it was only for the Van Rooyen formulas that actual text comprehension data were used. It appears that McDermid Heyns did not present his texts to a sample of readers from the target group at which the texts were aimed. McKellar based her readability scores on estimates from judges. From the information provided in her article it is unclear how often the judges did or did not agree on their assessments. More importantly, it is unclear to what extent these judges may be considered to be experts in this field; hence it is impossible to tell to what extent their assessments would correctly predict the levels of text difficulty for the intended readership.
Since, as far as we know, no independent evaluation studies have been carried out to test the validity of the four existing formulas for Afrikaans, we decided to perform such a study, and to use actual comprehension data from readers speaking Afrikaans as their first language.

Method
To evaluate the four existing formulas for Afrikaans, ten texts written in Afrikaans were selected. First, for each text all linguistic characteristics included as linguistic predictors in one or more of the four formulas were measured. After this, comprehension was measured for each of these texts, using cloze test scores from a total of 149 Afrikaans-speaking participants with varying levels of education. Finally, using both the outcomes of the four formulas and the cloze test scores from the participants, the accuracy with which each of the four formulas could predict cloze scores was determined. More detailed information regarding the method is presented in sections 3.1 to 3.5.

Texts
Ten texts of approximately 300 words each were used. Five texts were sourced from the Huisgenoot, a popular Afrikaans magazine, and five texts were sourced from official government communications in Afrikaans. All texts were retyped, in order to make them appear similar and to reduce possible interference with readability by, for instance, layout and font. For an overview of all ten texts, including their titles and numbers of words, see Appendix 1. 12

Participants
One hundred and fifty Afrikaans-speaking people living in the vicinity of Stellenbosch and Cape Town took part in this study. One participant completed only the very first parts of the cloze tests with which he was presented; hence his data were not used in the statistical analyses. The remaining 149 participants varied in gender (42 male, 45 female, 62 no information available); ethnicity (32 coloured, 32 white, 2 black, 83 no information available); and highest level of education (1 Grade 9, 4 Grade 11, 74 Grade 12, 5 certificate, 45 university degree, 18 diploma, 2 other). The mean age of the whole group of participants was 31.16 years. For more detailed information, see Table 1.
12 All materials are available from the authors upon request.

Measures: Formula scores
In order to determine the readability scores according to the four formulas, all relevant linguistic characteristics were calculated for each text, as was one additional characteristic: the number of characters per word (as per Jansen and Boersma 2013:59; see section 3.5).
13 We thank Gerhard van Huyssteen for providing us with this list.
Next, the scores found for these characteristics were entered in the respective formulas, and the resulting RE-scores were determined. 14 For the outcomes per text, see Appendix 1.

Measures: Cloze tests
In order to measure actual text comprehension, five cloze tests were constructed for each text. In a cloze test a number of words in a given text are deleted and replaced by blanks or dashes. All blanks or dashes have the same length. Participants are requested to substitute each blank or dash with the word they expect to have been there in the original text. The number of correct answers is expressed as a percentage of the total number of omitted words in the text. The average percentage for a sample of readers from the target group of the text is interpreted as a valid measure of the comprehensibility of the original text. 15 As Horton (1974-1975) states, both the construct validity and the concurrent validity of the cloze procedure have been established. The construct validity of the cloze test addresses the subject's "ability to deal with the linguistic structure of the language; it is related to the ability of the subject to deal with the relationships among words and ideas". The concurrent validity refers to "the variances shared among cloze tests, reading comprehension tests, reading gain tests, and verbal intelligence tests [which] probably [are] a measure of the reader's ability to deal with the relationships among words and ideas". The construct validity of the cloze test may be explained as follows: reading may be viewed as a "psycholinguistic guessing game", "a selective process [involving] partial use of minimal available language cues selected from perceptual input on the basis of the reader's expectation" (Goodman 1967:126-127; see also Jansen and Boersma 2013:51-52). Readers scan a page and pick up graphic cues guided by their prior choices, language knowledge, cognitive styles and the strategies they have learned. They try to relate the resulting perceptual image to syntactic, semantic and phonological cues stored in memory and make a tentative choice about a possible interpretation of the perceptual input.
Then they test their choice for grammatical and semantic acceptability in the context developed by earlier interpretation choices. If the tentative choice turns out to be unacceptable, the process starts anew; if the choice proves to be acceptable, reading continues and expectations are formed about what lies ahead (Goodman 1967:127, 134-135; see also Alderson 2000:17, 19).
What participants in a cloze test are asked to do strongly resembles what is requested from readers of the text on which the cloze test is based. To be able to guess correctly which words were deleted from such a text, participants in a cloze test must possess the same types of skills and knowledge as are required from readers. Both cloze test participants and readers must make use of their language knowledge, cognitive styles and strategies, and of syntactic, semantic and phonological cues stored in memory. They must make tentative choices about the words that may have been deleted, and they must test these choices for grammatical and semantic acceptability in the context of the test or the text.
14 In cases where there was doubt, for instance when counting the number of pronouns per 100 words for the Van Rooyen RE-CL formula (see section 2.1), the definition provided by the developer of the formula was followed as closely as possible. As a close approximation of the average number of letters per sentence (Slet), as in Van Rooyen (1986:67), the average number of characters per sentence excluding spaces was determined.
Jones (1997) provides a critical appraisal of the use of cloze tests to measure readers' understanding of texts used in the field of accounting. His most important argument goes against the construct validity of the cloze test. In his view, "the skills necessary to infer missing words from accounting narratives may be very different from the skills necessary to comprehend accounting text" (Jones 1997:118). In his appraisal, however, Jones does not pay attention to the similarities, already mentioned, between a reader's tasks and what is asked of participants in a cloze test.
These similarities perfectly explain the high level of concurrent validity that follows from the strong correlations found between cloze test scores and scores in traditional reading comprehension tests, as reported in Bormuth (1967;1968), Kamalski (2007) and Gellert and Elbro (2013), for instance.
The cloze test can be administered in different ways. One of the choices to be made is between the fixed-ratio method and the rational fill-in method. The fixed-ratio method requires that every n-th word be replaced by a blank or a dash (for instance the first, sixth, eleventh word, etc.). When applying the rational fill-in method, the researchers themselves decide which words will be left out, for instance only verbs, nouns, adjectives, technical words and/or prepositions (Kobayashi 2002:573; O'Toole and King 2011:129). Already in 1953, the founding father of the cloze test, Wilson Taylor, advocated the fixed-ratio method (Taylor 1953:419-420). Taylor's plea is supported by the results reported in Bachman (1985), who compared the cloze scores of a total of 910 students. He reports that the fixed-ratio method and the rational fill-in method led to comparable outcomes, and that scores collected with both methods strongly correlated with other comprehension measures (1985:544, 546).
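The fixed-ratio deletion procedure just described is mechanical enough to sketch in a few lines of code. The sketch below is ours and purely illustrative; the function name, the uniform blank convention and the whitespace word-splitting are assumptions, not details taken from any of the studies cited.

```python
def make_cloze(words, n=5, start=0):
    """Fixed-ratio cloze: blank out every n-th word, beginning at `start`.

    Returns the test text (with blanks of uniform length, as the cloze
    procedure requires) together with the answer key of deleted words.
    """
    BLANK = "_____"                      # all blanks share the same length
    test, answers = [], []
    for i, word in enumerate(words):
        if i >= start and (i - start) % n == 0:
            test.append(BLANK)
            answers.append(word)
        else:
            test.append(word)
    return " ".join(test), answers
```

Varying `start` from 0 to n−1 yields the n different versions of the test, so that every word of the passage is deleted in exactly one version.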
A disadvantage of the fixed-ratio method, however, may be that, depending on the number of the starting word, different cloze test versions may produce different outcomes: not all words are equally redundant, hence some blanks may be more difficult to fill than others (Bachman 1985:537-538; Kobayashi 2002:581-583), and the meaning of a score depends on which items are chosen from a text (Abraham and Chapelle 1992:474). An easy solution to this problem is to create different cloze versions, as suggested by Bormuth (1967:2) and Jongsma (1971:26): one version starting with deleting the first word, a second version starting with deleting the second word, and so on. As O'Toole and King (2010:314) conclude from their study into the impact of the location of the first deleted word: "it would be wise for teachers and material developers to exhaustively sample the text of the passages whose more general readability they wish to estimate. The generation of five cloze tests from a passage is relatively easy and exact scoring would enable a much clearer picture of the ease or difficulty of the passage to emerge."
Another decision that has to be made when administering a cloze test concerns the way in which the participants' answers are scored. The first possibility is exact scoring: a word filled in at a given dash is only considered correct if it matches the word that was deleted from the original text, leaving minor differences in spelling out of account. The second option is to apply conceptual scoring. Following this approach, an answer is also correct if it may be considered a synonym of the word in the original text. At first sight, conceptual scoring may seem preferable: filling in a synonym seems like a fair indication of text comprehension. Several studies, however, found strong relationships between the outcomes of the two ways of scoring (see, for instance, McKenna 1976:142; Litz and Smith 2006:55, 68; O'Toole and King 2011:135-139).
Taylor (1953:432) was already in favour of exact scoring, as it helps to avoid "the problem of coder reliability that so plagues content analysis". O'Toole and King (2011:140) also conclude that for measuring the comprehensibility of different texts, exact scoring is more appropriate than conceptual scoring; it may be regarded as "easier, less subjective, sufficiently reliable [and] highly correlated with conceptual scoring" and "[exact scoring] does not advantage particular sections of the reading population". Having compared 24 cloze test studies, Watanabe and Koyama (2008) conclude that conceptual scoring leads to more reliable outcomes than exact scoring. It should be noted, however, that their meta-analysis only covers studies with cloze tests taken in a language other than the participants' mother tongue. It is hard to predict what the results would have been if outcomes from studies with cloze tests in the participants' first language had been analysed. Owing to the larger vocabulary of such participants, their answers may have varied to a greater extent than was the case in the studies that Watanabe and Koyama (2008) refer to, and so might the outcomes of the conceptual scoring method.
In view of these considerations, we decided to follow the exact scoring approach and to apply the fixed-ratio method. The exact scoring approach serves the purposes of both reliability and speed of scoring. The fixed-ratio method does not require any subjective decisions about the words that should or should not be deleted. To prevent chance decisions about the starting word from influencing the test results, five different cloze test versions were developed for each text. In all versions, every fifth word was deleted: in the first version starting with the first word, in the second version starting with the second word, etc. In this way we ensured that each word in every text would play a role in the cloze scores we collected, in the same way that each word in the original texts plays a role for readers of these texts.
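Exact scoring, as adopted here, reduces to a straightforward comparison of each response with the deleted word. The following sketch is illustrative only; treating case differences (after trimming whitespace) as the "minor differences in spelling" to be ignored is our assumption, not a rule taken from the cloze literature.

```python
def exact_score(responses, answers):
    """Exact cloze scoring: percentage of responses matching the deleted
    words, ignoring surrounding whitespace and letter case."""
    hits = sum(
        response.strip().lower() == answer.strip().lower()
        for response, answer in zip(responses, answers)
    )
    return 100.0 * hits / len(answers)
```

For example, two correct responses out of three deleted words score 66.7%, regardless of how close in meaning the third response may be; under conceptual scoring a synonym would also have counted.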

Procedure
The participants in this study were approached by colleagues and students from Stellenbosch University. All participants completed a consent form, which included information about the purpose, procedures, risks and benefits of the study, and reminded them of the confidentiality of their responses and their rights as participants. 16 After this, each participant was presented with a pack of two cloze tests: one emanating from one of the Huisgenoot texts, the other derived from one of the government texts. As a result, 149 × 2 = 298 cloze scores could be calculated for 5 × 10 = 50 cloze test versions, amounting to on average 298/10 = 29.8 cloze scores per text. After this, each completed cloze test was scored. Following the definitions from the developers of the four formulas as carefully as possible, the scores were determined for all linguistic characteristics included in the formulas. Finally, statistical analyses were performed relating the cloze scores to the outcomes for the linguistic characteristics and the predictions from the formulas. One extra characteristic was added: the number of characters per word. In a comparable study carried out in the Netherlands into the validity of Dutch readability formulas, it was found that this characteristic strongly correlated with cloze scores (Jansen and Boersma 2013:59). In view of this outcome, we decided to measure the correlation between this characteristic and the cloze scores we collected.
Table 3 presents the outcomes of the correlation analyses, with the cloze scores on the one hand and the text characteristics included in the four formulas on the other.

As Table 3 shows, significant correlations with average cloze score were found only for number of pronouns per 100 words (r=.67; p=.03) and average number of syllables per word (r=−.72; p=.02). These outcomes may seem to contrast with the correlations found between the cloze scores and the Van Rooyen-CL formula (r=.34), the McDermid Heyns formula (r=.61) and the McKellar formula (r=.64). This contrast may, however, be explained by the limited number of texts involved in determining these correlations. Entering both significant predictors in a regression analysis with cloze score as the dependent variable revealed that a formula including these two predictors would not lead to predictions significantly correlating with cloze scores. The correlation between average cloze score and the number of characters per word, however, proved to be strong and significant: r=−.81; p=.005. In other words, 100 × (−0.81)² = 65.61% of the variance in the cloze scores could be predicted from just the average number of characters per word.
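The step from a correlation coefficient to a percentage of variance explained is simply squaring r; the sign of r is irrelevant. Purely as an illustration, and not as part of the analyses reported here, this can be checked with a from-scratch Pearson correlation:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Variance explained follows directly from r, regardless of its sign:
r = -0.81
print(round(100 * r ** 2, 2))   # prints 65.61
```

A negative r (longer words, lower cloze scores) thus explains exactly as much variance as a positive r of the same magnitude.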
Regression analyses revealed that adding any other text characteristic to this predictor would not lead to stronger correlations with average cloze score.

Conclusions
At present, four readability formulas for Afrikaans exist: two formulas developed by Van Rooyen (1986), one formula developed by McDermid Heyns (2007) and one formula developed by McKellar (2008). Publications on the development of the existing formulas revealed a number of shortcomings that might help to explain the relatively low level of accuracy of the predictions. Van Rooyen used data only from participants in Grade 8 to Grade 12, and all texts she used were intended for this target group. As a consequence, her two formulas may be valid only for readers with comprehension levels comparable to those of learners aged about 14 to 18 years. In our study, however, text comprehension data were collected from a different and wider group of speakers of Afrikaans.
A remarkable result we found pertaining to the two Van Rooyen formulas was the near absence of correlation between the predictions of these two formulas. Van Rooyen (1986) does not mention in her study how the predictions from her two formulas correlate. In future studies it would be interesting to see how the outcomes of the Van Rooyen formulas relate to each other, and also to the outcomes of the McDermid Heyns formula and the McKellar formula.
In the introduction of this article, we alluded to the possibility that, in order to help writers to produce better texts in Afrikaans, automated readability formulas could be created. Should one decide to try to create automated readability formulas for Afrikaans, then a natural option would be to start from existing formulas. Of course, using these formulas for automated applications would only be useful if they prove to produce accurate readability predictions. From this study we cannot conclude that starting from existing formulas for Afrikaans would be a viable option.
A more favourable possibility for the development of automated readability formulas would be to use actual text comprehension data such as those collected in this study. We realise, however, that comprehension data from more texts and more text types would be needed. Therefore, we are in the process of collecting cloze test scores from a total of 225 new participants in respect of 15 new texts in three new text types (medical pamphlets, daily newspapers and insurance brochures). The combined set of data from the present study and the forthcoming study can be related to text characteristics measured by means of advanced language technology. 18 We trust that this will lead to the creation of readability formulas that will outperform the existing formulas for Afrikaans. 19 An important proviso here is that, in the presentation of such new formulas for Afrikaans, the objections against readability formulas raised in the literature should be adequately addressed. DuBay (2004:28-29), for instance, points to the role that prior knowledge and the motivation of individual readers play in text comprehension. In readability formulas, however, such individual reader characteristics are not taken into account (Schriver 2000:138-139; Jansen and Lentz 2008:7). Dreyer (1984:335) and Duffelmeyer (1985:393) bring up the unsubstantiated high level of precision suggested by the decimal figures in many formulas. Schriver (2000:139-140) remarks that authors may be tempted to "write to the formulas" by simply adding more periods in order to make the sentences shorter. Such authors are misled by the suggestion that a correlational relationship, in this case between sentence length and text comprehensibility, would be equal to a causal relationship. Such text characteristics are indices and not causes of text difficulty (see, for example, Singer 1988:vii).

18
Such language technology is presently available for Afrikaans at the Centre for Text Technology (CTexT) at the Potchefstroom Campus of the North-West University (North-West University 2017).

19
A comparable project in the Netherlands is LIN (LeesbaarheidsIndex voor het Nederlands) (NWO 2017).
Despite obvious shortcomings, a possible benefit of using readability formulas is that they remind writers who are not trained as professional authors to be aware of readability issues. 20 Being confronted with unfavourable readability predictions may be an important first step toward creating texts that readers understand and appreciate: forewarned is forearmed (see also Schriver 2000:140). Automated readability formulas grounded in advanced research and presented with sufficient caution may be a real asset for users of Afrikaans.