Statistics about the Voynich Manuscript

Author

Methods & Findings

BIRD (1921)

MOSES (1921)

GARLAND (1921)

CONS (1922)

CONS (1928)

CARTON (1929)

Father Petersen (1883-1966)

Transcription and concordance of the entire manuscript [1]

"There is, unfortunately, nowhere in the material available to me any report of theories Petersen may have held, or conclusions he may have reached concerning the decipherment of the manuscript." [2]

W. F. Friedman's 1st Study Group (FSG, FSG1, 26.05.1944 - 1946)

Transcription of 48'000 characters of text (using a transcription scheme of Roman letters) (based on Father Petersen's photocopies) [3] in order to be processed on IBM punched-card accounting equipment [4]

"Elizabeth Friedman presents the following perspective of the outcome of the First Voynich manuscript Study Group: '(...) demobilization was practically complete before the manuscript was ready for final study. (...)' (1962)" [5]

W. F. Friedman's 2nd Study Group (SSG, FSG2, 25.12.1962 - 1963)

Transcription of 66'000 characters of text in order to be computed by facilities of Radio Corporation of America (RCA) [6]

"(...); higher management at RCA decided to terminate even the minimal 'extracurricular' involvement of their resources, and the group was forced to disband before any definitive results could be obtained." [7]

ASHBROOK (1966)

TILTMAN (1967)

Statistical analysis of the VMS (Stars Section, TILTMAN Transcription) "disregarding all but the most commonly occurring symbols" [8]

Proposal of a template consisting of prefixes (e.g. k-, t-, f-, p-, ok-, ot-, of-, op-, ch-, c+h-, d-, s-), roots (e.g. ok, op, ot, of, d, s) and suffixes (e.g. -or, -ol, -om, -ar or -aiin)

Some suffixes are frequently doubled (e.g. arar, araral).

Some suffix combinations hardly ever occur (e.g. al-ar or am-ar).

Some symbols are used both as suffixes and as prefixes (e.g. or, ol).

Some words occur twice or even three times running (e.g. okeey, ofeey, qokeedy, qokeedy, okaiin, ofaiin, ofar, okar, ofal, ofam, okal, okam, daiin, chedy)

- Paragraphs nearly always begin with f p k t.

- Some symbols appear most commonly at the end of a line, rarely elsewhere (e.g. l or m).

- No substitution cipher or any transposition system.

- No punctuation.

- Theory that the VMS script is plaintext of an early form of synthetic language.

Roots	ok-, of-, ot-, op-; qok-, qof-, qot-, qop-; ch-, c"h-, d-, s-
Suffixes	-an, -ain, -aiin, -aiiin; -ar, -air, -aiir, -aiiir; -al, -ail, -aiil, -aiiil; -or, -ol; -ey, -eey, -eeey; -edy, -eedy, -eeedy

KRISCHER (1969)

Transcription of 5'500 characters of the VMS (CURRIER Transcription) for statistical analysis on the Digital Equipment Corporation PDP-1 computer [9]

- Frequency counts of characters and words.

- "I cannot find any record of any subsequent results, however. This [...] computer project [...] seems to have been terminated [...] before it came close to achieving any useful results." [10]

ZIMANSKI (1970)

CURRIER (1976)

Statistical analysis (frequency of words and characters) and visual examination of the VMS

- Computer aided frequency analysis of initial, middle and ending letters of words in a line (based on a sample of 15'000 words)

- Computer aided frequency analysis of words in line breaks (based on a sample of 25'000 words of the Herbal and the Balneological Section) [11]

- The occurrence of certain characters is governed by the preceding character.

preceding character	following characters
a	m l r n (letters starting with a stroke like i)
a, o	n, in, iin, iiin; l, il, iil, iiil; r, ir, iir, iiir; m, im, iim, iiim
e	o d y s (letters starting with a stroke like e)

- Letters t and k are often followed by e (50% in 1'500 occurrences).

- Letters p and f are never followed by e.

- Some word initials are often followed by specific finals.

word initial	word final
ch- Sh-	-n -an -ain -aiin -aiiin; -l -al -ail -ol; -r -ar -air -or

- Some word initials often occur after specific finals of the preceding word.

preceding word final	following word initial
-y -dy -ody	qo-
-n -an -ain -aiin -aiiin; -l -al -ail -ol	ch- Sh-

- l-, r-, n-finals are rarely followed by l- or r-initials.

- Ligatures cTh, cKh, cPh, cFh never occur as line initials.

- Letter k could be a Latin abbreviation for TINUS.

- Letter d could be Etruscan, Lydian or Lemnos with the value F.

- Letter y could be a Latin abbreviation for CON-, CUM- or -US.

- Identification of at least four different hands and two different languages (A and B)

- First half of the herbal section is in one hand and one language (A)

- Second half of the herbal section is in two hands and two languages (A and B)

- Balneological section is in one hand and language (B)

- Pharmaceutical and Recipe sections are in two hands and languages (A and B)

- The line is a functional entity. Some characters may not occur initially in a line.

- There is almost no repeating of words going over a line break.

- The occurrence of certain symbols is governed by the position of a "word" in a line.

Frequency	VMS	VMS A	VMS B
Very high		line initials: chot cTh	chol, chor word final: dy
High	line initials: t k p f word initials: cht chot Sht cShot word finals: y dy ody n an ain aiin aiiin l al ail ol r ar air or	t k p f cTh cKh cPh cFh chot
Medium (i.e. more frequently than expected)	paragraph initials: p f line initials: dch ych	chain, chaiin
Low (i.e. less frequently than expected)	line initials: ch Sh	word final: dy	t k p f cTh cKh cPh cFh chot
Very low		chol chor	chain, chaiin line initials: chot cTh

- No further insight into the construction of the writing system of the two languages A and B. Impossible to categorize or to classify grammatically any word classes or to discern any syntactic pattern. No recognizable use of words for numerical values.
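Currier's observation that the occurrence of a character is governed by the preceding character can be checked by counting successors. The following is a minimal sketch, assuming a transliteration in which every glyph is a single character (multi-character glyphs such as ch would need a tokenizer first). The function name `successor_shares` and the sample words are illustrative only and are not manuscript data.

```python
from collections import Counter, defaultdict


def successor_shares(words):
    """For each character, return the share of each character that follows it inside a word."""
    follows = defaultdict(Counter)
    for word in words:
        for left, right in zip(word, word[1:]):
            follows[left][right] += 1
    shares = {}
    for left, counter in follows.items():
        total = sum(counter.values())
        shares[left] = {right: n / total for right, n in counter.items()}
    return shares


# Toy sample for illustration, not real manuscript data
sample = "okeey otedy qokeedy okaiin daiin otar pol".split()
shares = successor_shares(sample)
print("share of e after k:", shares["k"].get("e", 0.0))
print("share of e after p:", shares["p"].get("e", 0.0))
```

Run on a full transliteration, the same table would show whether t and k are followed by e in roughly half of their occurrences and whether p and f are ever followed by e.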

BENNETT (1976)

Statistical analysis (second-order entropy) [12] of the VMS compared to Hawaiian in a version with limited orthography and phonemic [13]

- Second-order entropy of characters is lower than in most European languages [14] (i.e. the VMS is more repetitive) [15]

- Hawaiian and the VMS have similar second-order entropies. Only some Polynesian languages have a second-order entropy as low as that of the VMS text. Typical ciphers do not have a low second-order entropy either. [16]
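As an illustration of the measure, the sketch below computes a single-character entropy and a second-order entropy for a character string. It assumes that "second-order entropy" means the conditional entropy of a character given its predecessor, H(pair) - H(first character); whether this matches the exact definition used in BENNETT (1976) is an assumption. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a Counter of observations."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def char_entropies(text):
    """Return (h1, h2): single-character entropy and conditional entropy H(next | current)."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])  # characters that have a successor
    h1 = entropy(Counter(text))
    h2 = entropy(pairs) - entropy(firsts)
    return h1, h2


# A strictly alternating string is fully predictable from the previous character
h1, h2 = char_entropies("abababab")
print(h1, h2)
```

A low h2 relative to h1 means that the next character is largely determined by the current one, which is the sense in which the VMS text is described as more repetitive. The input must contain at least two characters.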

D'IMPERIO (1978)

summary of statistical analysis

D'IMPERIO (1978a)

Cluster analysis and multidimensional scaling of the VMS (first 350 to 400 characters from each of forty different pages from the herbal and balneological sections)

Findings indicate that Herbal B and Balneological B are more similar to each other than to Herbal A, but that they are still distinct from each other. The folios mostly written in Balneological B are very similar to each other, and the same is true for Herbal B. Folios mostly written in Herbal A are not strongly similar to each other.

PEP-1 Graph Theoretic Algorithm:

- folios ... (Currier page number: 094, 001, 076, 041, 072, 110, 109, 005, 032, 045, 060, 068, 081, 039, 082, 057, 015, 095, 096) are similar (Herbal A).

- folios ... (Currier page number: 075, 065, 066, 083, 084, 089, 090, 079) are similar (Herbal B).

- folios ... (Currier page number: 147, 156, 149, 151, 148, 150, 152, 153, 154, 155, 157, 158, 059) are similar (Balneological B).

- Balneological B is more similar to Herbal B than to Herbal A.

- folios ... (Currier page number: 094, 076, 060) of the section written mostly in Herbal B are more similar to folios in Herbal A.

- folio ... (Currier page number: 059) of the section mostly written in Herbal B is more similar to the folios in Balneological B.

Ling's (K,R) Algorithm:

- Language A: folios mostly written in Herbal A and folios ... (Currier page number: 094, 076, 060).

- Language B: folios mostly written in Balneological B and Herbal B and folio ... (Currier page number: 059).

HICLUS Agglomerative Cluster Analysis:

- folios mostly written in Herbal A (including folios ... (Currier page number: 076, 060)) are loosely conglomerated

- folios mostly written in Balneological B (including folios ... (Currier page number: 059, 079)) are very similar

- folios mostly written in Herbal B are very similar

- folio ... (Currier page number: 094) is very distinct from the other folios

TAXMAP-2 Clustering program

- three clusters (Herbal A, Herbal B or Balneological B) and many scattered folios

MINISSA Multidimensional Scaling Program

- folios mostly written in Herbal A (including folios ... (Currier page number: 001, 041, 072)) are scattered at one quarter of the two vectors

- folios mostly written in Herbal B (including folio ... (Currier page number: 076)) are scattered around one axis of the second vector

- folios mostly written in Balneological B (including folios ... (Currier page number: 059, 060, 079)) are scattered around the end of the axis of the second vector

- folio ... (Currier page number: 094) is alone at one end of the first vector, very distinct from the other folios
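The programs named above (PEP-1, Ling's (K,R), HICLUS, TAXMAP-2, MINISSA) are not reproduced here. As a generic illustration of agglomerative clustering of pages, the sketch below groups pages by the Euclidean distance between their character-frequency profiles, using single linkage. The choice of features, distance and linkage, the function names and the four toy pages are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def profile(text, alphabet):
    """Relative frequency of each alphabet character in a page."""
    counts = Counter(c for c in text if c in alphabet)
    total = sum(counts.values()) or 1
    return [counts[c] / total for c in alphabet]


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(pages, k):
    """Merge the two closest clusters until only k clusters remain."""
    alphabet = sorted({c for text in pages.values() for c in text if not c.isspace()})
    profiles = {name: profile(text, alphabet) for name, text in pages.items()}
    clusters = [{name} for name in pages]

    def gap(pair):
        a, b = pair
        return min(distance(profiles[x], profiles[y]) for x in a for y in b)

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=gap)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters


# Invented toy pages, not real folio text
pages = {
    "p1": "daiin chol chor daiin shol",
    "p2": "chol daiin chor shor daiin",
    "p3": "qokeedy chedy shedy qokedy",
    "p4": "shedy qokeedy chedy otedy",
}
result = single_linkage(pages, 2)
print(result)
```

With real data each entry of `pages` would hold the first few hundred characters of one folio, as in the sampling described above.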

D'IMPERIO (1979)

Statistical analysis (frequency of words, word substrings and characters, Hidden Markov Model (HMM) called "PTAH") of the VMS (samples of text from Biological B, CURRIER Transcription) to discover and demonstrate regularities of patterning

- The plain text directly underlying the Voynich text is probably not a natural language represented by an alphabet in single letters, like English or Latin.

- The Voynich text probably does not involve any form of simple substitution or alphabetic plain text like English or Latin.

- short plaintext word parts may underlie the Voynich script.

- it is more similar to a script like Linear B (a language involving consonant clusters and closed syllables, written down in a syllabary designed for a language having only open (CV) syllables)

- The Voynich text probably does not represent a natural language written in an "impressionistic" way (to recall a statement by Dr. Robert Brumbaugh, who claims to have deciphered it as a misspelled, distorted form of Latin).

- nor can its characteristics be explained by hypothesizing many variant spellings of the same words in an alphabetic writing system (cf. older forms of English). Its structure seems far too ruly and regular to accord with these views.

- Rather than a distorted or degraded form of English or Latin monographic structure, it seems to exhibit a DIFFERENT structure of its own.

- The words of the Voynich text do not appear to act like code groups in a known code which includes groups for grammatical endings

- "[...] the number of unique word parts [...] is very close to that required if Voynich text elements were assumed to stand for plaintext consonant-vowel syllables after the fashion of a syllabary." [17]

Letter Analysis

letter state	probability > 0.01	probability < 0.01
beginners	o a f
post-beginners	n l m n r q	k in t s p
middles		e c"h ch cFh
pre-enders		d cKh cTh
enders	iim iir iiir y	iin ir

Word Analysis

word state (probability > 0.6)

line initial

final

a / o

final

c"h / ch + y

final

k / t + dy

c"hcKhy

dain

dc"hedy

dchedy

lol

pchedy

qokchdy

sain

sal

sor

tchedy

kain

okain

olkain

olor

otaiin

qokain

qokar

aiin

c"hcThy

c"hecKhy

c"hedy

c"heedy

c"heey

c"heky

C"hey

chcThy

checKhy

cheeky

cheey

cheky

olchey

lc"hedy

okedy

olc"hedy

olkedy

olkeedy

otedy

otey

qokedy

qokeedy

qotal

qotedy

Word Part Analysis

word part state	probability > 0.005
beginners	so- qo- qol- p- l- o- od- ol- or- r- c"hy-
post-beginners	-sc"h -q -qolk -qok -qot -d -dch -dc"h -yk -yc"h -pch -pc"h -lk -lch -lC"H -k -olk -ok -ot -t -tch -cTh -ch -chcTh -chcKh -fch -cKh -c"h
middles	-e -ee -eee -a
enders	-aiin -am -an -ar -dy -iiin -iin -y

GUY (1991)

GUY (1991a) [18]

Statistical analysis (vowel and consonant identifying algorithm) of the VMS (two pages of the Balneological section ('language B') based on illustrations in BENNETT (1976)) [19]

- similar word structure to many language families of central and east Asia (Sino-Tibetan and Tai language family). [20]

- Characters o a e il (Currier transcription: O A C G) separate out as vowels.[21]

- Character ee (Bennett transcription: CC) as a vowel. [22]

- Character ch (Bennett transcription: CT) as a consonant. [23]

- Rebuts speculation in BARLOW (1986) that Voynich faked it. [24]
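One well-known vowel-identification procedure is Sukhotin's algorithm, which is named in the STALLINGS (1998) entry. The sketch below is a common textbook formulation of it; whether GUY (1991a) applied exactly this variant is an assumption. The idea is that vowels tend to sit next to consonants, so the character with the largest adjacency sum is declared a vowel and its adjacencies are then discounted from the remaining characters.

```python
from collections import defaultdict


def sukhotin_vowels(words):
    """Return the characters classified as vowels, in the order they are found."""
    # Symmetric adjacency counts; pairs of identical characters are ignored
    adj = defaultdict(lambda: defaultdict(int))
    for word in words:
        for a, b in zip(word, word[1:]):
            if a != b:
                adj[a][b] += 1
                adj[b][a] += 1
    # Every character starts as a consonant with its row sum
    sums = {c: sum(row.values()) for c, row in adj.items()}
    vowels = []
    while sums:
        best = max(sums, key=sums.get)
        if sums[best] <= 0:
            break
        vowels.append(best)
        del sums[best]
        # Discount adjacencies to the new vowel from the remaining consonants
        for c in sums:
            sums[c] -= 2 * adj[c][best]
    return vowels


# Toy input for illustration only
print(sukhotin_vowels(["banana", "panama"]))
```

On very small samples the procedure can misclassify characters, so results from only two pages of text should be read with corresponding caution.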

REEDS (1994)

REEDS (1995)

GUY (1997)

STOLFI (1997)

- Proposal of a crust-mantle-core system

crust	d y r s n i m g x
core	t p f k cTh cKh cPh cFh
mantle	ch Sh ee

LANDINI & ZANDBERGEN (1998)

Survey of the problem, discussion of statistics of the EVA transcription, Zipf's law, etc. [25]

STALLINGS (1998)

Statistical analysis (absolute and conditional first- and second-order entropy of characters and words) of the VMS (Currier, FSG, EVA and Frogguy Transcriptions) compared to various kinds of text in different natural languages (Latin, Hawaiian, English, Japanese) and even samples of schizophrenic writings

- the consonant-vowel-alternation of VMS text (as determined by the Sukhotin vowel recognition algorithm) is not as strict as in Hawaiian or Japanese

- single character and conditional single character entropies of the VMS are lower than in natural languages

- VMS text clearly does not owe its low second-order entropy measures solely to a repetitious underlying text. The second-order entropies of natural-language texts with a repetitious format or subject matter differ from that of the VMS text.

- "[...] the nature of [schizophrenic] text itself would not by itself explain the puzzling nature of VMS text."

- Verbose ciphers (e.g. Cat Latin Verbose C), which substitute several ciphertext characters for one plaintext character, can produce the entropy profile of VMS text

- Magical spells in medieval High German are repetitious

LANDINI (1998)

CASANOVA (1999)

GROSSMAN (1999)

LANDINI (2001)

Statistical analysis (word and character autocorrelation measurement) [26]

- VMS words tend to be relatively short compared to other natural languages [27]

- Script follows Zipf's law of word lengths i.e. there is an inverse relationship between the frequency and length of a word. [28]

SUZEK & HAJIC (2001)

Statistical analysis (unigram, bigram, trigram and fourgram entropy and frequency of letters and words, Zipf's law, Shannon index, Position Weight Matrices (PWM), Hidden Markov Model (HMM), character based language modelling, word boundaries) of the VMS (FSG1 Transcription) compared to 8 other languages (English, Latin, Turkish, Slovene, Romanian, Hungarian, Estonian, Czech)

- spaces between words are consistent through the text like in other natural languages

- unigram word entropy of VMS is like any other natural language (10 bits)

- unigram letter entropies of the VMS show no resemblance to other natural languages

- bigram word entropies are similar to any other natural languages

- Sh and q (S and 4 in FSG1 Transcription) could be prefixes because of their high probabilities of occurrence at the very first position of the words

- y and iin (G and M in FSG1 Transcription) could be suffixes as can be seen from the reverse matrices

MORNINGSTAR (2001)

Gallows are not null symbols [29]

ITO (2002)

Statistical analysis (left and right entropy distribution of words in word-word- and word-character context (k=2 characters)) of the VMS (TAKAHASHI Transcription)

In the figures only words with frequencies 50+ are plotted.

- in word-word context for most words the right entropy is as large as left entropy (i.e. most words in Voynich script are context independent)

- Voynich words could be a kind of phrase in themselves

- in word-character-context (k=2 characters) most words have a larger right entropy (i.e. the variation of the following prefixes tend to be larger than the variation of the preceding suffixes, i.e. most words meet different words on its left side each time they appear)

- in word-character-context (k=2) qo-prefixed words show a larger left entropy (i.e. the variation of the following prefixes is larger than the variation of the preceding suffixes)

- qokchy often follows after dy-suffixed words and precedes qo-prefixed words.

- in word-character-context (k=2) Sh-prefixed words show a larger right entropy (i.e. the variation of the following prefixes is smaller than the variation of the preceding suffixes)

- Sheody often follows after dy-suffixed words.

Left and right entropy distribution (HL, HR) of word-word context:
HL [bit]	HR [bit]	English	VMS
4-8	4-8	- Most words
6-8	6-8		- Most words
9	9	- and, the	- daiin
8	8	- of, to, a, was, his, you	- chedy, aiin, ol
6	4	- verbs, nouns - units
3	6	- am, 's - Capital letters
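The left and right entropies in the table can be illustrated with a short sketch. It assumes that the left entropy HL of a word is the entropy of the distribution of words immediately preceding it and the right entropy HR that of the words immediately following it; line and paragraph breaks are ignored, and the function name and toy token list are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_entropies(tokens):
    """Return {word: (left_entropy, right_entropy)} in bits for a token sequence."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        left[cur][prev] += 1
        right[prev][cur] += 1

    def h(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(n / total * math.log2(n / total) for n in counter.values())

    return {w: (h(left[w]), h(right[w])) for w in set(tokens)}


# Toy token sequence, not real manuscript text
tokens = "daiin chol daiin chor daiin shol".split()
hl, hr = context_entropies(tokens)["daiin"]
print(hl, hr)
```

A word with similar HL and HR values is, in the sense used above, context independent; a marked difference indicates that the word constrains one side of its context more than the other.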

ITO (2002a)

Statistical analysis (left and right entropy distribution of words (and also o-/qo-separated words) in line and paragraph break context) of the VMS (TAKAHASHI Transcription)

In the figures only words with frequencies 50+ are plotted.

- Most words show no tendency to appear more likely at the beginning or the end of a paragraph.

- Most high frequency words tend to have a low left entropy (i.e. they tend to occur at the beginning of a line) (i.e. in the words of Currier "the line is a functional entity").

- Some o-/qo-separated words have a very low left and right entropy (tam, kam, ly) in line break context (i.e. they show only a few kinds of words in their left and right context).

- o-/qo-separated words beginning with k, o, t have a higher right entropy (such words are defined as "red words") and tend to appear more often on folios in CURRIER B than on folios in CURRIER A (i.e. this corroborates Currier's hands or languages).

- o-/qo-separated words beginning with Sh, ch, lSh (Sheedy, cheedy, lShedy, am, dam) have a higher left entropy (such words are defined as "green words") and show no tendency to appear more likely in one of the languages CURRIER A or B

Tendency of occurring	Normal VMS words	o-/qo-separated VMS words
Beginning of paragraph		words beginning with k, o, t
Beginning of line	saiin, sain, sol, som, sar, Sho, daiin	words beginning with k, o, t
End of paragraph		o, qo, daiin, Shedy, Sheedy, lshedy, cheedy
End of line	dy, dam, am	words beginning with Sh, ch, lSh (Sheedy, cheedy, lShedy, am, dam)

ITO (2003)

Statistical analysis (left and right entropy distribution of o-/qo-separated words in line and paragraph break context) of the VMS (TAKAHASHI Transcription)

- "red words" are defined as o-/qo-separated words ("word stems") which show a left entropy of less than 4.0bit in line break context (i.e. "red words" are defined as words beginning with k, o, t and tending to appear more often on folios in Currier B than on folios in Currier A)

- as most of these o-/qo-separated words ("word stems") show a relatively high frequency in the text corpus it is very likely that o and qo are prefixes

- the initial character of o-/qo-separated words ("word stems") tends to depend on their left and right entropy:
left entropy distribution	high frequency initial character
small (<4.0bit, "red words", high frequency on folios in CURRIER B)	e, c, g l, m, r, iiin, iin, in p, f,t, k
high (≥4.0bit, "green words", same frequency on folios in CURRIER A and B)	a,y, d, o, s, Sh, ch, cTh, q

- the final character of o-/qo-separated words ("word stems") shows no tendency to depend on the word's left and right entropy

LANDINI (2003-03-08)

SAZONOV (undated)

Text analysis of paragraphs (character and word frequency distribution)

- Suggestion of a system of prefixes, midfixes (mantle) and stems (core).

prefixes	qo, o, y
suffixes	dy, d

SAZONOV (2003-03-10)

No description of methods

- Suggestion of a system of prefixes, midfixes (mantle) and stems (core).

- Suggestion of different grammatical categories for each group of pre-, midfixes and stems (no details).

prefixes

qo, o

midfixes, mantle

words starting with character:

ch, Sh, e, c

cTh, cKh, cFh, cPh

roots,stems, core

words starting with character:

k, t, p , f

SAZONOV (2003-03-12)

Text analysis of lines and paragraphs (character and word frequency distribution)

No clear description of methods

SAZONOV (2003a) is not scientific because position in sentence is wrong cf. [PALMER (2014)]

word starting with character:	syntax (structure of words and sentences):
k, t, p , f	- sentence start position - accepting prefixes qo, o
ch, Sh, e d	- sentence end position - accepting prefixes qo, o
o	- "nouns in the main form" - accepting prefix q

SAZONOV (2003-03-18)

Comparative text analysis of paragraphs (character and word frequency distribution)

No clear statement of methods

SAZONOV (2003b) is not scientific because it states only commonplaces

- "phonetical dependency": "the next word changes itself dependent of a precious [sic] word" because of "strong dependency between the final of the previous word and the initial of the next word"

- "forms of words very often reproduce the same form (the same intial or the same final)"

GOSLEE (2006-10-22)

Statistical analysis (character and word frequencies of paragraphs) of the VMS (modified interlinear transcription file v1.6e6)

- character frequencies of folios in the herbal section (CURRIER A) form a distinct group against folios in the balneological section (CURRIER B) (i.e. a different set of characters is in use in each section)

- word frequencies of folios in the herbal section (CURRIER A) form a distinct group against folios in the balneological section (CURRIER B) (i.e. a different set of words (vocabulary) is in use in each section)

- word frequencies of folios in the pharmaceutical, astronomical, recipe and herbal 2 sections form a distinct group within the herbal group (CURRIER A)

- these differences in word frequencies suggest that the folios in CURRIER A have different topics than folios in CURRIER B and that within the CURRIER B group there are several subtopics (e.g. Herbal A and Herbal B)

character frequency distribution	Herbal	Recipe	Balneological
very high (>0.15)	o
high (0.10-0.15)	ch y	e o y	e o y
middle (0.05-0.10)	a d k l n	a ch d	a ch d
low (<0.05)	Sh e q r g m s t f p i ii iii	Sh l n q r g m s k t f i ii iii	Sh l n q r g m s k t f p i ii iii

- word frequency distribution in the herbal (CURRIER A), balneological and recipes sections (CURRIER B) is very similar (<0.05), except for daiin as an extraordinarily common word (>0.15) in the herbal section (CURRIER A)

Section	Herbal (CURRIER A)	Recipe (CURRIER A)	Balneological (CURRIER B)
Pages	24	23	19
Number of characters (chars)	6'817	50'203	27'738
Number of words (words occ.)	1'686	10'746	6'236
Number of distinct words (words)	685	3'193	1'413
Mean word length	4.0	4.7	4.4
Mean number of times a word occurs (N occ.)	2.5	3.4	4.4
Percentage of words appearing only once (Pct. unique words)	30 %	30 %	35 %
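The derived rows of the table follow from the counts: mean word length is characters divided by word occurrences, and the mean number of occurrences is word occurrences divided by distinct words (e.g. 1'686 / 685 ≈ 2.5 for the herbal sample). The sketch below computes the same quantities for any whitespace-separated text. It assumes that each transliteration character counts as one character and that the percentage of words appearing only once is taken over distinct words; the source does not state the denominator. The function name is hypothetical.

```python
from collections import Counter


def section_stats(text):
    """Basic vocabulary statistics for a whitespace-separated transliteration."""
    tokens = text.split()
    counts = Counter(tokens)
    chars = sum(len(t) for t in tokens)
    once = sum(1 for n in counts.values() if n == 1)
    return {
        "chars": chars,
        "words_occ": len(tokens),
        "distinct_words": len(counts),
        "mean_word_length": chars / len(tokens),
        "mean_occurrences": len(tokens) / len(counts),
        # Assumption: share of distinct words that occur exactly once
        "pct_once": 100 * once / len(counts),
    }


# Toy input for illustration only
stats = section_stats("daiin chol daiin chor")
print(stats)
```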

GOSLEE (2006-10-22a)

Statistical analysis (character and word frequencies of paragraphs) of the VMS (modified interlinear transcription file v1.6e6)

frequency distribution of the first character of a word	Herbal	Recipe	Balneological
very high (>0.20)	d	o	o q
high (0.15-0.20)	ch	ch q
middle (0.05-0.15)	Sh o q t y	a Sh l	ch Sh d l
low (<0.05)	a e g l m n s f k p i ii iii	d e g m n s y f k p t i ii iii	a e g m n s y f k p t i ii iii

frequency distribution of the first character of a paragraph	Herbal	Recipe	Balneological
very high (>0.5)			p
high (0.3-0.5)	p	p
middle (0.1-0.3)	k t	t	t
low (0.01-0.1)	f o q	f k Sh d o q s y	k Sh o q s

frequency distribution of the last character of a word	Herbal	Recipe	Balneological
very high (>0.5)			y
high (0.3-0.5)	y	y
middle (0.1-0.3)	l n r	l n r	l n r
low (0.01-0.1)	m o s	m o s	m o s

frequency distribution of the last character of a paragraph	Herbal	Recipe	Balneological
very high (>0.5)	n		y
high (0.3-0.5)		y
middle (0.1-0.3)	y	m n	l
low (0.01-0.1)	g o r	d l o r s	m n o r s

GOSLEE (2006-10-22b)

Statistical analysis (character and word frequencies of paragraphs) of the VMS (modified interlinear transcription file v1.6e6)

- frequency distribution of gallows and of paragraph-beginning gallows on folios in CURRIER A show any correlation to folio or paragraph initial letter

GOSLEE (2006-10-22c)

Statistical analysis (character and word frequencies) of three modified Latin texts

- character frequencies of the Latin texts "show any clear internal groupings"

- word frequencies of the Latin texts show moderate groupings due to their different topics

- character frequency distributions of the Latin texts are very similar to each other. The most common word is "et".

Section	Latin text 1 Apuleius () de mundo	Latin text 2 Lucretius () titi lucreti cari de rerum natura liber primus	Latin text 3 Isidorus Hispalensis () de natura rerum	VMS Balneological section (CURRIER B)
Pages	39	69	172	19
Number of characters (chars)	39'250	40'598	62'546	27'738
Number of words (words occ.)	6'479	7'266	10'785	6'236
Number of distinct words (words)	3'404	2'699	4'010	1'413
Mean word length	6.1	5.6	5.8	4.4
Mean number of times a word occurs (N occ.)	1.9	2.7	2.7	4.4
Percentage of words appearing only once (Pct. unique words)	22 %	32 %	32 %	35 %

RAJKARNIKAR (2004)

Statistical analysis (Zipf's law, tokens vs. word types)

- Entropy

- letter serial correlation (LSC)

- Zipf's law: similar to natural languages.

- entropy: tends to be similar to natural languages (English: 9-10. VMS: 10-11. gibberish: 12).

- LSC: similar to natural languages.
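A common way to check a text against Zipf's law is to fit a straight line to log frequency versus log rank; a slope near -1 is the classic Zipfian case. The sketch below does this with an ordinary least-squares fit. It is a generic illustration, not the procedure of RAJKARNIKAR (2004), and the function name is hypothetical.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank); needs two or more distinct words."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic sample whose frequencies are exactly 12 / rank
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
slope = zipf_slope(tokens)
print(slope)
```

For real text the fit is usually restricted to a range of ranks, since the most frequent and the rarest words tend to deviate from the straight line.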

SCHINNER (2007)

- random-walk-like fluctuations [30]

- Probability of similar words repeating in the text at a given distance from each other follows a geometric distribution. [31]

- Long range correlations at distances over 72 characters, which is a little over the average line length. [32]

JASKIEWICZ (2011)

Statistical analysis (letter frequency distribution) of the VMS and about 300 languages to find similar languages to the one which was used in the VMS

- 40 possible languages match.

- top five matches designate three regions: Caucasus (Moldavian, Karakalpak, Kabardian-Circassian), India (Kannada, similarity to Sanskrit) and Indochina (Thai).

- letter frequency distribution of the script behaves similarly to those of languages poor in vowels.

ZANDBERGEN (2011)

Statistical analysis (word frequency) of the VMS (FSG1 Transcription converted to CURRIER Transcription, Petersen Pagination) to identify correlations between any pair of pages

- most frequent word in CURRIER A: daiin (CURRIER Notation 8AM)

- most frequent word in CURRIER B: chedy (CURRIER Notation SC89)

- chedy does not occur at all in CURRIER A

- In general, words either occur in both languages or they occur in 'B' only.

- folios of biological section are very similar to each other (Bio-B)

- folios of stars section are very similar to each other (Stars-B)

- There is no clear evidence for any pages in a third language.

- pages 183-185 (all in <f89>) show a correlation with both A and B pages.

- Page clustering:

1. Herbal-A

2. Pharma-A

3. Herbal-B

4. Stars-B (low correlation with Bio-B)

5. Stars-Bio (high correlation with Bio-B)

6. Biological-B

- Some words are frequent in all clusters: 8AM, SC9, OE, OR, 2AM, which essentially excludes the possibility that the A and B languages are different plaintext languages. A verbose encryption of two different plaintext languages is not at all excluded, though.

- The usual differences between A and B are obvious, mostly: high-frequency -89 in B, almost always preceded by C. In A, -89 is often preceded by O. The unattached finals (AM, AR, AE) are much more prominent in Stars-B than in Bio-B.

- The difference between the two stars dialects is mainly the frequent occurrence of 40FCC9 in Stars-Bio.

- The most frequent Herbal-A words are very short and have either a middle O or A or end with a 9.

- The main difference between Herbal-A and Pharma-A: the relatively frequent SCOE in Pharma-A (and other words ending in -COE) opposed to the more frequent initial Q in Herbal-A.

- The evidence suggesting a commonality between Herbal-A and Herbal-B is very thin. 8AM is still the most frequent word and 4OFAM is not as frequent in Herbal-B as in the other B material. The inverse frequency of 8AM is: 19 for Herbal-A and Pharma-A, 38 for Herbal-B, 50, 54 and 60 for the other three B dialects. One way of looking at the A and B languages may be as a mixture of 8AM-language and something else. Language A is full 8AM-language, Herbal-B is only half, and the other B dialects one third.

- the Herbal-A pages located after the biological section are rather different from the early Herbal-A pages

ZANDBERGEN (2011-02-26)

Statistical analysis (entropy and frequency of single words, first- and second-characters) of the VMS (EVA Transcription in Curva) compared to Vulgate, Latin, English, Dalgarno's Language and Chinese (Pinyin)

- The Vulgate text of Genesis uses fewer tokens than Caesar's De Bello Gallico. This may be due both to the simpler language and the nature of the text in the Vulgate Genesis sample.

- Word counts and entropies are the same in the two transliteration alphabets used for the Voynich MS, but the information per n-graph is lower in EVA than in Curva, as expected. It is believed that Curva better represents single characters in the Voynich MS, so this transliteration system will be concentrated on most.

- The two very different Voynich 'languages' which are herbal-A and herbal-B do not differ very significantly in the information content in the characters and words.

- Voynichese is nearly as information-rich as Julius Caesar's Latin, and significantly more so than the Vulgate version of Genesis

Voynichese is less information-rich than Latin in the first two characters of each word, but compensates by greater variability in the trailer.

- The single word entropy of the Voynich MS is not anomalous. Perhaps an argument that the spaces are for real.

First-character and second-character entropy of Voynich MS text are significantly low, and thus especially also the initial-digraph entropy. However, in Voynichese, the third character of each word is as unpredictable as the second, and the fourth contains much more information than is the case in Latin

- In Latin, words are longer than in Voynichese. Comparing the bits of information after the fourth character shows that:

1) Vulgate has less information in endings than Latin (why?)

2) Voynichese has marginally less information than classical Latin, but uses fewer characters to store this information. Is Voynichese 'less inflected' than Latin?

- Overall, the Voynich MS words contain as much information as Latin words, but the information is more equally distributed over the characters while the words are shorter.
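The per-position comparison can be illustrated by computing the entropy of the character found at a given position within words. The sketch below computes the unconditional entropy of the n-th character; whether ZANDBERGEN (2011-02-26) used unconditional or conditional entropies per position is an assumption, as is the treatment of every transliteration character as one unit. The function name is hypothetical.

```python
import math
from collections import Counter


def positional_entropy(words, position):
    """Entropy in bits of the character at a 0-based position, over words long enough to have one."""
    counts = Counter(w[position] for w in words if len(w) > position)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(n / total * math.log2(n / total) for n in counts.values())


# Toy word list for illustration only
words = ["daiin", "chol", "chor", "daiin"]
first = positional_entropy(words, 0)
fourth = positional_entropy(words, 3)
print(first, fourth)
```

Plotting this value for positions 0, 1, 2, ... for two texts shows where in the word each text concentrates its information.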

KNIGHT & REDDY (2011)

Statistical analysis (frequency of words and characters) of the VMS (language 'B', Currier Transcription)

- Manuscript contains two different vocabularies

- Last character of every word is generated by one of the HMM states; i.e. the word grammar is a*b; i.e. the script could be an abjad.

- Gallows (k f t p) are not uppercase versions of other letters.

- Gallows (k f t p) are contextually similar to each other.

- There is most likely no punctuation in the traditional sense.

- Word length distribution is similar to Arabic.

- Predictability of letters is higher compared to other languages.

- Word-final characters have much lower entropy compared to most other languages, i.e. some characters appear almost exclusively at the ends of words.

- Weak bigram word order (e.g. very few repeated bigrams).

- Words do not show significant long-distance correlations.

- Pages probably do have topics, but are not independent of one another.

- The characters p t s y (Currier transcription: B P 2 9) occur more likely at line/word beginnings.

- The character iiim (Currier transcription: 5) occurs only at line/word ends.

- The characters iim im m g y (Currier transcription: L K J 6 9) occur more likely at line/word ends.

AMANCIO, ALTMANN, RYBSKI, OLIVEIRA & COSTA (2013)

Statistical analysis (combined measurements of 29 statistical text analysis methods e.g. first-order-statistics, word frequency, network metrics, adjacent words, context nodes, intermittency concepts, time-series-analysis, cluster analysis, syntax, semantic) on the VMS

- VMS is compatible with natural languages

- VMS is not compatible with shuffled meaningless texts. The word order in the VMS is not established by chance.

- VMS has a much greater number of duplicated bigrams than expected by chance, unlike natural languages.

- VMS has a higher intermittency value i (which is related to the large-scale distribution of a (key) word in the text) than is typically observed in natural languages. The reason for these observations may be that the VMS is a compendium of different topics, which is also suggested by illustrations related to herbs, astronomy, cosmology, biology etc.

- Finding the language most similar to the VMS will require more effort (analysis of as many books as possible representing each language).

- identification of keywords (with frequency):

cThy (111)

lchedy (119)

Shedy (426)

Sho (130)

chor (219)

dchor (26)

qokal (191)

qokain (279)

qokaiin (262)

qokeedy (305)

qotain (64)

qol (151)

otedy (155)

olkeedy (42)

lkaiin (49)
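The comparison with shuffled text mentioned above can be sketched as follows. This is an illustration only, not the authors' pipeline: it reads "duplicated bigram" as a word immediately followed by itself (an assumption), and the function names, the number of trials and the toy word sequence are hypothetical.

```python
import random


def adjacent_repeats(words):
    """Count positions where a word is immediately followed by itself."""
    return sum(1 for a, b in zip(words, words[1:]) if a == b)


def shuffled_baseline(words, trials=200, seed=0):
    """Mean adjacent-repeat count over random shuffles of the same words."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    pool = list(words)
    total = 0
    for _ in range(trials):
        rng.shuffle(pool)
        total += adjacent_repeats(pool)
    return total / trials


# Hypothetical toy sequence, not a real passage of the VMS.
text = "qokeedy qokeedy daiin chedy qokeedy okaiin okaiin chor daiin shedy".split()
print(adjacent_repeats(text), shuffled_baseline(text))
```

If the observed count lies well above the shuffled mean, the repeats are unlikely to be a by-product of word frequencies alone.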

MONTEMURRO & ZANETTE (2013)

Statistical analysis (Zipf's law, distribution of words) of the VMS to identify large scale organizational structure

- Identification of content-bearing words

- Analysing patterns of co-occurrence (semantic relationships) of the most informative words

- "In the light of recent research on the emergence of Zipf’s law, the probability that the Voynich text resulted from some kind of stochastic process is drastically reduced." [33]

- Information in word distribution as a function of the scale resembles natural languages.

- Listing of the most informative words which are "both highly frequent and have a strongly non-uniform distribution over the different ‘thematic’ sections" [34]

- Clustering of the most informative words with similar patterns of occurrence (indicating semantic affinity) and their cumulative distribution over the whole text. "[...] the words that are more strongly connected have an evident morphological similarity." [35] High specificity of the occurrence of the most informative words over thematic sections.

- Clustering of sections based on the occurrence of most informative words. Pharmaceutical and herbal section have a very strong connection indicating that "there is a match between the linguistic structure and the illustrations of the text". [36]

- "[...] any model for the hoax’s fabrication must [...] explain in detail how such linguistic-like structures emerged from the process itself."[37] "[...] the description of such [statistical] features [...] were unquestionably out of reach of sixteenth-century mathematics." [38]
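How unevenly a word is spread over the thematic sections can be illustrated with a plain entropy measure. The sketch below is a simplification and not the estimator used by Montemurro & Zanette: it ignores section sizes, and the name `section_entropy` and the toy sections are assumptions made for the example.

```python
import math


def section_entropy(word, sections):
    """Entropy (bits) of a word's occurrences across named sections.

    `sections` maps a section name to its list of words. Returns None
    if the word does not occur at all.
    """
    counts = [words.count(word) for words in sections.values()]
    total = sum(counts)
    if total == 0:
        return None
    probs = [c / total for c in counts if c]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical toy data, not real section contents.
toy = {
    "herbal": ["chor", "chor", "daiin"],
    "biological": ["qol", "qol", "daiin"],
}
for w in ("chor", "qol", "daiin"):
    print(w, section_entropy(w, toy))
```

A value of 0 means the word is confined to one section. The maximum, log2 of the number of sections, means it is spread evenly.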

RUGG (2013)

Visual examination (text structures, distribution of some syllables) of the VMS (modified Transcription) compared to English and German text

- some syllables show different frequency distributions over the VMS (daiin, qo, dy)

- some syllables have the same frequency distribution over the whole script (ol)

PALMER (2014)

Statistical analysis (frequency of characters) of the VMS (RAYHEL Transcription in Voynich 101)

- character position in words:

first	o q ch, Sh	f p cTh cKh cPh CFh		s y
second	o	f p t k cTh cKh cPh CFh	e ee eee
middle		k	e ee eee
penultimate	a d o		e ee eee i ii
last	g r m n		in iin iiin ir iir	s y

ROUSH & SHI (2014)

Statistical analysis (word length frequency distribution (WLFD, Zipf's Law), word recurrence intervals (WRI), N-gram relationships, Shannon Entropy of N-grams, supervised learning algorithms (SLA), support vector machines (SVM), multiple discriminant analysis (MDA), line break characteristics, TF-IDF, word rank, collocation, Pointwise Mutual Information (PMI), Leave-One-Out-Cross-Validation (LOO-CV)) of the VMS (TAKAHASHI Transcription in EVA) compared to 8 languages (Chinese, English, Hebrew, Hungarian, Italian, Latin, Pinyin, Russian)

Computational Language Processing

- VMS shows a mid-range unigram word entropy of 10.46 bits.

- VMS shows a very high bigram word entropy of 14.81 bits, i.e. there are very few repeated bigrams in the VMS and the VMS has a very weak word order.

- VMS shows a mid-range trigram word entropy.

- Bigram Ranking Comparison shows that the VMS has an extremely weak word order, i.e. the VMS could be a hoax, gibberish or some type of code.

- Testing with known languages showed that only punctuation characters occur often at line breaks (i.e. the end of a line). This characteristic was the same for all tested natural languages. In the VMS, the characters with significant line break relationships are ‘m’ and ‘g’. The manuscript may use these characters as a form of punctuation, or this may indicate that the text is a code instead of a natural language. Punctuation in a transcription of a 1500s herbal did not follow strict rules and often appeared at a glance to be part of the word.

- High number of unique words per page and per section. It is not possible to tell which words relate to the illustrations without further information. The astronomical, cosmological and recipes sections show an extremely high percentage of unique words.

- Some common words occur with high frequency only in specific sections, e.g. qol in the biological section (CURRIER B).

- MDA based on WLFD classified VMS CURRIER A similar to Italian or English and VMS CURRIER B similar to English. MDA based on WRI classified VMS CURRIER A and B similar to English. But the close clustering indicates that WRI is a poor classification feature.

- WLFD of the VMS shows similarities to the WLFD of Hebrew.

- WRI versus rank order shows that the VMS is similar to Chinese or Hebrew.
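The word entropies quoted above are Shannon entropies of the word n-gram distribution. A minimal sketch of a plug-in estimate is given below. The exact tokenisation and estimator used by Roush & Shi are not stated here, so the function `ngram_entropy` and the sample words are assumptions for illustration.

```python
import math
from collections import Counter


def ngram_entropy(words, n=1):
    """Shannon entropy (bits) of the empirical word n-gram distribution."""
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Hypothetical sample; real input would be the full word sequence.
words = "daiin chedy qokeedy daiin chedy okaiin".split()
print(ngram_entropy(words, 1), ngram_entropy(words, 2))
```

When every bigram in a text is distinct, this estimate reaches its maximum, log2 of the number of bigrams. That is why a very high bigram entropy is read as "very few repeated bigrams".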

Research about 1540's writing

- No strict grammatical rules. Words would be abbreviated, written in shorthand, or have multiple spellings.

- Letters were written differently depending on their position in the word. The structure of a letter at the beginning of the word was remarkably different to that of the same letter in the middle of a word.

- Letters were substituted with symbols depending on the author who wrote the text.

- It was often difficult to distinguish the spaces between words without the context of surrounding words.

- Capital letters were sometimes replaced with lower case equivalents and vice versa.

- Words could be duplicated on either side of paragraph and page ends.

- Some words were continued onto the next line and a special indicator mark was used to show that the word was split. This mark was not used in all cases and depended on the free space left on a page or if the author felt like including it.

ZANDBERGEN (2014-08-02)

Statistical analysis (digraph frequency distribution) of VMS (in Curva)

- When Currier identified his languages A and B, he did this on the basis of the different statistics of the initial herbal pages in the MS, which are identified by the red ('A') and dark blue ('B') crosses. It is clear that these have distinct properties - the clouds do not overlap. He also checked the other pages, and noted more variations, but his criteria for distinguishing the languages did not allow him to see that the overall statistics demonstrate that there is a continuum, and the other (not herbal) pages actually 'bridge the gap'.

- there is a clear correlation between the text and the illustration on each page in the MS

- This does not demonstrate that the text is meaningful, or that the text variations are caused by different subject matter (as suggested in 2012 by Montemurro). If that were the case, the difference between herbal A and herbal B should not exist. The cause of the (statistical) language variation is still unexplained.

- the language usage depends on whether the bifolio on which it is written has a standard size or has additional folds. The 'bridging' between the two languages A and B is located exclusively on the foldout pages. This is an important feature that equally still lacks an explanation, but which almost certainly must be related to the order in which the MS has been created.

Sections:

ha: 86 Herbal A pages in quires 1-7.

hp: 10 more Herbal A pages in higher-numbered pages, among the pharmaceutical section.

ph: 17 Pharmaceutical pages containing text.

zo: 12 pages with Zodiac drawings.

ac: 30 pages with Astronomical or Cosmological drawings (and a few pages contained among them with only text).

hb: 34 Herbal B pages.

st: 23 pages in the Stars or recipes section.

bi: 20 pages in the Biological section.

Languages:

Ah: mostly used in the Herbal-A section, but also in some Pharma pages.

Ap: mostly used in the Pharma section, but also in some of the higher-numbered Herbal-A pages (code hp) and some Cosmological pages.

Bb: used in the Biological section

Bb': used on the central bifolio of the biological section

Bh: used in about half the Herbal-B section

Bhb: used in the other half the Herbal-B section, more similar to Bb.

Bs: used in about half the Stars section

Bsb: used in the other half the Stars section, more similar to Bb.

C: used in the Zodiac section

D: used in the nine-disk rosetta

CHAM & JACKSON (2015)

Statistical analysis (glyph design patterns) of the VMS

Proposal of a Curve-Line-System (CLS)

Base shape glyph classification		Word structure rules based on glyph base shape
curve-glyphs [c] (based on e)	e o ch c"h d g y s b cKh cTh cFh cPh	curve-glyphs are next to curve-glyphs
digraphs which count as a single curve-glyph [c]	or, ol, ar, al	curve-glyphs are next to curve-glyphs
curve-line-glyphs [a] (based on e and i)	a u	transition-glyph between curve- and line-glyphs
line-glyphs [l] (based on i)	i n m=j r l	line-glyphs are next to line-glyphs
line-glyphs which also could be used as prefixes	r-, l-	prefixes are invariant and could be next to curve- or line-glyphs
non-base-shape-glyphs [E] (not based on e nor i)	q x v z p f k t	glyphs without base shapes are invariant and could be next to curve- or line-glyphs

remaining non-conforming words or suffixes (e.g.):

(possible explanations: words with transcription errors, foreign words or words written with a different system, words encrypted in substitution cipher or nulls)

dlsSho, kydany, skaodar

on, dl, chl, son, iro, id, ile, ilo,

-oin, -ino, -ais
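The adjacency rules of the CLS table can be turned into a small checker. The sketch below is a simplified reading, not Cham & Jackson's own implementation: it only flags direct contact between a curve-glyph and a line-glyph, it treats Sh as a curve-glyph (an assumption), and it splits words greedily into the multi-letter glyphs listed in `MULTI`. All names are hypothetical.

```python
MULTI = ["cKh", "cTh", "cFh", "cPh", "ch", "Sh", "or", "ol", "ar", "al"]
CURVE = set("eodgysb") | set(MULTI)   # includes the digraphs or, ol, ar, al
TRANSITION = set("au")
LINE = set("inmjrl")


def tokenize(word):
    """Greedy split of a word into glyphs, trying multi-letter glyphs first."""
    tokens, i = [], 0
    while i < len(word):
        for glyph in MULTI:
            if word.startswith(glyph, i):
                break
        else:
            glyph = word[i]
        tokens.append(glyph)
        i += len(glyph)
    return tokens


def glyph_class(glyph):
    if glyph in CURVE:
        return "c"
    if glyph in LINE:
        return "l"
    if glyph in TRANSITION:
        return "a"
    return "E"  # q x v z p f k t and anything unlisted


def conforms(word):
    """True if no curve-glyph directly touches a line-glyph."""
    tokens = tokenize(word)
    classes = [glyph_class(t) for t in tokens]
    if tokens and tokens[0] in ("r", "l"):
        classes[0] = "E"  # r- and l- prefixes are treated as invariant
    return all({a, b} != {"c", "l"} for a, b in zip(classes, classes[1:]))


for w in ("daiin", "chol", "qokaiin", "dl", "on", "chl", "son"):
    print(w, conforms(w))
```

Because the transition-glyph rule is not enforced, this sketch is more permissive than the full system and will accept some words that the CLS lists as non-conforming.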

SMITH (2015-03-08)

Statistical analysis (frequency of the characters a and y) of the VMS

- a could be a graphical variant of y depending on the position in a word or following characters

- complementary distribution of a and y in words and complementary second position distribution of characters following after a and y in words

first position + second position characters

middle position

last position

y + ch Sh d (characters starting with e combine with y)

y + k t p

a + i l m r (characters starting with i induce a instead of y)

- a does not occur as a standalone character, whereas y does so more than a hundred times

- a also does not occur in the so-called key-like sequences (f49v, f57v, f66r, f76r), but y does (except on f76r)

- e.g. for the words y / aiin, oky / okaiin, qokaiin / qoky, okaiin / oky, otaiin / oty, saiin / sy, qotaiin / qoty (etc.) the suffix iin causes the final y to transform into a (the case of y / aiin has to be considered as a single character root y)

SMITH (2015-03-09)

Statistical analysis (frequency of the characters o, a, e and y) of the VMS

- rules governing the character y

rule	sequence	modifier	example
y deletion	ey	+ e ch Sh d k t p (characters starting with e)	chey -> chedy chey -> cheedy chey -> cheky
y deletion/replacement with o (o as a reflex of y)	ey	+ o	otey -> oteody
y deletion/replacement with a (a as a reflex of y)	ey	+ i l m r n (characters starting with i)	chey -> chear chey -> cheal

- o occurs in any position in a word, but not before g or n

first position	middle position	last position
o	o	o
	e	ey

SMITH (2015-03-09a)

Statistical analysis (frequency of the characters)

- y and o are defined as Primes

- Theory that words are built up of sections (bodies and tails). Sections are built around a prime (a y o); the rightmost character of a body section must be a prime.

first section (body)	second section (body)	final section (tail)	resulting word
dy		+ iin	-> daiin
cho		+ l	-> chol
qo	+ ky	+ iin	-> qokaiin
o	+ tey	+ dy	-> otedy
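The compositions in the table above, together with the y rules of the SMITH (2015-03-09) entry, can be sketched as a small joining function. This is an illustrative reading of the notes, not Smith's own formulation: the o case is treated as plain deletion of y, and the names `join_sections`, `DELETE_BEFORE` and `TO_A_BEFORE` are made up for the example.

```python
# A section-final y is dropped before these beginnings ...
DELETE_BEFORE = ("e", "ch", "Sh", "d", "k", "t", "p", "o")
# ... and becomes a before these.
TO_A_BEFORE = ("i", "l", "m", "r", "n")


def join_sections(sections):
    """Concatenate sections, adjusting a section-final y to what follows."""
    out = []
    for current, following in zip(sections, sections[1:] + [""]):
        if current.endswith("y") and following:
            if following.startswith(TO_A_BEFORE):
                current = current[:-1] + "a"
            elif following.startswith(DELETE_BEFORE):
                current = current[:-1]
        out.append(current)
    return "".join(out)


for parts in (["dy", "iin"], ["cho", "l"], ["qo", "ky", "iin"], ["o", "tey", "dy"]):
    print(" + ".join(parts), "->", join_sections(parts))
```

The four calls reproduce the rows of the table: daiin, chol, qokaiin and otedy.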

- section bodies are (almost) not determined by their prime

section body (+ optional characters, [+ optional characters only in combination])				prime
	s l r	+ ch Sh	+ e ee eee	+ a o y
	p f	[+ ch Sh + ch Sh	+ e ee eee]	+ a o y
l +	k t	+ ch Sh	+ e ee eee	+ a o y
ch Sh +	cKh cTh cPh cFh		+ e ee eee	+ a o y
ch Sh l +	d	+ ch Sh	+ e ee eee	+ a o y
ch Sh +	d			+ y
q				+ o

- section tails are determined by the prime which comes before them

prime	tail (+ optional characters)
a o	+ i ii iii	l r m n
o	+ i ii iii	d s l r m
y deletion		d s

SMITH (2015-03-10)

Statistical analysis (frequency of the characters)

- classifying body sections into Fore, Free and After sections

Fore

Free

After

- o and y as prime

- tails

- no e sequences

- no ch Sh to the left

- o and a as prime

- tails

- y as prime

- no tails

SMITH (2015-03-12)

Statistical analysis (frequency of the characters)

- theory that lk is a digraph (two symbols which together represent a single sound)

SMITH (2015-03-19)

- theories about the number of sounds in the VMS (vowels, consonants, stops, fricatives, nasals, semivowels, liquids) based on the average number of sounds in natural languages

SMITH (2015-03-24)

- theories about the possibility of a, o and y being vowels

SMITH (2015-03-29)

Statistical analysis (frequency of the characters) of the VMS (line first words of f103r to f116r, modified TAKAHASHI Transcription in EVA)

- clear patterns in which letters begin line-first words (y d p o s t)

- theory that o or p is stripped from words when they become line-first words

- theory that y or d is added to words beginning with ch or Sh when they become line-first words
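The comparison behind these observations can be sketched as a tally of word-initial letters, split by whether the word opens a line. This is a minimal illustration under assumptions: the input is a list of transcribed lines with space-separated words, only the first character is counted (so EVA ch counts as c), and the function name and sample lines are hypothetical.

```python
from collections import Counter


def initial_letter_profile(lines):
    """Return (line-first, other) Counters of word-initial letters."""
    first, other = Counter(), Counter()
    for line in lines:
        for index, word in enumerate(line.split()):
            target = first if index == 0 else other
            target[word[0]] += 1
    return first, other


# Hypothetical lines, not a real folio.
lines = ["ychedy qokeedy chedy", "dShedy okaiin", "pchedy ol daiin"]
first, other = initial_letter_profile(lines)
print(first.most_common())
print(other.most_common())
```

Letters that are frequent in the first counter but rare in the second are candidates for line-first effects of the kind described above.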

SMITH (2015-04-05)

Text analysis (Sonority Sequencing Principle (SSP)) of the VMS

- theory about the sonority of the VMS characters

- from left to right, less sonorous:

a o y > e ee eee > ch Sh cKh cTh cFh, cPh > k t f p d s r > l

- only vowels:

a o y > i ii iii > m n r d s l

SMITH (2015-06-21)

Statistical analysis (frequency analysis) of the VMS (paragraph first words in stars section)

- defining paragraph-first words beginning with p k f t as Grove Words

- theory that p k f or t is added to words when they become paragraph-first words (but some words might already begin with p k f t)

SMITH (2015-06-24)

Statistical analysis (frequency analysis) of the VMS (paragraph first words in stars section)

- after removing the initial p k f or t from paragraph-initial words, their frequencies become more similar to the frequencies of normal words (i.e. words that are not line-first)

- theory that p k f or t is added to words when they become paragraph-first words (unless they already begin with p k f t)
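The stripping experiment can be sketched as follows. The measure used here, the share of paragraph-first words that are also attested elsewhere in the text, is a stand-in chosen for illustration and not necessarily the comparison Smith made. The function names and the toy data are hypothetical.

```python
GALLOWS = ("p", "k", "f", "t")


def strip_gallows(word):
    """Drop one initial plain gallows letter, if present."""
    if len(word) > 1 and word.startswith(GALLOWS):
        return word[1:]
    return word


def attested_share(first_words, vocabulary):
    """Fraction of the given words that also occur in the reference vocabulary."""
    if not first_words:
        return 0.0
    return sum(w in vocabulary for w in first_words) / len(first_words)


# Hypothetical toy data: paragraph-first words and words seen elsewhere.
para_first = ["pchedy", "tShedy", "kaiin", "polaiin"]
elsewhere = {"chedy", "Shedy", "aiin", "olaiin", "daiin"}
before = attested_share(para_first, elsewhere)
after = attested_share([strip_gallows(w) for w in para_first], elsewhere)
print(before, after)
```

A clear rise in the share after stripping would be consistent with the theory that the gallows letter is an addition to an otherwise ordinary word.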

SMITH (2015-09-21)

Statistical analysis (frequency analysis) of the VMS (bench gallows characters)

- defining cTh cFh cPh cKh as bench gallows

- ch Sh almost never follow cTh cFh cPh cKh

- theory that bench gallows characters are distinct from normal gallows characters

ZANDBERGEN (2015-06-02a)

summary of graphematical/graphotactical analysis

ZANDBERGEN (2015-06-02c)

summary of analysis of the script (transcriptions, transcribers, transcription alphabets, similarities with other scripts, visual graphetical analysis)

ZANDBERGEN (2015-08-08)

description of the VMS and its sections

ZANDBERGEN (2015-10-26)

Analysis 2 - Character statistics

ZANDBERGEN (2015-10-29a)

Analysis 5 - Sentences, paragraphs, sections

ZANDBERGEN (2013-11-16)

Analysis 3 - Word structure

ZANDBERGEN (2013-11-16a)

Analysis 4 - Word statistics

Last edited on 17.03.2016

[1] D'IMPERIO (1978) p.41

[2] "At his death, his [Father Petersen's] papers were given to William Friedman; they were inventoried at Friedman's request by Tiltman, and are now a part of the Friedman collection at the Marshall Library in Lexington, Virginia." D'IMPERIO (1978) p.41

[3] REEDS (1998) p.3

[4] D'IMPERIO (1978) p.40

[5] D'IMPERIO (1978) p.40

[6] D'IMPERIO (1978) p.41-42

[7] D'IMPERIO (1978) p.42

[8] TILTMAN (1967) p.6

[9] D'IMPERIO (1978) p.44

[10] D'IMPERIO (1978) p.45

[11] CURRIER (1976) p.5

[12] MONTEMURRO & ZANETTE (2013) p.1

[13] STALLINGS (1998)

[14] KNIGHT & REDDY (2011)

[15] STALLINGS (1998)

[16] STALLINGS (1998)

[17] D'IMPERIO (1979) p.88

[18] "Jacques Guy summarized his analysis in Cryptologia as follows: «I transcribed the two folios in Bennett's book and submitted them to letter-frequency counts, distinguishing word-initial, word-medial, word-final, isolated, line-initial, and line-final positions. I also submitted that transcription to Sukhotin's algorithm which, given a text written in an alphabetical system, identifies which symbols are vowels and which are consonants. The letter transcribed CT in Bennett's system came out as a consonant, the one transcribed CC as vowel. Now it so happens that CT is exactly the shape of the letter 't' in the Beneventan script (used in medieval Spain and Northern Italy), and CC is exactly the shape of 'a' in that same script. I concluded that the author had a knowledge of that script, and that the values of CT and CC probably were 't' and 'a'. There's a lot more, but more shaky.»" GUY (1991a) in BAEZ (1992)

[19] GUY (1991a) in BAEZ (1992)

[20] JASKIEWICZ (2011) p.2

[21] KNIGHT & REDDY (2011)

[22] GUY (1991a) in BAEZ (1992)

[23] GUY (1991a) in BAEZ (1992)

[24] REEDS (2001)

[25] REEDS (2001)

[26] MONTEMURRO & ZANETTE (2013) p.1

[27] SUZEK & HAJIC (2001)

[28] KNIGHT & REDDY (2011)

[29] KNIGHT & REDDY (2011)

[30] MONTEMURRO & ZANETTE (2013) p.1

[31] KNIGHT & REDDY (2011)

[32] KNIGHT & REDDY (2011)

[33] MONTEMURRO & ZANETTE (2013) p.2

[34] MONTEMURRO & ZANETTE (2013) p.3

[35] MONTEMURRO & ZANETTE (2013) p.3

[36] MONTEMURRO & ZANETTE (2013) p.4

[37] MONTEMURRO & ZANETTE (2013) p.4

[38] MONTEMURRO & ZANETTE (2013) p.2