Monday, December 10, 2007

Consonant analysis

Transcribing vowels is really problematic for a simple algorithm, and transcribing consonants into similar-sounding Latin text is no easier. So what do you think of this algorithm? (It is written simply with replace statements in PL/SQL.)

) = aleph, ` = ayin, C = tsade, $ = shin, S = sin, Y = yod, V = vav, + = teth; bgdkpt = without dagesh, BGDKPT = with dagesh. I don't know whether it is important to keep the dagesh. I have one root derivation algorithm and I am looking for another that does not require keeping a full online dictionary (which I don't think exists in relational form yet).
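
To make the replace-based idea concrete, here is a minimal sketch in Python rather than the PL/SQL itself. It assumes only the symbols in the legend above, plus final kaf and final tsade as they appear in the sample lines below; every other character passes through unchanged. The names REPLACEMENTS and transliterate are hypothetical, and sin is not handled because unpointed text cannot tell it from shin.

```python
DAGESH = "\u05BC"  # combining dagesh point

# Letters written lowercase without dagesh and uppercase with it.
SOFT = {
    "\u05D1": "b",  # bet
    "\u05D2": "g",  # gimel
    "\u05D3": "d",  # dalet
    "\u05DB": "k",  # kaf
    "\u05E4": "p",  # pe
    "\u05EA": "t",  # tav
}

# Remaining legend symbols (assumption: final forms fold onto the base symbol).
OTHER = {
    "\u05D0": ")",  # aleph
    "\u05E2": "`",  # ayin
    "\u05E6": "C",  # tsade
    "\u05E5": "C",  # final tsade
    "\u05DA": "k",  # final kaf
    "\u05E9": "$",  # shin (sin would need the sin dot, omitted here)
    "\u05D9": "Y",  # yod
    "\u05D5": "V",  # vav
    "\u05D8": "+",  # teth
}

# Order matters, as with nested replace calls: "letter + dagesh" must be
# replaced before the bare letter, or the dagesh would be left stranded.
REPLACEMENTS = (
    [(h + DAGESH, l.upper()) for h, l in SOFT.items()]
    + list(SOFT.items())
    + list(OTHER.items())
)


def transliterate(text: str) -> str:
    """Apply each replacement in turn over the whole string."""
    for hebrew, latin in REPLACEMENTS:
        text = text.replace(hebrew, latin)
    return text


# bet + dagesh, ayin, tsade, tav
print(transliterate("\u05D1\u05BC\u05E2\u05E6\u05EA"))  # B`Ct
```

The only design point worth noting is the ordering: the dagesh forms go first so that a pointed bet becomes B rather than b followed by a stray point.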

What I 'see' in trying to read Hebrew without vowels is that one never quite knows whether a mater is a vowel or a consonant. Without fluent word recognition, one never quite knows whether a vav is a U, an O, a Ve, a Va, or whatever. Similarly with yod: when is it an I and when a Y?

Interesting to me when reading the following is that sometimes I can recognize the word and sometimes not. I am never going to read this way, but there are all sorts of old articles and books that do this for lack of Hebrew characters. Further, in the absence of a Unicode database, Latin-based comparisons are much faster and easier, and more likely to yield a match on a visual search. Equally, the reverse transformation is one-to-one, so I can sort of check that I didn't omit anything (except the vowels).

)$RY H)Y$ )$R L) HLk B`Ct R$`Ym
VbdRk X+)Ym L) `md VbmV$b LCYm L) Y$b
L)-kn HR$`Ym KY )m KMC )$R-TDpNV RVX
`L-Kn L)-YQmV R$`Ym BM$P+ VX+)Ym B`dt CDYQYm
KY-YVd` YY DRk CDYQYm VdRk R$`Ym T)bd

אשרי האיש אשר לא הלך בּעצת רשעים
ובדרך חטאים לא עמד ובמושב לצים לא ישב
כּי אם בּתורת יי חפצו ובתורתו יהגּה יומם ולילה
והיה כּעץ שתול על-פּלגי-מים אשר פּריו יתּן
וכל אשר-יעסה יצליח בּעתּו ועלהו לא-יבּול
לא-כן הרשעים כּי אם כּמּץ אשר-תּדּפנּו רוח
על-כּן לא-יקמו רשעים בּמּשפּט וחטאים בּעדת צדּיקים
כּי-יודע יי דּרך צדּיקים ודרך רשעים תּאבד
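
The claim that the reverse transformation is one-to-one can be checked mechanically. The following is a rough Python sketch, not the PL/SQL used for the text above: it builds a table for the legend's symbols only, confirms that no two symbols map to the same Hebrew string, and round-trips a word. The names TO_HEBREW, to_hebrew, to_latin and is_one_to_one are hypothetical, and final forms are left out because restoring them would need a word-position rule.

```python
DAGESH = "\u05BC"  # combining dagesh point

# Latin symbol -> Hebrew consonant, legend symbols only (hypothetical table).
TO_HEBREW = {
    ")": "\u05D0",  # aleph
    "`": "\u05E2",  # ayin
    "C": "\u05E6",  # tsade
    "$": "\u05E9",  # shin
    "Y": "\u05D9",  # yod
    "V": "\u05D5",  # vav
    "+": "\u05D8",  # teth
    "b": "\u05D1",  # bet
    "g": "\u05D2",  # gimel
    "d": "\u05D3",  # dalet
    "k": "\u05DB",  # kaf
    "p": "\u05E4",  # pe
    "t": "\u05EA",  # tav
}
# Uppercase BGDKPT stands for the same letter followed by a dagesh.
for latin in "bgdkpt":
    TO_HEBREW[latin.upper()] = TO_HEBREW[latin] + DAGESH

# Inverse table; valid only if the forward table is one-to-one.
TO_LATIN = {hebrew: latin for latin, hebrew in TO_HEBREW.items()}


def is_one_to_one(table: dict) -> bool:
    """True when no two keys share a value."""
    return len(set(table.values())) == len(table)


def to_hebrew(latin_text: str) -> str:
    return "".join(TO_HEBREW.get(ch, ch) for ch in latin_text)


def to_latin(hebrew_text: str) -> str:
    out = []
    i = 0
    while i < len(hebrew_text):
        pair = hebrew_text[i:i + 2]
        if len(pair) == 2 and pair in TO_LATIN:  # letter + dagesh
            out.append(TO_LATIN[pair])
            i += 2
        else:
            ch = hebrew_text[i]
            out.append(TO_LATIN.get(ch, ch))  # unknown characters pass through
            i += 1
    return "".join(out)


word = "B`Ct"
print(is_one_to_one(TO_HEBREW))            # True
print(to_latin(to_hebrew(word)) == word)   # True
```

A round trip that fails to reproduce its input would point to a symbol missing from the table or to two symbols colliding.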


Tim said...

Do you mean that the database can't hold Unicode text, only ASCII?

Have you thought of using the Michigan-Claremont scheme, which was a standard for Hebrew in the ASCII period? ;-)

Bob MacDonald said...

The characters # and & need escaping, so it is not a good transcription standard. In fact my database will hold Unicode, but I am not sure that is an advantage. Pointed Unicode text would (I think) be very hard to parse or sort algorithmically. We will likely set up a Unicode instance at some point and I will be faced with a conversion. But the design of 'my' data is very rough at present. Do you know if there is a real relational design for Hebrew language elements anywhere?