I have released version 2.0 of LojGloss. Very cool, in many ways. Check it out.
New LojGloss version
I’ve added parsing support to LojGloss.
Lojban is a language created by people, for people. Its grammar is unambiguous: any sentence can be parsed in only one way. You can still have ambiguity and vagueness in meaning, just not in the grammar. My program will take any valid Lojban and show you its grammatical structure. It’ll also show you the meaning of the words it knows about. Can this much be said for any other language? No. (Unless there’s some other conlang that does this, in which case Lojban still has an order of magnitude more speakers, at least.)
Update: It doesn’t work. Oops. Gismu support is borked, and I’m looking into it. In the meantime, 1.4 is still serviceable.
Packrat parser dependency analysis memoization optimization
Bryan Ford has written some nice papers on PEG parsers and the packrat algorithm. (Wikipedia on parsers.) In a nutshell, the packrat algorithm parses PEGs (parsing expression grammars) with unlimited lookahead in the input stream, using reasonable resources, by memoizing the results of grammar rules: the result of a rule is stored for each input position where it is tried, so that the rule is evaluated at most once per position, and usually far less often. In section 4.4.2 of his thesis, he lists (among others) an optimization that helps alleviate the large memory requirements of packrat parsers. By analyzing a PEG before using it to parse an input (or before generating a parser), it can be determined that some rules will not call others recursively, and thus can be de-memoized with the parser remaining O(n) (though a bit slower).
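To make the memoization concrete, here is a minimal sketch in Python. It is not Ford's implementation. The two-rule grammar, the `Packrat` class, and the `calls` counter are all made up for illustration. It only recognizes input (returning the end position of a match) rather than building a parse tree.

```python
from typing import Dict, Optional, Tuple

# A rule result is the position just past the match, or None on failure.
Result = Optional[int]


class Packrat:
    """Recognizer for a hypothetical two-rule PEG:

        Sum <- Num '+' Sum / Num
        Num <- [0-9]+
    """

    def __init__(self, text: str) -> None:
        self.text = text
        # (rule name, input position) -> result of that rule at that position
        self.memo: Dict[Tuple[str, int], Result] = {}
        self.calls = 0  # number of rule bodies actually evaluated

    def apply(self, name: str, pos: int) -> Result:
        """Evaluate a rule at a position at most once; reuse the stored result."""
        key = (name, pos)
        if key not in self.memo:
            self.calls += 1
            self.memo[key] = getattr(self, "rule_" + name)(pos)
        return self.memo[key]

    def rule_Num(self, pos: int) -> Result:
        end = pos
        while end < len(self.text) and self.text[end].isdigit():
            end += 1
        return end if end > pos else None

    def rule_Sum(self, pos: int) -> Result:
        # First alternative: Num '+' Sum
        after_num = self.apply("Num", pos)
        if after_num is not None and self.text[after_num:after_num + 1] == "+":
            rest = self.apply("Sum", after_num + 1)
            if rest is not None:
                return rest
        # Second alternative: Num. Its result at this position is already in
        # the memo table, so backtracking here costs a lookup, not a rescan.
        return self.apply("Num", pos)


parser = Packrat("1+22+333")
print(parser.apply("Sum", 0), parser.calls, len(parser.memo))
```

The memo table holds one entry per (rule, position) pair that was tried, which is where both the linear-time guarantee and the large memory footprint come from.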
This is a good start. But another optimization I don’t see mentioned anywhere (possibly because it doesn’t apply to a parser implemented in Haskell) is to perform an analysis on rules to automatically determine backtracking scopes. Once certain rules complete successfully (usually medium-level rules, like “statement” in a grammar for a C-like language), it should be possible to guarantee that a whole class of lower-level ones (lexing rules, often) will not be called again on a position inside the text matched by the higher-level rule. By grouping rules’ memoization into related structures, it should be possible to easily and quickly free large blocks of intermediate results as these higher-level statements are matched.
This analysis would be pretty obvious in certain places for a human to give hints about. Whether it’s possible for the parser (or parser compiler) to perform this analysis usefully, I’m not sure.
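Here is a rough Python sketch of the hand-hinted version of this idea: a memo table grouped by input position, where a human tells it which low-level rules are safe to forget once a higher-level rule has matched. Everything here (`ScopedMemo`, `scoped_rules`, `commit`, the rule names) is hypothetical, and the sketch simply trusts the hint; it does not perform the grammar analysis that would justify it.

```python
from typing import Dict, Optional, Set

# Sentinel that distinguishes "never computed" from a memoized failure (None).
MISSING = object()


class ScopedMemo:
    """Memo table that can drop low-level entries once a span is committed."""

    def __init__(self, scoped_rules: Set[str]) -> None:
        # Rules assumed (by a human hint, not by analysis) never to be
        # retried inside a span after an enclosing rule has matched it.
        self.scoped_rules = scoped_rules
        # input position -> {rule name -> end position, or None on failure}
        self.by_pos: Dict[int, Dict[str, Optional[int]]] = {}

    def get(self, rule: str, pos: int):
        return self.by_pos.get(pos, {}).get(rule, MISSING)

    def put(self, rule: str, pos: int, result: Optional[int]) -> None:
        self.by_pos.setdefault(pos, {})[rule] = result

    def commit(self, start: int, end: int) -> None:
        """Free entries for scoped rules at positions in [start, end)."""
        for pos in range(start, end):
            row = self.by_pos.get(pos)
            if row is None:
                continue
            # The intersection is a fresh set, so deleting from row is safe.
            for rule in row.keys() & self.scoped_rules:
                del row[rule]
            if not row:
                del self.by_pos[pos]

    def size(self) -> int:
        return sum(len(row) for row in self.by_pos.values())


memo = ScopedMemo({"ident", "number"})
memo.put("ident", 0, 3)
memo.put("number", 4, 6)
memo.put("statement", 0, 7)
memo.commit(0, 7)    # "statement" matched [0, 7): forget the lexing results
print(memo.size())   # only the "statement" entry is left
```

A parser would call `commit` right after a rule like “statement” succeeds. If the hint is wrong and a scoped rule is tried again inside a committed span, the result is just recomputed, so the cost of a bad hint is time, not correctness.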
On the ambiguity of Lojban
Many say that Lojban is an unambiguous language. This is true in many respects, and is probably more true for Lojban than for practically any other language. Lojban has an unambiguous phonology (when pronounced clearly and without error) and an unambiguous grammar (again, when used correctly). In its primary word list (both for gismu and cmavo), it avoids polysemy, the assignment of multiple meanings to a word.
Now, this lack of ambiguity, as far as it goes, makes for extremely fascinating discussions about “little words”, like “the” and “so” and “a”, and about the various basic mechanisms of language. But what it does not do is substantially decrease the occurrence of ambiguity in language as it’s actually used.
Lojban rules; English sucks
While I really like the language Lojban and really wish there were three hundred hours in the day, it’s occurred to me recently that it should be possible to modify English to give it one of the advantages Lojban currently has over it, and let English still be pretty comprehensible. (A much more minor modification than, say, Loglish. In fact, Loglish isn’t really modified English so much as it is modified Lojban.)
The advantage I have in mind is Lojban’s lackadaisical tense system. You don’t ever have to specify the tense of a verb (selbri) unless you so desire. If you don’t specify, the tense is left up to context, including especially the tense you specified for any previous selbri, but also allowing other considerations.
Now, there are a few problems with trying to port over such a system to English. For one, a whole bunch of idioms are tied more closely to their sounds than the tenses of the verbs in them, so changing the verb tense could make them hard to recognize. And then, there’s just a general weirdness about not having verb tenses in English. But it’s still worth a try, for sure. So what technical issues are there?
There are some decisions to be made. How do you choose a verb tense to be the neutral tense? And how do you show positive tenses? It’d be nice if English had a simple infinitive, like, say, Spanish. That would be a natural candidate for the neutral tense. Alas, English’s infinitive is unwieldy for frequent use. (“English’s infinitive to be unwieldy for frequent use.”) One could take the most common verb tense, the present tense, first-person conjugation, and just use that everywhere. Or, maybe, the infinitive minus the “to”. Then when one wants to switch out of the neutral tense, one just uses some specific other tense. What happens if you want to use a non-neutral present tense? Not sure. Maybe “does” as an auxiliary verb? One could borrow “ca” (pronounced “sha”) from Lojban, but that’s not amenable to being immediately understood. Then again, the whole practice probably would take a bit of knowledge to read (and get the full meaning), so perhaps that’s not too much knowledge to require.
Another approach is to use a single form of every verb, and use markers for all indication of tense. That avoids the ambiguity problem between the neutral and non-neutral forms, but makes the text even more odd-sounding if the markers are the English auxiliaries, and increases the learning curve a lot if Lojban’s (more flexible and much more numerous) markers are used. But perhaps a bit more odd-sounding text is a reasonable sacrifice in this case, and the more comprehensible to the uninitiated, the better. So let’s see what this technique leads to. I’m going to go with the “infinitive minus ‘to’” form, and translate the first paragraph of this post. For past tense, I’ll use “did”, for future, “will”, and for progressive of any time, “has”. I think I’ll try using “is” as a present tense marker, as the word “is” isn’t used as a verb, since “be” is the infinitive version. Bolded regions indicate things I’m just not sure about.
While I really like the language Lojban and really wish there be three hundred hours in the day, it has did occur to me recently that it should be possible to modify English to give it one of the advantages Lojban currently have over it, and let English still be pretty comprehensible. (A much more minor modification than, say, Loglish. In fact, Loglish be not really modified English so much as it be modified Lojban.)
OK, so that was interesting. “Like” in the first sentence “sounds” present tense, even though it’s neutral (having no marker). And “isn’t” becomes “be not”, a very awkward construction that probably takes a moment to recognize. What is the present tense equivalent of “has occurred”? “Has been occurring” would be present progressive, but that’s not quite the same. The future tense would be “will have occurred”. There’s the present tense, completive, “has occurred”, as in “this breaking news just in: such-and-such has occurred.”
I’m not even sure what the verb is in the phrase “should be possible”. “Be” sort of sounds like it, but what the hell part of speech is “should” in that position? An auxiliary verb? If so, does it belong there as is? Well, I guess I’ve reached the end of my explicit knowledge of the English tense system.
That was pretty fun.