Languages Evolution: introduction of new keywords
Posted by forax on October 09, 2006 at 02:22 AM | Comments (9)
When you want to add features to a language
without breaking backward compatibility,
a widespread idea is that you can't add new keywords.
That is why we currently see weird proposals in the Java space
that try to reuse old keywords to express
new kinds of abstraction, for example,
synchronized (closure v0.2 section 3) or for (Neal Gafter's blog about for).
Why does introducing a new keyword break already written code?
When you specify a new keyword, you need to change the lexer to
recognize a sequence of characters as a new token.
Thus the lexer no longer recognizes this sequence as an
identifier.
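To make the breakage concrete, here is a minimal sketch of a lex-style word classifier with one global keyword set. The class name GlobalKeywordLexer and the tokenize method are hypothetical names made up for this illustration; they do not come from any real compiler. The same old line of code is tokenized twice, once with a pre-1.5 keyword set and once with "enum" added.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class GlobalKeywordLexer {
    /**
     * Splits the source into words and tags each one.
     * A word is a KEYWORD whenever it appears in the keyword set,
     * wherever it occurs in the program (hypothetical, simplified lexer).
     */
    static List<String> tokenize(String source, Set<String> keywords) {
        List<String> tokens = new ArrayList<>();
        for (String word : source.split("[^A-Za-z0-9_]+")) {
            if (word.isEmpty()) {
                continue; // split can yield an empty leading element
            }
            tokens.add((keywords.contains(word) ? "KEYWORD:" : "IDENT:") + word);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String oldCode = "Enumeration enum = v.elements();";
        // Old keyword set: "enum" is an ordinary identifier.
        System.out.println(tokenize(oldCode, Set.of("class")));
        // New keyword set: the same source no longer contains the identifier "enum".
        System.out.println(tokenize(oldCode, Set.of("class", "enum")));
    }
}
```

With the second keyword set, the word "enum" comes out tagged as a keyword, so a parser expecting a variable name at that position would reject code that used to compile.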
One magic solution is to use one or more special characters to
differentiate keywords from identifiers.
A lot of scripting languages use '$', '#' etc. to tag variables;
Perl6
is the best example.
Scripting languages use special characters not only
to simplify the lexing process but also to help their runtime system
choose between overloaded operations.
So adding a new keyword is not a major problem
for those languages.
Java is a strongly typed language so it doesn't need
such special characters, and
we are stuck as long as we continue to think of lexers the way
lex does.
The problem comes from the lexer, so the solution is to change
how the lexer works.
Contextual keywords
Let me take an example: "enum" is a new keyword introduced in 1.5
to declare enumerated types.
So the lexer of a 1.5 compiler
now recognizes "enum" as a keyword in the whole program.
But in fact, a language designer
only needs "enum" to be recognized as a keyword
in a type declaration, not in a block of code.
The solution is to use a lexer that implements contextual
keywords, i.e. a lexer that lets the parser activate or
deactivate the rules needed to recognize tokens depending on
the parser state.
enum Foo { // keyword
  public static void main(String[] args) {
    Enumeration enum = ...; // identifier
  }
}
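The idea can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Tatoo's actual API: the names ContextualLexerSketch, Terminal, ParserState and classify are all hypothetical, and the two parser states are a drastic simplification of what a generated parser would track. The point is that the lexer is handed the set of rules the parser considers active, and "enum" is a keyword only when the keyword rule is in that set.

```java
import java.util.EnumSet;
import java.util.Set;

public class ContextualLexerSketch {
    /** Terminals the lexer can produce (hypothetical, reduced to three). */
    enum Terminal { ENUM_KEYWORD, CLASS_KEYWORD, IDENTIFIER }

    /** Stand-in for the parser: the lexer rules that are active in each state. */
    enum ParserState {
        TYPE_DECLARATION(EnumSet.of(Terminal.ENUM_KEYWORD, Terminal.CLASS_KEYWORD)),
        BLOCK(EnumSet.of(Terminal.IDENTIFIER));

        final Set<Terminal> activeRules;

        ParserState(Set<Terminal> activeRules) {
            this.activeRules = activeRules;
        }
    }

    /** Classifies a word using only the rules the parser has activated. */
    static Terminal classify(String word, Set<Terminal> activeRules) {
        if (word.equals("enum") && activeRules.contains(Terminal.ENUM_KEYWORD)) {
            return Terminal.ENUM_KEYWORD;
        }
        if (word.equals("class") && activeRules.contains(Terminal.CLASS_KEYWORD)) {
            return Terminal.CLASS_KEYWORD;
        }
        if (activeRules.contains(Terminal.IDENTIFIER)) {
            return Terminal.IDENTIFIER;
        }
        throw new IllegalStateException("no active rule matches '" + word + "'");
    }

    public static void main(String[] args) {
        // At the start of a type declaration, "enum" is a keyword.
        System.out.println(classify("enum", ParserState.TYPE_DECLARATION.activeRules));
        // Inside a block of code, the keyword rule is off and "enum" is a name.
        System.out.println(classify("enum", ParserState.BLOCK.activeRules));
    }
}
```

In this sketch the keyword rules are tested before the identifier rule, so if a state activated both, the keyword would win. That choice is an assumption of the sketch.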
With two colleagues, I've written a new parser generator
named Tatoo
that generates this kind of lexer.
The tutorial is in French at this time because we haven't
had much time and our students need it.
But a translation will be available soon.
Slides in PDF and an article from PPPJ'06
are available in English.
Tatoo contains other innovative features like grammar versioning,
full NIO support (push lexer/parser),
lexing without unicode decoding, AST generator.
I will blog about those features later.
Comments
Comments are listed in date ascending order (oldest first)
-
You'd be in trouble if I wrote code like:
class enum {
enum en = new enum();
}
I think you'd need to either look ahead further, really hack the grammar or use an even more general category of grammar. None of these would result in a nice language. If local enums were permitted, then you would get into even more trouble.
On the other hand, you might rule that such uses of the character sequence aren't permitted by the Java code style. In that case you could have just used a two word keyword, say "enumclass".
Posted by: tackline on October 09, 2006 at 08:45 AM
-
No problem with your code: enum is recognized as a keyword only
when declaring a type. Afterwards, the rule is disabled, so enum is recognized
as an identifier.
You are right about local enums: the lexer will choose
"enum" as a keyword instead of as an identifier. It's not black magic :)
All you need is a classical grammar, from which all possible terminals at each position of a dotted production can be computed. At runtime, the parser only activates the rules that recognize the terminals corresponding to its state.
Rémi Forax
Posted by: forax on October 09, 2006 at 02:08 PM
-
Fortran has no keywords, so the following is legal:
if = 1
if if.eq.1 then ...
In fact you don't even have to use spaces:
if=1
ifif.eq.1then ...
So if you really try hard you can write the parser. But why bother, you can add a new keyword and add a compiler switch, e.g. -source 1.3 to turn off assert and enum. There can't be that many files that you must use a new construct in, can't split into old and new files, and you need to use old code that inconveniently uses the new keyword. It just doesn't seem worth all the trouble of writing the difficult parser. Which is not only difficult for the computer but also difficult for the programmer.
Posted by: hlovatt on October 10, 2006 at 12:27 AM
-
hlovatt,
writing a parser is easy, you just need a good parser generator :)
Fortran is not known to be lexer friendly; all compiler books have
an example in Fortran.
Posted by: forax on October 10, 2006 at 01:17 AM
-
One of Java's advantages over C++ has been the relative ease with which the language can be lexed/parsed. This has made it much easier to build tools for the language and accordingly more have been built. Some of this ease was lost with 1.5 but it remains much easier than C++.
On the other hand, the new API giving access to the parse tree reduces the number of people who need to write parsers; most can just piggyback off the system-provided classes.
Posted by: mthornton on October 10, 2006 at 06:48 AM
-
In Algol 60 (yes, that was 1960), keywords were distinguished from names by enclosing them in single quotes. They were then typeset in bold.
As C.A.R. Hoare said: "Here is a language so far ahead of its time, that it was not only an improvement on its predecessors, but also on nearly all its successors."
Posted by: cayhorstmann on October 10, 2006 at 08:09 AM
-
@mthornton,
A C++ lexer needs semantic information; what I propose only needs
syntactic information.
About the compiler Tree API, I am still waiting for a tree builder.
@cayhorstmann,
I've read the page about ALGOL 60 on Wikipedia,
it's impressive.
Posted by: forax on October 11, 2006 at 12:30 AM
-
I still prefer lex's way of doing things.
Every time someone tries to do something like this we end up with a PL/I-like language.
IF IF = THEN THEN WRITE IF
makes me shiver.
Posted by: pjmlp on October 11, 2006 at 03:49 AM
-
@pjmlp,
with the lexer of Tatoo you can choose whether a keyword is contextual or
not. For example, a rule that doesn't send a terminal to the parser,
like the rule that defines blanks, is not contextual by default.
Rémi
Posted by: forax on October 12, 2006 at 06:03 AM