



Rémi Forax's Blog

Languages Evolution: introduction of new keywords

Posted by forax on October 09, 2006 at 02:22 AM | Comments (9)

When you want to add features to a language without breaking backward compatibility, a widespread idea is that you can't add new keywords.

That is why we currently see odd proposals in the Java space that try to reuse old keywords to express new kinds of abstraction, for example synchronized (closure v0.2 section 3) or for (Neal Gafter's blog post about for).

Why does introducing a new keyword break existing code?

When you specify a new keyword, you need to change the lexer so that it recognizes a given sequence of characters as a new token. As a result, the lexer no longer recognizes that sequence as an identifier.
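To make the breakage concrete, here is a minimal sketch of a lexer that classifies words against a single global keyword table. It is not javac and not Tatoo: the class name, the two keyword sets (small illustrative subsets) and the token format are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: a lexer with one global keyword table.
final class GlobalKeywordLexer {
    // Illustrative subsets of the reserved words, not the full lists.
    static final Set<String> KEYWORDS_1_4 = Set.of("class", "new", "public", "static", "void");
    static final Set<String> KEYWORDS_1_5 = Set.of("class", "new", "public", "static", "void", "enum");

    // Splits the source into words and tags each one using only the keyword table.
    static List<String> tokenize(String source, Set<String> keywords) {
        List<String> tokens = new ArrayList<>();
        for (String word : source.trim().split("\\W+")) {
            if (word.isEmpty()) continue; // split() may yield a leading empty string
            String kind = keywords.contains(word) ? "KEYWORD" : "IDENTIFIER";
            tokens.add(kind + "(" + word + ")");
        }
        return tokens;
    }

    public static void main(String[] args) {
        String oldCode = "Enumeration enum";
        // prints [IDENTIFIER(Enumeration), IDENTIFIER(enum)]
        System.out.println(tokenize(oldCode, KEYWORDS_1_4));
        // prints [IDENTIFIER(Enumeration), KEYWORD(enum)]
        System.out.println(tokenize(oldCode, KEYWORDS_1_5));
    }
}
```

The same source text lexes differently once "enum" joins the table, so a variable named enum stops being an identifier everywhere in the program.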

One magic solution is to use one or more special characters to differentiate keywords from identifiers. Many scripting languages use '$', '#', etc. to tag variables; Perl 6 is the best example.
Scripting languages use special characters not only to simplify the lexing process but also to help their runtime system choose between overloaded operations. So adding a new keyword is not a major problem for those languages.

Java is a strongly typed language, so it doesn't need such special characters, and we are stuck as long as we continue to see lexers the way lex does. The problem comes from the lexer, so the solution is to change how the lexer works.

Contextual keywords

Let me take an example: "enum" is a new keyword introduced in 1.5 to declare enumerated types. So the lexer of a 1.5 compiler now recognizes "enum" as a keyword in the whole program.
But in fact, the language designer only needs "enum" to be recognized as a keyword in a type declaration, not in a block of code.

The solution is to use a lexer that implements contextual keywords, i.e. a lexer that lets the parser activate or deactivate the rules used to recognize tokens, depending on the parser state.

enum Foo {                                 // keyword
  public static void main(String[] args) {
    Enumeration enum=... // identifier
  }
}
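The idea can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Tatoo's API: the class and method names are invented, the two keyword sets are arbitrary, and the "parser" is a stand-in that only tracks brace depth (depth 0 or 1 means a declaration context, depth 2 or more means a block of code).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a contextual-keyword lexer; not Tatoo's API.
final class ContextualLexerSketch {
    // A token is a word or any single non-blank character.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*|\\S");

    // Assumed keyword sets: "enum" is active only where something can be declared.
    static final Set<String> DECLARATION_KEYWORDS = Set.of("enum", "class", "void");
    static final Set<String> BLOCK_KEYWORDS = Set.of("new", "return");

    private final Matcher matcher;

    ContextualLexerSketch(String source) {
        this.matcher = TOKEN.matcher(source);
    }

    // The caller (the parser) passes the keywords that are active in its current state.
    // Returns null at end of input.
    String next(Set<String> activeKeywords) {
        if (!matcher.find()) return null;
        String text = matcher.group();
        if (activeKeywords.contains(text)) return "KEYWORD(" + text + ")";
        if (Character.isJavaIdentifierStart(text.charAt(0))) return "IDENTIFIER(" + text + ")";
        return "PUNCT(" + text + ")";
    }

    // Stand-in for a parser: the brace depth decides which keyword set is active.
    static List<String> tokenize(String source) {
        ContextualLexerSketch lexer = new ContextualLexerSketch(source);
        List<String> tokens = new ArrayList<>();
        int depth = 0;
        while (true) {
            Set<String> active = depth < 2 ? DECLARATION_KEYWORDS : BLOCK_KEYWORDS;
            String token = lexer.next(active);
            if (token == null) break;
            if (token.equals("PUNCT({)")) depth++;
            if (token.equals("PUNCT(})")) depth--;
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The first "enum" becomes KEYWORD(enum), the second one IDENTIFIER(enum).
        System.out.println(tokenize("enum Foo { void m() { Enumeration enum; } }"));
    }
}
```

A real parser would derive the active set from its grammar state rather than from brace counting; the point here is only that the lexer no longer decides on its own what counts as a keyword.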

With two colleagues, I've written a new parser generator named Tatoo that generates this kind of lexer.
The tutorial is currently in French because we are short on time and our students need it, but a translation will be available soon. Slides in PDF and an article from PPPJ'06 are available in English.

Tatoo contains other innovative features such as grammar versioning, full NIO support (push lexer/parser), lexing without Unicode decoding, and an AST generator. I will blog about those features later.


Comments
Comments are listed in date ascending order (oldest first)

  • You'd be in trouble if I wrote code like:


    class enum {
    enum en = new enum();
    }


    I think you'd need to either look ahead further, really hack the grammar, or use an even more general category of grammar. None of these would result in a nice language. If local enums were permitted, then you would get into even more trouble.

    On the other hand, you might rule that such uses of the character sequence aren't permitted by the Java code style. In that case you could have just used a two word keyword, say "enumclass".

    Posted by: tackline on October 09, 2006 at 08:45 AM


  • No problem with your code: enum is recognized as a keyword only
    when declaring a type. After that, the rule is disabled, so enum is recognized
    as an identifier.

    You are right about local enums: the lexer will choose
    "enum" as a keyword instead of as an identifier. It's not black magic :)

    What you need is just a classical grammar, so that all possible terminals can be computed at each position of a dotted production. At runtime, the parser only activates the rules that recognize the terminals corresponding to its state.

    Rémi Forax

    Posted by: forax on October 09, 2006 at 02:08 PM

  • Fortran has no keywords, so the following is legal:

    if = 1
    if if.eq.1 then ...

    In fact you don't even have to use spaces:

    if=1
    ifif.eq.1then ...

    So if you really try hard you can write the parser. But why bother? You can add a new keyword and add a compiler switch, e.g. -source 1.3 to turn off assert and enum. There can't be that many files in which you must use a new construct, which you can't split into old and new files, and which need old code that inconveniently uses the new keyword. It just doesn't seem worth all the trouble of writing the difficult parser, which is difficult not only for the computer but also for the programmer.

    Posted by: hlovatt on October 10, 2006 at 12:27 AM


  • hlovatt,
    writing a parser is easy, you just need a good parser generator :)

    Fortran is not known to be lexer friendly; all compiler books have
    an example in Fortran.

    Posted by: forax on October 10, 2006 at 01:17 AM

  • One of Java's advantages over C++ has been the relative ease with which the language can be lexed/parsed. This has made it much easier to build tools for the language and accordingly more have been built. Some of this ease was lost with 1.5 but it remains much easier than C++.
    On the other hand, the new API giving access to the parse tree reduces the number of people who need to write parsers; most can just piggyback off the system-provided classes.

    Posted by: mthornton on October 10, 2006 at 06:48 AM

  • In Algol 60 (yes, that was 1960), keywords were distinguished from names by enclosing them in single quotes. They were then typeset in bold.
    As C.A.R. Hoare said: "Here is a language so far ahead of its time, that it was not only an improvement on its predecessors, but also on nearly all its successors."

    Posted by: cayhorstmann on October 10, 2006 at 08:09 AM


  • @mthornton,
    A C++ lexer needs semantic information; what I propose only needs
    syntactic information.

    About the compiler Tree API, I'm still waiting for a tree builder.

    @cayhorstmann,
    I've read the page about ALGOL 60 on Wikipedia,
    it's impressive.

    Posted by: forax on October 11, 2006 at 12:30 AM

  • I still prefer lex's way of doing things.

    Every time someone tries to do something like this we end up with a PL/I-like language.


    IF IF = THEN THEN WRITE IF


    makes me shiver.

    Posted by: pjmlp on October 11, 2006 at 03:49 AM


  • @pjmlp,

    with the lexer of Tatoo you can choose whether a keyword is contextual or
    not. For example, a rule that doesn't send a terminal to the parser,
    like the rule that defines blanks, is not contextual by default.

    Rémi

    Posted by: forax on October 12, 2006 at 06:03 AM


