Qt + Flex = Problems

July 7, 2006

GNU Flex, “the fast lexical analyser generator”, is a handy tool for writing compiler-related software, as those who have used it know. It saves tremendous amounts of code, allows tokenization rules to be expressed concisely, and generates a fast tokenizer.

For that reason, it is a pity that Flex cannot be used in serious projects, because it doesn’t support Unicode.

I think this is a significant problem for KDE. Not because developers choose hand-written tokenizers instead in order to support Unicode, because they don’t. They use Flex despite its severe limitation. Hence Unicode-ignorant code is introduced, and KDE’s brand of multilingual, internationalized software is diminished.

Although hand-written tokenizers would be better, they have the problem of requiring a large effort. For example, Patternist’s tokenizer, which handles XQuery/XPath, is big and complex, and Flex could surely help simplify it.

I think the solution is to add a QString mode to Flex. When enabled, the tokenizer would consume QString/QChar and thereby take care of Unicode support transparently, while the developer still enjoys the comforts of Flex. I know nothing about Flex’s internals, but I suspect it would require quite some work.

I think such a thing would mean a lot for KDE, given the number of tokenizers in use across the repository. If I were a Troll (of the good kind), I would consider sponsoring the work to get this infrastructure in place, because it would be an important help for KDE and also a plus for Qt in general.

19 Responses to “Qt + Flex = Problems”

  1. Vincent Says:

    Have you already looked at antlr?
    http://www.antlr.org/

    I don’t know about push/pull mode, but it supports Unicode.

    (See you tomorrow on IRC 😉)

  2. Matej Cepl Says:

    And what about
    http://software.decisionsoft.com/software/flex-2.5.4a-unicode-patch
    ?

    I just browsed through Google and I am not a programmer. Just curious, because it seems unbelievable that nobody would have patched it to support Unicode.

  3. Wotan Says:

    Antlr is a grammar analyzer, a tool more powerful than flex, which is a simple lexical analyzer. Antlr could be compared with Yacc for instance.

  4. Vincent Says:

    Wotan> Yes, you’re right. My answer was too quick 🙂 Antlr replaces the pair lex/yacc (or flex/bison).

  5. englich Says:

    I took a quick look at antlr, and it seems it requires linking against an external library and including headers. So this is a rather heavy dependency and would require hosting antlr inside Patternist. It seems Java is needed for the parser/scanner generation, but I think that’s ok since it will only be run by Patternist developers working on that particular area.

    However, one should really look deeper into antlr in order to be able to dismiss/accept it, which surely should be done if it’s decided to rewrite Patternist’s tokenizer/parser code into something tight and pretty. One should probably have started at least sketching on the XSL-T parts, since it will be another tokenizer interfacing the parser.

    Regarding Flex/Unicode:

    I’m aware of the Unicode patch floating around. I asked about integrating it on the help-flex list, and got the answer that “What you call the ‘Unicode patch’ I think is merely a patch that redefines tables from being 8-bit to 16-bit. Unicode now exceeds that number. So it is dead.”

    I’ve also briefly discussed Unicode support with Flex maintainer Jonathan S. Shapiro, who described the issue in Flex in some detail. Among other things, he wrote:

    “The C library support for internationalization is laughable. Even in a UNICODE locale, you cannot assume within the standard that a wchar_t holds a unicode code point.

    Further, the C library standards don’t include any concept comparable to unicode character classes.”

    So the difficulty of adding Unicode support also seems to have to do with the general lack of portable features when using C.

  6. Kevin Kofler Says:

    Why does “Unicode” have to be UTF-16? What’s wrong with UTF-8? Parsing UTF-8 with Flex should work just fine, because UTF-8 is designed so that the programs often don’t even notice you’re using Unicode and still work.
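
    The property behind this can be shown with a minimal Python sketch (not Flex itself): every byte of a multi-byte UTF-8 sequence is 0x80 or above, so a byte-oriented scanner that only looks for ASCII delimiters never cuts a character in half. The function name `split_on_ascii_space` and the sample text are assumptions made up for illustration.

```python
def split_on_ascii_space(data):
    """Byte-oriented 'tokenizer': knows nothing about Unicode, splits on 0x20 only."""
    tokens = []
    current = bytearray()
    for byte in data:
        if byte == 0x20:              # ASCII space is the only delimiter
            if current:
                tokens.append(bytes(current))
                current = bytearray()
        else:
            current.append(byte)      # includes UTF-8 lead and continuation bytes
    if current:
        tokens.append(bytes(current))
    return tokens


# Hypothetical sample input containing 2-byte and 3-byte UTF-8 sequences.
text = "naïve größe 日本語"
tokens = split_on_ascii_space(text.encode("utf-8"))

# Each token is still valid UTF-8, because no multi-byte sequence contains 0x20.
decoded = [token.decode("utf-8") for token in tokens]
```

    The caveat is that a byte-oriented scanner still counts bytes, not characters: a rule such as `.` or a character class only ever sees one byte at a time, so non-ASCII characters in a pattern have to be spelled out as byte sequences.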

  7. Quintesse Says:

    “Unicode now exceeds that number. So it is dead”

    Which is nonsense because supporting 16bit Unicode would pretty much fix 99.999% of the cases which now go wrong. The characters that are in the extended part of the Unicode spec are so incredibly specific that the chances that you need them in a Flex parser are vanishingly small.

    And if you really wanted to, you could still support the full Unicode range using only 16 bits, by using some kind of escaping/extension. This is what Java does, for example.
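
    The 16-bit escaping scheme in question is the UTF-16 surrogate pair. A minimal Python sketch of the arithmetic follows; the function names are made up for illustration, and U+1D11E is just an example of a code point above U+FFFF.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into two 16-bit UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) does not fit into 16 bits.
high, low = to_surrogate_pair(0x1D11E)
round_trip = from_surrogate_pair(high, low)
```

    A scanner with 16-bit tables would then see such a character as two consecutive units, one in the range 0xD800-0xDBFF and one in 0xDC00-0xDFFF, and a rule that needs it would have to match both.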

  8. Jakob Petsovits Says:

    Hi Frans,

    the solution to your (and my) problem is at http://lists.gnu.org/archive/html/help-flex/2005-01/msg00043.html

    I also needed Unicode support for my Java parser for KDevelop, and it’s doable. You have to know the supported character ranges, but once you have them, the script that you can find there is able to transform code points like 0x10FF into a Flex regexp.
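
    The linked script is not reproduced here. As a rough idea of what such a transformation involves, this is a minimal Python sketch that turns a range of code points into a Flex pattern over UTF-8 bytes. The function names `flex_escape` and `range_to_flex` are hypothetical, and the approach (enumerate the range, group encodings that share all but the last byte) is an assumption chosen for brevity; it is only practical for modest ranges.

```python
from itertools import groupby


def flex_escape(byte_values):
    """Write bytes as Flex hex escapes, e.g. b'\\xe1\\x83' -> \\xE1\\x83."""
    return "".join(f"\\x{b:02X}" for b in byte_values)


def range_to_flex(first, last):
    """Return a Flex pattern matching the UTF-8 encoding of first..last (inclusive)."""
    encoded = [
        chr(cp).encode("utf-8")
        for cp in range(first, last + 1)
        if not 0xD800 <= cp <= 0xDFFF      # surrogates have no UTF-8 encoding
    ]
    alternatives = []
    # Consecutive code points with the same leading bytes differ only in the
    # last byte, so each group collapses into one byte-level character class.
    for prefix, group in groupby(encoded, key=lambda e: e[:-1]):
        tails = [e[-1] for e in group]
        lo, hi = tails[0], tails[-1]
        if lo == hi:
            tail = flex_escape([lo])
        else:
            tail = "[" + flex_escape([lo]) + "-" + flex_escape([hi]) + "]"
        alternatives.append(flex_escape(prefix) + tail)
    return "|".join(alternatives)


single = range_to_flex(0x10FF, 0x10FF)     # one code point, three UTF-8 bytes
latin = range_to_flex(0xE9, 0xEB)          # U+00E9..U+00EB share the lead byte 0xC3
mixed = range_to_flex(0x7E, 0x81)          # crosses the ASCII / two-byte boundary
```

    The resulting strings can be pasted into a Flex definition; ranges that cross an encoding-length boundary come out as several alternatives joined with `|`.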

    Hope that helps!

  9. Jakob Petsovits Says:

    Oh right, and you can find the Flex file for the Java parser here: http://websvn.kde.org/branches/work/kdevelop-pg/examples/java/java_lexer.ll?view=auto

  10. Ramesh N Says:

    PHP (www.php.net) seems to use Flex for parsing the PHP language, and it looks like they have added Unicode support.

    http://cvs.php.net/viewvc.cgi/ZendEngine2/

  11. Carewolf Says:

    We have a small modification script in KHTML that we use to rewrite the output of flex so it works with Unicode.

  12. John Snelson Says:

    Unicode support can come in many flavours. As already mentioned, the UTF8 encoding supports all of Unicode, and flex can already handle that.

    I use the “unicode patch” with flex on a project that parses UTF-16, and it works great. You /can/ handle characters above the 16 bit point using UTF-16 surrogate pairs – but it’s unlikely that you’ll ever need to match a character in this range.

  13. Rathish Says:

    Hi John,
    How did you add Unicode (UTF-16) support to flex? Can you share your knowledge? I am stuck implementing Unicode support in Flex. :( I have tried the patch from the decisionsoft site, but for some reason it is not working.
    How difficult would it be to migrate an existing grammar file from Flex to Antlr? If anyone has tried it, can you please share your experience?

  14. John Snelson Says:

    The unicode patch just works for me. You have to make sure you are applying it to the correct version of flex (not the latest), and that you are using the “-U” option when you invoke it.

  15. joe Says:

    “The unicode patch just works for me. You have to make sure you are applying it to the correct version of flex (not the latest), and that you are using the “-U” option when you invoke it.”

    Which version would this be? Where can I get the patch – please explain!!!

  16. John Snelson Says:

    As mentioned in a previous comment, the patch can be found here:

    http://software.decisionsoft.com/software/flex-2.5.4a-unicode-patch

    You might be able to work out the version it applies to ;-).

  17. Mary Says:

    Hi guys,

    Can somebody send me the Unicode patch for Flex? The link above is not working anymore.

    thanks

  18. Tatjana Says:

    Hi guys!

    I would also like to have the Unicode patch for Flex. It’s totally cruel that something that was free a year ago now isn’t.

  19. John Snelson Says:

    The patch disappeared from its original location and the link above, so I’m hosting it here now:

    http://xqilla.sourceforge.net/flex/flex-2.5.4a-unicode-patch

