Qt + Flex = Problems

July 7, 2006

GNU Flex, “the fast lexical analyser generator”, is a handy tool for writing compiler-related software, as those who have used it know. It saves tremendous amounts of code, allows tokenization rules to be expressed concisely, and generates a fast tokenizer.

For that reason, it is a pity that Flex cannot be used in serious projects, because it doesn’t support Unicode.

I think this is a significant problem for KDE. Not because developers instead choose hand-written tokenizers in order to support Unicode, because they don’t. They use Flex, despite its severe limitation. Hence, Unicode-ignorant code is introduced, and KDE’s brand of multilingual, internationalized software is diminished.

Although hand-written tokenizers would be better, they have the problem of requiring a large effort. For example, Patternist’s tokenizer, which handles XQuery/XPath, is big and complex, and Flex could surely help simplify it.

I think the solution is to add a QString mode to Flex. When enabled, the tokenizer would consume QString/QChar and therefore in a transparent way take care of Unicode support, while the developer still uses the comforts of Flex. I know nothing about Flex’s internals, but I suspect it would require quite some work.

I think such a thing would mean a lot for KDE, given the number of tokenizers in use across the repository. If I were a Troll (of the good kind), I would consider sponsoring the work to get this infrastructure in place, because it would be an important help for KDE and also a plus for Qt in general.

Posted by Frans Englich
Filed in Software development


19 Responses to “Qt + Flex = Problems”

Vincent Says:

July 7, 2006 at 6:11 pm
Have you already looked at antlr ?
http://www.antlr.org/

I don’t know about push/pull mode, but it supports Unicode.

(see u tomorrow on irc 😉

Matej Cepl Says:

July 7, 2006 at 6:24 pm
And what about
http://software.decisionsoft.com/software/flex-2.5.4a-unicode-patch
?

Just browsed through Google, and I am not a programmer. Just curious, because it seems unbelievable that somebody wouldn’t have patched it to support Unicode.

Wotan Says:

July 7, 2006 at 6:28 pm
Antlr is a grammar analyzer, a tool more powerful than flex, which is a simple lexical analyzer. Antlr could be compared with Yacc for instance.

Vincent Says:

July 7, 2006 at 6:33 pm
Wotan> Yes, you’re right. My answer was too quick 🙂 Antlr replaces the pair lex/yacc (or flex/bison).

englich Says:

July 7, 2006 at 8:59 pm
I took a quick look at antlr, and it seems it requires linking against an external library and including headers. So this is a rather heavy dependency and would require hosting antlr inside Patternist. It seems Java is needed for the parser/scanner generation, but I think that’s ok since it will only be run by Patternist developers working on that particular area.

However, one should really look deeper into antlr in order to be able to dismiss/accept it, which surely should be done if it’s decided to rewrite Patternist’s tokenizer/parser code into something tight and pretty. One should probably have started at least sketching on the XSL-T parts, since it will be another tokenizer interfacing the parser.

Regarding Flex/Unicode:

I’m aware of the Unicode patch floating around. I asked about integrating it on the help-flex list, and got the answer that “What you call the ‘Unicode patch’ I think is merely a patch that redefines tables from being 8-bit to 16-bit. Unicode now exceeds that number. So it is dead.”

I’ve also briefly discussed Unicode support with Flex maintainer Jonathan S. Shapiro, who described the issue in Flex in some detail. Among other things, he wrote:

“The C library support for internationalization is laughable. Even in a UNICODE locale, you cannot assume within the standard that a wchar_t holds a unicode code point.

Further, the C library standards don’t include any concept comparable to unicode character classes.”

So the difficulty of adding Unicode support also seems to have to do with the general lack of portable features when using C.

Kevin Kofler Says:

July 7, 2006 at 9:06 pm
Why does “Unicode” have to be UTF-16? What’s wrong with UTF-8? Parsing UTF-8 with Flex should work just fine, because UTF-8 is designed so that the programs often don’t even notice you’re using Unicode and still work.
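
The property this comment relies on can be illustrated with a small sketch. It is not Flex itself, only a byte-wise split that stands in for a byte-oriented scanner rule; the sample text and the helper name `split_on_ascii_space` are made up for illustration.

```python
# Sketch: why a byte-oriented scanner can tokenize UTF-8 input.
# In UTF-8, every byte of a non-ASCII character is >= 0x80, so an ASCII
# delimiter such as ' ' or '(' never appears inside a multi-byte sequence.

def split_on_ascii_space(data):
    """Split raw bytes on the ASCII space byte, like a byte-wise lexer rule."""
    return data.split(b" ")

text = "let größe = 42"  # hypothetical input with two non-ASCII letters

# Tokenize the raw bytes, then decode each token back to text.
tokens = [t.decode("utf-8") for t in split_on_ascii_space(text.encode("utf-8"))]

# Collect every byte that belongs to a non-ASCII character and check
# that each one has its high bit set.
non_ascii_bytes = [b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")]
high_bit_set = all(b >= 0x80 for b in non_ascii_bytes)
```

What this does not give you is character-level matching: in a byte-oriented scanner a pattern such as `.` matches one byte, not one character, so rules that need to name specific non-ASCII characters still have to be written in terms of their byte sequences.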

Quintesse Says:

July 8, 2006 at 12:46 am
“Unicode now exceeds that number. So it is dead”

Which is nonsense because supporting 16bit Unicode would pretty much fix 99.999% of the cases which now go wrong. The characters that are in the extended part of the Unicode spec are so incredibly specific that the chances that you need them in a Flex parser are vanishingly small.

And if you really wanted to, you could still support the full Unicode range using only 16 bits, through some kind of escaping/extension. This is what Java does, for example.
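
The 16-bit extension mechanism referred to here is the UTF-16 surrogate pair. As a minimal sketch (the function names are chosen for illustration and are not from any tool discussed in this thread), a code point above U+FFFF is split into two 16-bit units like this:

```python
def to_surrogate_pair(code_point):
    """Encode a code point above U+FFFF as a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Decode a (high, low) surrogate pair back into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range.
pair = to_surrogate_pair(0x1D11E)
```

A 16-bit scanner sees such a character as two consecutive units, so a rule that wants to match it has to match both halves in sequence.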

Jakob Petsovits Says:

July 8, 2006 at 9:03 am
Hi Frans,

the solution to your (and my) problem is at http://lists.gnu.org/archive/html/help-flex/2005-01/msg00043.html

I also needed Unicode support for my Java parser for KDevelop, and it’s doable. You have to know the supported character ranges, but if you got them then the script that you can find there is able to transform stuff like 0x10FF into a Flex regexp.

Hope that helps!
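
For readers who cannot reach the linked script, the general idea can be sketched as follows. This is not the script from the mailing list, only a guess at the simplest case: turning one code point into the byte sequence a byte-oriented Flex rule would have to match, assuming the input is UTF-8. The function name `utf8_flex_pattern` is hypothetical.

```python
def utf8_flex_pattern(code_point):
    """Return a Flex-style pattern of \\xHH escapes matching one code point in UTF-8."""
    encoded = chr(code_point).encode("utf-8")
    # One \xHH escape per encoded byte, in order.
    return "".join("\\x%02X" % byte for byte in encoded)

# The example value from the comment above: U+10FF is three bytes in UTF-8.
pattern = utf8_flex_pattern(0x10FF)
print(pattern)
```

Whole character ranges are more work than single code points, since a range has to be split wherever the encoded length or a leading byte changes; that part is left out of this sketch.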

Jakob Petsovits Says:

July 8, 2006 at 9:05 am
Oh right, and you can find the Flex file for the Java parser here: http://websvn.kde.org/branches/work/kdevelop-pg/examples/java/java_lexer.ll?view=auto

Ramesh N Says:

July 9, 2006 at 10:25 am
PHP (www.php.net) seems to use Flex for parsing the PHP language. And it looks like they have added Unicode support.

http://cvs.php.net/viewvc.cgi/ZendEngine2/

Carewolf Says:

July 10, 2006 at 4:38 pm
We have a small modification script in KHTML we use to rewrite the output of flex so it works with unicode.

John Snelson Says:

July 28, 2006 at 4:27 pm
Unicode support can come in many flavours. As already mentioned, the UTF8 encoding supports all of Unicode, and flex can already handle that.

I use the “unicode patch” with flex on a project that parses UTF-16, and it works great. You /can/ handle characters above the 16 bit point using UTF-16 surrogate pairs – but it’s unlikely that you’ll ever need to match a character in this range.

Rathish Says:

October 5, 2006 at 10:28 am
Hi John,
How did you add Unicode (UTF-16) support to Flex? Can you share your knowledge? I am stuck implementing Unicode support in Flex. :( I have tried the patch from the decisionsoft site, but for some reason it is not working.
How difficult would it be to migrate an existing Flex grammar file to Antlr? If anyone has tried it, can you please share your experience?

John Snelson Says:

October 31, 2006 at 7:03 pm
The unicode patch just works for me. You have to make sure you are applying it to the correct version of flex (not the latest), and that you are using the “-U” option when you invoke it.

joe Says:

December 27, 2006 at 10:26 pm
“The unicode patch just works for me. You have to make sure you are applying it to the correct version of flex (not the latest), and that you are using the “-U” option when you invoke it.”

Which version would this be? Where can I get the patch – please explain!!!

John Snelson Says:

January 11, 2007 at 2:57 pm
As mentioned in a previous comment, the patch can be found here:

http://software.decisionsoft.com/software/flex-2.5.4a-unicode-patch

You might be able to work out the version it applies to ;-).

Mary Says:

April 7, 2008 at 8:24 pm
Hi guys,

can somebody send me the Unicode patch for Flex, because the link above is not working anymore?

thanks

Tatjana Says:

December 26, 2008 at 1:47 am
Hi guys!

I would also like to have the Unicode patch for Flex. It’s totally cruel that something that was free a year ago now isn’t.

John Snelson Says:

June 8, 2009 at 11:10 am
The patch disappeared from its original location and the link above, so I’m hosting it here now:

http://xqilla.sourceforge.net/flex/flex-2.5.4a-unicode-patch
