Making Boost.Tokenizer Useful

Recently I began writing a scripting language and interpreter. To this end, I needed to develop a scanner. I had a brief look at Boost.Tokenizer, but the built-in TokenizerFunction objects were unhelpful, so I ended up implementing my own basic scanner. Recently I came back and took a second look at Boost.Tokenizer. Inside is actually a useful library waiting to get out; you just have to implement your own TokenizerFunction objects. Let's start with a basic one:

class find_str_tfunc {
public:
	find_str_tfunc( const std::string & str ) : str_( str ) { }
	bool operator( )( std::string::const_iterator & next, std::string::const_iterator end, std::string & tok ) const {
		// Try to match str_ against the front of [next, end); check bounds before dereferencing.
		std::string::const_iterator iter = str_.begin( );
		for( ; iter != str_.end( ) && next != end && *iter == *next; ++iter, ++next );
		if( iter == str_.end( ) ) {
			tok = str_;
			return true;
		} else {
			return false;
		}
	}
	void reset( ) { }
private:
	std::string str_;
};

This basic TokenizerFunction attempts to match a fixed string against the beginning of the sequence. Pretty boring and useless so far.
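
To see the calling convention in isolation, here is a quick sketch (my own addition, with placeholder names such as demo_find_str and match_while) of invoking the functor directly: on a successful match, tok receives the matched text and next is advanced past it.

#include <cassert>
#include <string>

// Sketch: call find_str_tfunc by hand, outside of boost::tokenizer.
void demo_find_str( ) {
	std::string input = "while( x > 5 )";
	std::string::const_iterator next = input.begin( );
	std::string tok;

	find_str_tfunc match_while( "while" );
	if( match_while( next, input.end( ), tok ) ) {
		assert( tok == "while" ); // the matched text
		assert( *next == '(' );   // next has advanced past the match
	}
}

The same protocol carries through everything below. Let's see another: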

class find_ident_tfunc {
public:
	bool operator( )( std::string::const_iterator & next, std::string::const_iterator end, std::string & tok ) const {
		std::string::const_iterator begin = next;
		// An identifier is a letter followed by any number of letters or digits.
		if( next != end && ( ( *next >= 'A' && *next <= 'Z' ) || ( *next >= 'a' && *next <= 'z' ) ) ) {
			++next;
			for( ; next != end && ( ( *next >= 'A' && *next <= 'Z' ) || ( *next >= 'a' && *next <= 'z' ) || ( *next >= '0' && *next <= '9' ) ); ++next );
			tok = std::string( begin, next );
			return true;
		} else {
			return false;
		}
	}
	void reset( ) { }
};

Slightly more interesting. This time we are trying to match an identifier at the beginning of the sequence (in our case, an identifier is defined by the regex [A-Za-z][A-Za-z0-9]*).
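
As a rough illustration (again my own sketch, with made-up inputs), only sequences that start with a letter produce a token:

#include <iostream>
#include <string>

// Sketch: feed a few inputs to find_ident_tfunc and print what it matches.
void demo_find_ident( ) {
	const char * inputs[ ] = { "while123", "x1 = 2", "123abc" };
	find_ident_tfunc match_ident;
	for( int i = 0; i < 3; ++i ) {
		std::string input( inputs[ i ] );
		std::string::const_iterator next = input.begin( );
		std::string tok;
		if( match_ident( next, input.end( ), tok ) ) {
			std::cout << inputs[ i ] << " -> " << tok << "\n"; // "while123", "x1"
		} else {
			std::cout << inputs[ i ] << " -> no match\n";      // "123abc"
		}
	}
}

Let's do one more: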

class find_integer_tfunc {
public:
	bool operator( )( std::string::const_iterator & next, std::string::const_iterator end, std::string & tok ) const {
		std::string::const_iterator begin = next;
		// An integer is one or more decimal digits.
		if( next != end && *next >= '0' && *next <= '9' ) {
			++next;
			for( ; next != end && *next >= '0' && *next <= '9'; ++next );
			tok = std::string( begin, next );
			return true;
		} else {
			return false;
		}
	}
	void reset( ) { }
};

This should be self-explanatory if you could follow the previous examples. So where do these get us? Nowhere, on their own. We need a TokenizerFunction that combines them somehow. This brings us to multi_tfunc:

template< class TFunc >
class multi_tfunc {
public:
	template< class X >
	multi_tfunc( X begin, X end ) : tfuncs_( begin, end ) { }
	bool operator( )( std::string::const_iterator & next, std::string::const_iterator end, std::string & tok ) const {
		for( ; next != end && *next == ' '; ++next ); // skip whitespace
		if( next == end ) {
			return false; // nothing left to tokenize
		}

		// Run every TokenizerFunction on a copy of next and keep the longest match.
		std::string longest;
		for( typename std::list< TFunc >::const_iterator iter = tfuncs_.begin( ); iter != tfuncs_.end( ); ++iter ) {
			std::string::const_iterator nextcopy = next;
			std::string result;
			if( (*iter)( nextcopy, end, result ) && result.size( ) > longest.size( ) ) {
				longest = result;
			}
		}

		if( longest.empty( ) ) {
			return false;
		} else {
			std::advance( next, longest.size( ) );
			tok = longest;
			return true;
		}
	}
	void reset( ) { }
private:
	std::list< TFunc > tfuncs_;
};

So how does this work? First we skip any whitespace. Then, for each TokenizerFunction in our list, we create a copy of the next iterator, execute that TokenizerFunction, and remember the longest match. But why do we want the longest token of all the TokenizerFunctions? Consider the following two tokens in the C language: "while" and "while123". "while" is the keyword that introduces a while loop, whereas "while123" is an identifier. By always choosing the longest match, we avoid this ambiguity.
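
A quick sketch (mine, not part of the tokenizer itself) makes the point concrete: on "while123", the keyword matcher and the identifier matcher disagree, and the longer match is the one we want.

#include <string>

// Sketch: what the two functors report for the same starting position.
void demo_longest_match( ) {
	std::string input = "while123 > 5";
	std::string tok;

	std::string::const_iterator next = input.begin( );
	find_str_tfunc match_while( "while" );
	match_while( next, input.end( ), tok );   // tok == "while"    (5 characters)

	next = input.begin( );
	find_ident_tfunc match_ident;
	match_ident( next, input.end( ), tok );   // tok == "while123" (8 characters)

	// multi_tfunc keeps the longer of the two, so "while123" is reported as an identifier.
}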

There is one final problem: std::list< TFunc > is a homogeneous container; that is, we cannot store TokenizerFunctions of different types in it. To get around this, we need a few helper classes:

// Abstract base so that different TokenizerFunction types can live behind one interface.
class tfunc_polym {
public:
	virtual ~tfunc_polym( ) { }
	virtual bool operator( )( std::string::const_iterator & next, std::string::const_iterator end, std::string & tok ) const = 0;
	virtual void reset( ) = 0;
};

template< class TFunc >
class tfunc_polym_derived : public tfunc_polym {
public:
	tfunc_polym_derived( TFunc tfunc ) : tfunc_( tfunc ) { }
	bool operator( )( std::string::const_iterator & next, std::string::const_iterator end, std::string & tok ) const {
		return tfunc_( next, end, tok );
	}
	void reset( ) {
		tfunc_.reset( );
	}
private:
	TFunc tfunc_;
};

// Copyable value type that forwards to a tfunc_polym; it does not own the pointer.
class tfunc_polym_adapter {
public:
	tfunc_polym_adapter( tfunc_polym * tfunc ) : tfunc_( tfunc ) { }
	bool operator( )( std::string::const_iterator & next, std::string::const_iterator end, std::string & tok ) const {
		return (*tfunc_)( next, end, tok );
	}
	void reset( ) {
		tfunc_->reset( );
	}
private:
	tfunc_polym * tfunc_;
};

Now we can use multi_tfunc with TFunc = tfunc_polym_adapter and we will be able to add any of the TokenizerFunctions we have just created. Let's put it all together:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <list>
#include <string>
#include <boost/tokenizer.hpp>

int main( ) {
	std::list< tfunc_polym_adapter > tfuncs;
	// Note: these new'd objects are never deleted; fine for a throwaway example.
	tfuncs.push_back( new tfunc_polym_derived< find_str_tfunc >( find_str_tfunc( "while" ) ) );
	tfuncs.push_back( new tfunc_polym_derived< find_str_tfunc >( find_str_tfunc( "(" ) ) );
	tfuncs.push_back( new tfunc_polym_derived< find_str_tfunc >( find_str_tfunc( ")" ) ) );
	tfuncs.push_back( new tfunc_polym_derived< find_str_tfunc >( find_str_tfunc( "{" ) ) );
	tfuncs.push_back( new tfunc_polym_derived< find_str_tfunc >( find_str_tfunc( "}" ) ) );
	tfuncs.push_back( new tfunc_polym_derived< find_str_tfunc >( find_str_tfunc( ">" ) ) );
	tfuncs.push_back( new tfunc_polym_derived< find_ident_tfunc >( find_ident_tfunc( ) ) );
	tfuncs.push_back( new tfunc_polym_derived< find_integer_tfunc >( find_integer_tfunc( ) ) );

	std::string s = "while( identifier123 > 5 ) { }";
	boost::tokenizer< multi_tfunc< tfunc_polym_adapter > > tok( s, multi_tfunc< tfunc_polym_adapter >( tfuncs.begin( ), tfuncs.end( ) ) );
	std::copy( tok.begin( ), tok.end( ), std::ostream_iterator< std::string >( std::cout, "\n" ) );

	return 0;
}

Output:

while
(
identifier123
>
5
)
{
}

This is a pretty simple example, but it should show you how to expand on the idea. There are a couple of improvements I'd still like to make. The first is to make the basic TokenizerFunction types I've created more general, so that they accept any type of iterator. Specifically, I'd like to be able to use an istream_iterator to read data directly from a file. However, istream_iterator is only an input iterator, so the iterator copy that multi_tfunc performs would not work properly; to get around that I would need some sort of buffered istream_iterator. Second, I'd like to cut down on the ugliness of constructing the boost::tokenizer object, although with the new auto keyword in C++0x we might not have to worry about that.
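
As a rough sketch of the first improvement (untested; basic_find_integer_tfunc and the Iterator parameter are placeholder names), the integer matcher could simply be templated on the iterator type:

#include <string>

// Sketch: find_integer_tfunc generalized to any forward iterator over characters.
template< class Iterator >
class basic_find_integer_tfunc {
public:
	bool operator( )( Iterator & next, Iterator end, std::string & tok ) const {
		Iterator begin = next;
		while( next != end && *next >= '0' && *next <= '9' ) {
			++next;
		}
		if( begin == next ) {
			return false; // no digits at the current position
		}
		tok = std::string( begin, next );
		return true;
	}
	void reset( ) { }
};

A typedef for multi_tfunc< tfunc_polym_adapter > would also take some of the sting out of the tokenizer declaration until auto is available.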

The complete code listing is here: tfunc.cpp

Update: It looks like Boost has beaten me to the punch on one of these problems: the multi_pass iterator (part of Boost.Spirit), which adapts an input iterator into a forward iterator, can be used to solve the iterator copy problem I described above.
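
For reference, a rough sketch of that approach, assuming Boost.Spirit's support_multi_pass header, its make_default_multi_pass factory, and iterator-generalized TokenizerFunctions (the file name is a placeholder):

#include <fstream>
#include <iterator>
#include <boost/spirit/include/support_multi_pass.hpp>

// Sketch: wrap a single-pass stream iterator so it can be copied and re-read.
typedef std::istreambuf_iterator< char > base_iterator;
typedef boost::spirit::multi_pass< base_iterator > forward_iterator;

void demo_multi_pass( ) {
	std::ifstream in( "script.txt" ); // placeholder file name
	forward_iterator first = boost::spirit::make_default_multi_pass( base_iterator( in ) );
	forward_iterator last  = boost::spirit::make_default_multi_pass( base_iterator( ) );
	// first and last could now be handed to iterator-generalized TokenizerFunctions.
}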
