Steve’s Blog

Just another compsci weblog

Making Boost.Tokenizer More Useful

This is a continuation of my last post. In it I discussed how I’d like to use the TokenizerFunctions I developed with input iterators. As it works now, however, this is impossible because InputIterators do not guarantee that if iter1 == iter2 then ++iter1 == ++iter2. To solve this, I need to write some sort of iterator wrapper that will buffer the input for me. I mentioned that boost::spirit has an iterator called multi_pass which does this, but I couldn’t get it to work. So I wrote one myself. The difference between my buffered_iterator and boost’s multi_pass is that my iterator keeps track of all references to the original iterator (assuming they’re made through copies of alternate constructor) and can reduce the size of the buffer automatically when there are no iterators pointing to the beginning. The effect of this is that when you’re tokenizing a large sequence using an ifstream, the entire buffer is never loaded into memory at one time (of course with today’s computers this optimization isn’t necessary most of the time but whatever).

To facilitate this functionality, the buffered_iterator class has two constructors: An alternate constructor and a copy constructor. The alternate constructor accepts the Input Iterator as an argument and allocates a structure to be shared across all copies of the iterator. The code looks like this:

template< class IterT, class IterDerefT >
class buffered_iterator : public std::iterator< std::forward_iterator_tag, IterDerefT, ptrdiff_t, IterDerefT*, IterDerefT& > {
	struct refcounter {
		refcounter( IterT iter ) : iter_( iter ), buffer_( std::list< IterDerefT >( 1 ) ), refs_( std::list< int >( 1, 0 ) ) { }
		IterT iter_;
		std::list< IterDerefT > buffer_;
		std::list< int > refs_;

	explicit buffered_iterator( IterT iter ) 
		: refcounter_( new refcounter( iter ) ), bufferiter_( refcounter_->buffer_.begin( ) ), refiter_( refcounter_->refs_.begin( ) ) { alloc( ); ++*this; }
	buffered_iterator( const buffered_iterator< IterT, IterDerefT > & x ) 
		: refcounter_( x.refcounter_ ), bufferiter_( x.bufferiter_ ), refiter_( x.refiter_ ) { alloc( ); }
	~buffered_iterator( ) { dealloc( ); }

	refcounter * refcounter_;
	typename std::list< IterDerefT >::iterator bufferiter_;
	std::list< int >::iterator refiter_;

The refcounter structure will hold the original input iterator, as well as the data buffer and reference buffer. The reference buffer stores the amount of iterators pointing to the corresponding data buffer locations. Therefore, the data buffer and reference buffer are always the same size. Say that we have a data buffer equal to { ‘A’, ‘B’, ‘C’ } (assuming IterDerefT = char) and we have two iterators pointing to ‘C’ in the data buffer. That means that the reference buffer would look like { 0, 0, 2 }. Since we’re only allowing forward iteration on this iterator, we know that ‘A’ and ‘B’ will never be referenced again, so we can get rid of them. And that’s what the minimize method does:

void minimize( ) {
	// remove any portions of the beginning of the buffer that aren't referenced
	for( std::list< int >::iterator iter = refcounter_->refs_.begin( ); iter != refcounter_->refs_.end( ) && *iter == 0; ) {
		refcounter_->buffer_.pop_front( );
		refcounter_->refs_.pop_front( );

The minimize method gets called whenever the iterator is modified in some way (iterator is destructed or operator++ is called). All that’s left now is to fill in the required methods (operator++, operator==, operator=, etc.,). I want to make one final note that we can’t use the boost::tokenizer class any more because it saves the starting iterator you pass it (and therefore the buffer would never be minimized). Instead, we have to use boost::token_iterator_generator. I’ll let you check it out in the code listing.



2 Comments so far

  1. Jake Voytko May 5th, 2010 9:14 am

    I would try to use typedefs to make the code a little clearer

    For instance:

    tfuncs.push_back( new tfunc_polym_derived< find_str_tfunc, fiter_t >( find_str_tfunc( “while” ) ) );

    could be..

    tfuncs.push_back(new tfunc_str_filter(find_str_tfunc( “while” )));

    And is there a reason you’re doing your own refcounting instead of using boost::shared_ptr? Just to see what it’s like to roll your own?

    BTW, is there a good reference somewhere for implementing your own iterators? I need to do that for my redblack trees…

  2. admin May 6th, 2010 6:04 am

    I’m not sure how I could boost::shared_ptr in this case, since I have application-specific data in the reference counting structure.

    As for implementing iterators, I just followed the format used by the sample implementations on ( ). It helps to derive from std::iterator so that certain STL algorithms work properly.

Leave a reply