Many languages don't put whitespace between some or all words (Chinese, Japanese). Others favor long compound words that are better represented as sequences of tokens than as a single token (German, Russian). Many inputs are dirty (OCR output, for example), so whitespace can be unreliable. And even in clean English text, inflected words may be better tokenized as a sequence of tokens than as a single token.
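As a minimal sketch of both failure modes: naive whitespace splitting returns an unsegmented Chinese sentence as one giant "token", while a toy greedy longest-match subword tokenizer (with a made-up example vocabulary; any real system would learn its vocabulary from data) can break a German compound into its parts.

```python
# Whitespace tokenization fails when words aren't space-delimited.
text_zh = "我喜欢自然语言处理"  # Chinese sentence with no spaces
print(text_zh.split())  # the entire sentence comes back as a single item

# Hypothetical toy vocabulary; a real subword vocabulary is learned from a corpus.
vocab = {"donau", "dampf", "schiff", "fahrt"}

def subword_tokenize(word, vocab):
    """Greedy longest-match from the left; unknown characters fall back to themselves."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible substring first, shrinking until a vocab hit.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # single-character fallback
            i += 1
    return tokens

print(subword_tokenize("donaudampfschifffahrt", vocab))
# splits the compound into ["donau", "dampf", "schiff", "fahrt"]
```

This greedy strategy is only an illustration of the idea; production tokenizers typically use learned merge rules (BPE) or probabilistic segmentation (unigram language models) rather than longest match.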