I've started work on pulling text out of Microsoft Word files (the pre-Office 2007 doc format) using only Python. I had already completed a "passable" Microsoft Works (wps file) reader, but I took a few liberties with the format and avoided reading the file properly via the OLE Compound File specification. A few hours of poking at a doc file in my hex editor told me that I can't avoid doing things properly this time. So I wrote a rather simplified OleDocument class that parses the file and allows extraction of streams (it doesn't touch the mini streams, and it leaves out a few things that I don't need so far).
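To give a rough idea of what "doing things properly" involves, here is a minimal sketch of the first steps: reading the 512-byte compound file header and walking a sector chain. This is not the actual OleDocument class. The names parse_header, sector_offset and follow_chain are made up for illustration, and the field layout is my reading of the published compound file header, so treat it as an assumption to check against the spec.

```python
import struct

OLE_MAGIC = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"  # signature at offset 0
ENDOFCHAIN = 0xFFFFFFFE  # marks the last sector of a chain in the FAT
FREESECT = 0xFFFFFFFF    # marks an unused FAT/DIFAT slot

# Fixed part of the header (76 bytes), little-endian, followed by 109 DIFAT entries.
HEADER_FMT = "<8s16sHHHHH6sIIIIIIIII"


def parse_header(data):
    """Parse the 512-byte header (hypothetical helper, illustration only)."""
    if len(data) < 512 or data[:8] != OLE_MAGIC:
        raise ValueError("not an OLE compound file")
    (_sig, _clsid, minor, major, byte_order, sector_shift, mini_shift,
     _reserved, _n_dir, n_fat, first_dir, _txn, mini_cutoff,
     first_minifat, n_minifat, first_difat, n_difat) = struct.unpack_from(HEADER_FMT, data, 0)
    # The first 109 FAT sector numbers live in the header itself, at offset 76.
    difat = [s for s in struct.unpack_from("<109I", data, 76) if s != FREESECT]
    return {
        "major_version": major,
        "minor_version": minor,
        "byte_order": byte_order,
        "sector_size": 1 << sector_shift,        # usually 512
        "mini_sector_size": 1 << mini_shift,     # usually 64
        "num_fat_sectors": n_fat,
        "first_dir_sector": first_dir,
        "mini_stream_cutoff": mini_cutoff,       # usually 4096
        "first_minifat_sector": first_minifat,
        "num_minifat_sectors": n_minifat,
        "first_difat_sector": first_difat,
        "num_difat_sectors": n_difat,
        "difat": difat,
    }


def sector_offset(sector, sector_size):
    """File offset of a sector; sector 0 starts right after the header sector."""
    return (sector + 1) * sector_size


def follow_chain(fat, start):
    """Return the sector numbers of a chain, given the FAT as a list of ints."""
    chain, seen, sector = [], set(), start
    while sector != ENDOFCHAIN:
        if sector >= len(fat) or sector in seen:
            raise ValueError("corrupt FAT chain")
        seen.add(sector)
        chain.append(sector)
        sector = fat[sector]
    return chain
```

With the header parsed, a stream is read by looking up its start sector in the directory, following the chain through the FAT, and concatenating the sectors at the offsets given by sector_offset. Streams smaller than the mini stream cutoff live in the mini stream instead, which this sketch ignores, much like my class does.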
So here's the updated wps code with the new OleDocument class. The WPSReader is a little simpler now and seems far more robust. It still needs testing, though!
Now I need to start on the doc format. That's going to be far more difficult!