* The code below has been updated for a better wps reader go here *
I'm trying to extract text from various file formats for a search engine and trying to avoid any external packages. I have 10 thousand files to extract text from and a lot of them are WPS files (Microsoft Works - you know, that free office suite that comes preinstalled on many windows boxes).
I was opening the files and using regular expressions and some text sanitizing to try and get decent text from the file. Unfortunately, the text is split into blocks so I got some part-words as long as font names and other junk words from the metadata of the files. I had used libwps in the past but didn't want the dependency in my code. Most Windows based document formats from that stage use some sort of OLE-Stream content that is often kind of difficult to get your head around when your staring at the bytes in a hex editor. After a little reading of the libwps code, a few calculations an jumps in Hex Workshop and a few guesses I managed to work out some code to pull text from this format. Its work in progress and it needs some extensive testing (and it'll get it when I'm using it!) but looks good so far:
The greatest number of files are the old style Word 95-2003 (doc) files. Now I need to try do the same with those!