digitally-disturbed: Reading Microsoft Works WPS files in Python

* The code below has been updated. For a better WPS reader, go here. *

I'm trying to extract text from various file formats for a search engine while avoiding any external packages. I have 10,000 files to extract text from, and a lot of them are WPS files (Microsoft Works - you know, that free office suite that comes preinstalled on many Windows boxes).

I was opening the files and using regular expressions and some text sanitizing to try and get decent text from the file. Unfortunately, the text is split into blocks, so I got some part-words as well as font names and other junk words from the metadata of the files. I had used libwps in the past but didn't want the dependency in my code. Most Windows-based document formats from that era use some sort of OLE stream content that is often difficult to get your head around when you're staring at the bytes in a hex editor. After a little reading of the libwps code, a few calculations and jumps in Hex Workshop, and a few guesses, I managed to work out some code to pull text from this format. It's a work in progress and needs extensive testing (and it'll get it when I'm using it!), but it looks good so far:

import re
import struct
import sys

# The file is read as bytes, so the magic pattern must be a bytes pattern
WPSMAGICPATTERN = re.compile(br"(CHNKWKS|CHNKINK)")
WPSSTRIPPATTERN = re.compile(r"\r")


class ReaderError(Exception):
    """Raised when the file cannot be read as a Works 8 WPS document."""


class WPSReader(object):
    TEXT_BLOCK = 0x0E00

    def __init__(self, file_name):
        self.file_name = file_name
        self.magic_pattern = WPSMAGICPATTERN
        self.strip_pattern = WPSSTRIPPATTERN

    def _process_entries(self, entry_buff):
        magic, local, next_offset = struct.unpack("<HHI", entry_buff[:8])
        if magic != 0x01F8:
            raise ReaderError("Invalid format - Entry magic tag incorrect")
        entry_pos = 0x08  # 2 WORDS & 1 DWORD
        for i in range(local):
            # Each entry starts with its own length; keep it separate from
            # the data size so we can step to the next entry correctly
            entry_size = struct.unpack("<H", entry_buff[entry_pos:entry_pos+0x2])[0]
            name, offset, size = struct.unpack("<2x4s10xII", entry_buff[entry_pos:entry_pos+entry_size])
            if name == b"TEXT":  # Success!
                return (local, 0x00, offset, size)
            entry_pos += entry_size
        return (local, next_offset, 0x00, 0x00)  # Needs to be run again

    def extract_text(self):
        with open(self.file_name, "rb") as fd:
            buff = fd.read()
        matches = self.magic_pattern.search(buff)
        if not matches:
            raise ReaderError("No 'Magic' block: not a valid WPS file")
        if matches.group(1) == b"CHNKINK":
            raise ReaderError("Unable to convert a WPS file prior to version 8")
        headers_start = matches.start()
        entries_pos = headers_start + 24
        total_entries = struct.unpack("<12xH", buff[headers_start:headers_start+14])[0]
        while True:
            entries, next_offset, text_header_offset, text_size = \
                    self._process_entries(buff[entries_pos:])
            if text_size:  # TEXT found
                break
            total_entries -= entries
            if total_entries and next_offset:
                entries_pos = next_offset + self.TEXT_BLOCK  # Move to next block
            else:
                raise ReaderError("Unable to find TEXT section. File corrupt?")
        text_offset = text_header_offset + headers_start  # Move to start of text
        block_size = min(self.TEXT_BLOCK, text_size)
        text = buff[text_offset:text_offset+block_size]
        text_size -= block_size
        block_size = min(self.TEXT_BLOCK, text_size)
        if text_size:
            text_offset = 0x800  # Seems to always be the location of second block
            text += buff[text_offset:text_offset+block_size]
            text_size -= block_size
        if text_size:
            text_offset = text_header_offset + headers_start + self.TEXT_BLOCK
            text += buff[text_offset:text_offset+text_size]
        return self.strip_pattern.sub("\r\n", text.decode("UTF16"))


if __name__ == "__main__":
    print(WPSReader(sys.argv[1]).extract_text())
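When poking at a file in a hex editor, it can help to dump every entry in a table rather than stopping at TEXT. Here is a minimal sketch, assuming the same layout the reader above relies on: an 8-byte table header tagged 0x01F8, followed by entries that each hold their own length, a 4-byte name, 10 bytes I haven't identified, an offset and a data size. The `list_entries` helper and the sample table are made up for illustration, and the "XXXX" name is a placeholder rather than a real Works entry name.

```python
import struct

ENTRY_TABLE_MAGIC = 0x01F8   # same tag the reader checks
ENTRY_FORMAT = "<H4s10xII"   # entry length, name, 10 unknown bytes, offset, data size


def list_entries(entry_buff):
    """Return (name, offset, size) for every entry in one entry table."""
    magic, count, _next_offset = struct.unpack_from("<HHI", entry_buff, 0)
    if magic != ENTRY_TABLE_MAGIC:
        raise ValueError("Entry magic tag incorrect")
    entries = []
    pos = 8  # skip the table header: 2 WORDs and 1 DWORD
    for _ in range(count):
        entry_size, name, offset, size = struct.unpack_from(ENTRY_FORMAT, entry_buff, pos)
        entries.append((name, offset, size))
        pos += entry_size  # step by the entry's own length, not the data size
    return entries


# Synthetic two-entry table (values invented purely for the demo)
sample = struct.pack("<HHI", ENTRY_TABLE_MAGIC, 2, 0)
sample += struct.pack(ENTRY_FORMAT, 24, b"XXXX", 0x100, 0x40)
sample += struct.pack(ENTRY_FORMAT, 24, b"TEXT", 0x200, 0x1A)
for name, offset, size in list_entries(sample):
    print(name, hex(offset), hex(size))
```

On a real file you would pass in the bytes starting 24 bytes after the CHNKWKS magic, the same slice the reader hands to `_process_entries`.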

Most of the files are old-style Word 95-2003 (.doc) files. Now I need to try to do the same with those!
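Those .doc files are OLE2 compound documents, so a sensible first step when sorting through a pile of files might be checking for the 8-byte OLE2 signature before picking a reader. This is a rough sketch only: `is_ole2` and `is_ole2_file` are names I've made up, and the check just identifies the container. It doesn't tell you what's inside it, and it doesn't extract any text.

```python
# Signature found at the start of every OLE2 compound file
OLE2_SIGNATURE = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"


def is_ole2(data):
    """True if the given bytes start with the OLE2 signature."""
    return data[:8] == OLE2_SIGNATURE


def is_ole2_file(path):
    """Read only the first 8 bytes of the file and check the signature."""
    with open(path, "rb") as fd:
        return is_ole2(fd.read(8))
```

Reading just 8 bytes per file keeps the triage cheap, which matters with thousands of files to get through.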

Wednesday, 18 April 2012
