digitally-disturbed: May 2012

https://bitbucket.org/swbsystems/shelobpy

My Python-based document text extraction code is now fairly usable. So far it reads:

Word doc files (the Word 97-2003 format as well as Word 95)
Word docx files (Word 2007 onwards)
Microsoft Works wps files
Open Office odt files
PDF files
Rich Text rtf files
HTML files (also seems OK with the awful Word HTML files)
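
As a rough illustration of what reading the two zip-based formats in that list involves, here is a minimal standard-library sketch. It is not ShelobPy's actual code. It relies only on the fact that docx and odt files are zip archives whose body text lives in word/document.xml and content.xml respectively. The names extract_zipped_text, ZIPPED_FORMATS and build_sample_docx are made up for this example, and the sketch ignores field codes, nested text boxes and other details a real extractor has to deal with.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT_NS = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"

# Hypothetical lookup table: extension -> (XML member in the zip, block-level tags).
ZIPPED_FORMATS = {
    ".docx": ("word/document.xml", {W_NS + "p"}),
    ".odt": ("content.xml", {TEXT_NS + "p", TEXT_NS + "h"}),
}


def extract_zipped_text(source, extension):
    """Return the text of a docx or odt file, one paragraph per line.

    `source` is a path or a binary file-like object.
    """
    member, block_tags = ZIPPED_FORMATS[extension.lower()]
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read(member))
    paragraphs = []
    for element in root.iter():
        if element.tag in block_tags:
            # Join the text of every run inside this paragraph or heading.
            text = "".join(element.itertext())
            if text.strip():
                paragraphs.append(text)
    return "\n".join(paragraphs)


def build_sample_docx():
    """Build a tiny in-memory docx-like archive, for demonstration only."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(extract_zipped_text(build_sample_docx(), ".docx"))
```

The older binary formats (Word doc, Works wps) and PDF have no such shortcut, which is where most of the real work goes.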

There will probably be cases where the code chokes or doesn't pull out what it should. Reports of any such cases, or any other suggestions, would be welcome.

It uses pyPdf as an external dependency. I'll have a go at parsing the PDF format myself when I get a chance, but for now pyPdf is an easily accessible and stable package, so I'll use it.

Saturday, 5 May 2012

ShelobPy now usable!
