Saturday 5 May 2012

ShelobPy now usable!

https://bitbucket.org/swbsystems/shelobpy 

My Python-based document text extraction code is now fairly usable. So far it reads:
  • Word doc files (the recent Word 97-2003 format as well as Word 95)
  • Word docx files (Word 2007 onwards)
  • Microsoft Works wps files
  • Open Office odt files
  • PDF files
  • Rich Text rtf files
  • HTML files (also seems OK with the awful Word HTML files)
There will probably be cases where the code chokes or doesn't pull out what it should. Reports of any such cases, or suggestions, would be welcome.
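
To give a feel for what text extraction involves for one of the formats above, here is a minimal sketch for docx files using only the standard library. It is not ShelobPy's actual code: the function name docx_to_text is made up for illustration, and it assumes the usual docx layout, a zip archive whose main text lives in word/document.xml. It ignores headers, footers, footnotes and tabs.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the body text of a docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    Hypothetical helper for illustration, not part of ShelobPy.
    """
    # A docx file is a zip archive; the main body is word/document.xml.
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # Each paragraph is split into runs; the text sits in <w:t> nodes.
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling docx_to_text("report.docx") (the file name is just an example) returns the document body as a single string. The odt format is also a zip archive, with its text in content.xml, so a similar approach should work there with different element names.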

It uses pyPdf as an external dependency. I'll have a go at the PDF format myself when I get a chance, but for now pyPdf is an easily accessible and stable package, so I'll use it.