Showing posts with label rtf. Show all posts

Saturday, 5 May 2012

ShelobPy now usable!

https://bitbucket.org/swbsystems/shelobpy 

My Python-based document text extraction code is now fairly usable. So far it reads:
  • Word doc files (the more recent Word 97-2003 format as well as Word 95)
  • Word docx files (Word 2007 onwards)
  • Microsoft Works wps files
  • Open Office odt files
  • PDF files
  • Rich Text rtf files
  • HTML files (also seems OK with the awful Word HTML files)
There will probably be cases where the code chokes or doesn't pull out what it should; reports of these, or any suggestions, would be welcome.

It uses pyPdf as an external dependency. I'll have a go at the PDF format myself when I get a chance, but for now pyPdf is an easily accessible and stable package, so I'll use it.

Friday, 13 April 2012

ShelobPy: Python File Spider

Added some code to Bitbucket today - https://bitbucket.org/swbsystems/shelobpy

I need a way to pull text out of various document files (Word, both new and old, Open Office, Rich Text, MS Works, PDF, HTML, etc.). I had worked on a C project to do this, but the libraries I needed were a pain to install and get working. It had to run on Linux, as it's going on a web server, and I didn't want to add the Open Office runtime just to do the conversions.

So far I have used pyPdf to pull out PDF text and Beautiful Soup to make a little sense of some terrible Word-HTML mark-up; the rest has been done with a little brute force and regular expressions!
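To give a flavour of the brute-force-and-regex approach, here is a minimal sketch for the two zip-based formats. It is not the ShelobPy code itself: the function name `extract_zip_xml_text` is made up for illustration, and it is written for Python 3. It relies only on the fact that a docx file is a zip archive with its body in `word/document.xml`, and an odt file is a zip archive with its body in `content.xml`.

```python
import io
import re
import zipfile
from html import unescape


def extract_zip_xml_text(data, member):
    """Return plain text from one XML member of a zipped document.

    `data` is the raw bytes of the file; `member` is the path inside
    the archive ("word/document.xml" for docx, "content.xml" for odt).
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml = archive.read(member).decode("utf-8")
    # Turn paragraph ends into newlines before throwing the tags away.
    xml = re.sub(r"</(?:w:p|text:p)>", "\n", xml)
    # Brute force: drop every remaining tag.
    text = re.sub(r"<[^>]+>", "", xml)
    # Decode entities such as &amp; and trim surrounding whitespace.
    return unescape(text).strip()
```

For a docx file on disk this would be called as `extract_zip_xml_text(open(path, "rb").read(), "word/document.xml")`. Stripping tags this way ignores tables, headers, footnotes and the like, which is roughly the trade-off that "brute force" implies.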

It is still __VERY__ hackish, but it does use some neat stuff, like Natural Language Processing, to pull out some decent search terms. Still lots of work to do on it, though!
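As a rough idea of what pulling out search terms can look like, here is a very simple stand-in: count word frequencies after dropping common stop words. This is an assumption for illustration, not the NLP technique ShelobPy actually uses; the `top_terms` name and the tiny stop-word list are both made up here.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list; a real one would be much longer.
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to was with".split()
)


def top_terms(text, limit=5):
    """Return the most frequent non-stop-words in `text`, most common first."""
    words = re.findall(r"[a-z']+", text.lower())
    # Skip very short tokens and stop words before counting.
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return [term for term, _ in counts.most_common(limit)]
```

Plain frequency counting favours long documents and misses multi-word phrases, so it is only a starting point compared with proper tagging or phrase extraction.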