digitally-disturbed: rtf

Saturday, 5 May 2012

ShelobPy now usable!

https://bitbucket.org/swbsystems/shelobpy

My Python-based document text extraction code is now fairly usable. So far it reads:

Word doc files (The recent Word 97-2003 format as well as Word 95)
Word docx files (Word 2007 onwards)
Microsoft Works wps files
Open Office odt files
PDF files
Rich Text rtf files
HTML files (also seems OK with the awful Word HTML files)

There will probably be cases where the code chokes or doesn't pull out what it should. Reports of any such cases, or suggestions, would be welcome.

It uses pyPdf as an external dependency. I'll have a go at the PDF format myself when I get a chance, but for now pyPdf is an easily accessible and stable package, so I'll use it.

Friday, 13 April 2012

ShelobPy: Python File Spider

Added some code to Bitbucket today - https://bitbucket.org/swbsystems/shelobpy

I need a way to pull text out of various document files (Word, both new and old, Open Office, Rich Text, MS Works, PDF, HTML, etc.). I had worked on a C project to do this, but the libraries I needed were a pain to install and get working. It has to run on Linux as it's going on a web server, and I didn't want to add the Open Office runtime to do the conversions.

So far I have used pyPdf to pull out PDF text and Beautiful Soup to make a little sense of some terrible Word-HTML mark-up; the rest has been done with a little brute force and regular expressions!
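To give a flavour of the "brute force and regular expressions" approach, here is a minimal sketch for the zipped XML formats. It is not the actual ShelobPy code: the function names `extract_zipped_xml_text` and `build_sample_docx` are made up for illustration, and the sketch only relies on the fact that a docx file is a zip archive with its body in `word/document.xml` (an odt file keeps its body in `content.xml`). It is written for Python 3 and uses only the standard library.

```python
import io
import re
import zipfile
from html import unescape


def extract_zipped_xml_text(data, member):
    """Return the plain text of one XML member inside a zipped document.

    Hypothetical helper for illustration only, not part of ShelobPy.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml = archive.read(member).decode("utf-8")
    # Turn paragraph ends (docx: w:p, odt: text:p) into newlines
    # so paragraphs do not run together once the tags are gone.
    xml = re.sub(r"</(?:w:p|text:p)>", "\n", xml)
    # Brute force: drop every remaining tag.
    text = re.sub(r"<[^>]+>", "", xml)
    # Decode entities such as &amp; and trim surrounding whitespace.
    return unescape(text).strip()


def build_sample_docx():
    """Build a tiny in-memory stand-in for a docx file (test data only)."""
    body = (
        "<w:document><w:body>"
        "<w:p><w:r><w:t>Hello &amp; welcome</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    return buffer.getvalue()


sample_text = extract_zipped_xml_text(build_sample_docx(), "word/document.xml")
print(sample_text)
```

For a real file you would read the bytes from disk and pass `"word/document.xml"` for docx or `"content.xml"` for odt. A regex this blunt ignores tabs, tables, footnotes and the like, which is roughly what "hackish" means here.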

It is still __VERY__ hackish, but does use some neat stuff like Natural Language Processing to pull out some decent search terms. Still lots of work to go on it though!
