digitally-disturbed: ShelobPy: Python File Spider

Added some code to Bitbucket today - https://bitbucket.org/swbsystems/shelobpy

I need a way to pull text out of various document files (Word, both new and old formats, OpenOffice, Rich Text, MS Works, PDF, HTML, etc.). I had worked on a C project to do this, but the libraries I needed were a pain to install and get working. It has to run on Linux, as it's going on a web server, and I didn't want to add the OpenOffice runtime just to do the conversions.

So far I have used pyPdf to pull out PDF text and Beautiful Soup to make a little sense of some terrible Word-HTML markup; the rest has been done with a little brute force and regular expressions!
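To give a flavour of the brute-force-and-regex side, here is a minimal sketch of stripping Word-exported HTML down to plain text. It is not the code from the repository: the function name `html_to_text` is made up for illustration, it assumes Python 3 (for `html.unescape`), and it uses only the standard library rather than Beautiful Soup.

```python
import html
import re


def html_to_text(markup):
    """Crudely reduce Word-exported HTML to plain text (hypothetical helper)."""
    # Drop comments first; Word hides conditional blocks and XML data in them.
    text = re.sub(r"(?s)<!--.*?-->", " ", markup)
    # Drop script and style blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", text)
    # Replace every remaining tag, e.g. <p class=MsoNormal> or <o:p>, with a space.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode entities such as &amp; and &nbsp;.
    text = html.unescape(text)
    # Collapse runs of whitespace, including non-breaking spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Regexes like these will trip over malformed or nested markup, which is exactly why a real parser is the safer choice for the messier files.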

It is still __VERY__ hackish, but it does use some neat stuff, like natural language processing to pull out some decent search terms. There is still lots of work to do on it, though!
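For a rough idea of what "pulling out search terms" can look like, here is a deliberately simple stand-in: plain word-frequency counting with a tiny stopword list. This is an assumption for illustration only, not the NLP approach ShelobPy actually uses, and the names `STOPWORDS` and `top_terms` are hypothetical.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (assumption; a real list would be far longer).
STOPWORDS = frozenset(
    "a an and are as at be but by for from has have in is it its of on or "
    "that the this to was were with".split()
)


def top_terms(text, limit=5):
    """Return the most frequent non-stopword terms in text, most common first."""
    # Lower-case the text and keep alphabetic runs only.
    words = re.findall(r"[a-z]+", text.lower())
    # Ignore very short words and anything in the stopword list.
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return [term for term, _ in counts.most_common(limit)]
```

Calling `top_terms(extracted_text, limit=10)` on the text pulled out of a document gives a crude list of candidate search terms; proper NLP (stemming, part-of-speech tagging) would do a much better job of picking meaningful ones.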

Friday, 13 April 2012

