re: [siesta-dev] Mariachi Pluggability

From: Simon Wistow
Subject: re: [siesta-dev] Mariachi Pluggability
Date: 13:12 on 05 Jul 2004
On Thu, Jul 01, 2004 at 09:53:06AM +0100, Simon Wistow said:
> I've actually been playing around with a solution to the 'slow to 
> generate massive archives' problem over the last couple of days. 
> 
> I'll let you know how I get on.

I got a quick dose of tuits over the weekend and finished off what I'd 
been doing.

With Richard away this was spectacularly bad timing, since I don't
get to rely on the ultimate debugging tool, which is his Cluebat. However
I was in full-on revision avoidance mode so I did it anyway. Whilst this
borrows code and concepts from the original Mariachi and cheekily steals
the name, it's little more than a proof of concept (it's not much more
than a few hours' work, all told). With that said, and with a host of
other disclaimers hinted at, here we go.

The good news is that this 'interpretation' is much easier to extend - 
it uses plugins to generate each of the different views on the data. At 
the moment there are plugins for 

	date
	author
	thread
	lurker
	message (which displays individual messages)
	atom    (based on Ben's code)

and that regenerating pages is really quick once the messages are in the 
store.
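
To make the plugin idea concrete, here is a minimal sketch. It is in
Python rather than the Perl the real code is written in, and the names
(PLUGINS, register, generate) and the message dicts are hypothetical,
not taken from Mariachi-0.1. The only point is that each view is a small
function over messages already in the store, so adding a view means
adding one function.

```python
from collections import defaultdict

# Hypothetical registry: view name -> function that groups stored messages.
PLUGINS = {}


def register(name):
    """Decorator that records a view plugin under the given name."""
    def wrap(func):
        PLUGINS[name] = func
        return func
    return wrap


@register("date")
def by_date(messages):
    # Group message ids by the day they were sent.
    pages = defaultdict(list)
    for msg in messages:
        pages[msg["date"]].append(msg["id"])
    return dict(pages)


@register("author")
def by_author(messages):
    # Group message ids by sender address.
    pages = defaultdict(list)
    for msg in messages:
        pages[msg["from"]].append(msg["id"])
    return dict(pages)


def generate(messages):
    """Run every registered plugin over the stored messages."""
    return {name: plugin(messages) for name, plugin in PLUGINS.items()}


# Stand-in for messages already imported into the store (made-up data).
STORE = [
    {"id": "a@example", "from": "simon@example.org", "date": "2004/06/01"},
    {"id": "b@example", "from": "richard@example.org", "date": "2004/06/01"},
    {"id": "c@example", "from": "simon@example.org", "date": "2004/06/02"},
]

views = generate(STORE)
```

Regenerating every view is then just another call to generate() over
whatever is in the store, with no re-import needed.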

The bad news is that getting stuff into the store is very slow at the 
moment. About 5 or 6 times slower than the original Mariachi. Whilst in 
some ways a slowdown is to be expected, this is not really acceptable.

However the other good news is that I think it should be ripe for
speeding up. My code is very naive in places (no memoize and I'm sure I
work stuff out twice in different places of the code [0]) and also I
have lots and lots of development stuff for Email::Store installed so,
for example, at the moment, every time we import a message we

	Extract a summary
	Work out what thread it's in
	Extract all the Named Entities from it
	Do relationship mapping (http://blog.simon-cozens.org/6744.html)
	Index a load of stuff into Plucene
	Work out what mailing list the mail was sent to
	Extract all the attachments
	Store all this stuff in a DB

which is all well and good but is far more than Original Mariachi is 
doing.
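
As a rough illustration of the "no memoize" point, the sketch below
caches the result of a function so that repeated calls with the same
argument do the work only once. It is in Python for illustration only
(in Perl the Memoize module plays a similar role), and
normalise_subject is a hypothetical stand-in for whatever gets worked
out twice, not a function from the actual code.

```python
from functools import lru_cache

# Counts how often the real work runs, to show the cache taking effect.
CALLS = {"count": 0}


@lru_cache(maxsize=None)
def normalise_subject(subject):
    """Hypothetical per-message work: strip leading 're:' and lowercase."""
    CALLS["count"] += 1
    s = subject.strip()
    while s.lower().startswith("re:"):
        s = s[3:].strip()
    return s.lower()


# The second call is answered from the cache; the body runs only once.
first = normalise_subject("re: [siesta-dev] Mariachi Pluggability")
second = normalise_subject("re: [siesta-dev] Mariachi Pluggability")
```

This only helps where the same inputs really do recur. It does nothing
about the cost of the extra import steps listed above.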

The code, if anybody wants to have a look, is here

	http://www.thegestalt.org/simon/mariachi/Mariachi-0.1.tar.gz
	http://www.thegestalt.org/simon/mariachi/Mariachi-0.1/

which also has examples of output from a couple of hundred messages.

You can, for example, find all the mails from me

http://www.thegestalt.org/simon/mariachi/author/simon@xxxxxxxxxx.xxx.xxxx

and then easily surf to a mail and hence the thread it was part of in
either normal or lurker form

	http://www.thegestalt.org/simon/mariachi/lurker/723192D023274188007249BD17EBC478.MAI@xxxxxxxxxxx.xxx.xxxx
	http://www.thegestalt.org/simon/mariachi/thread/723192D023274188007249BD17EBC478.MAI@xxxxxxxxxxx.xxx.xxxx

or see everything on a particular day

	http://www.thegestalt.org/simon/mariachi/date/2004/06/01.html
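
The URL layout in the examples above can be summarised in a small
sketch. This is Python for illustration only. view_path is a
hypothetical helper, and the layout is inferred purely from the example
URLs in this mail, not read from the code.

```python
def view_path(view, key):
    """Map a view name and key to a path relative to the archive root."""
    if view == "date":
        # Dates are laid out as date/YYYY/MM/DD.html.
        year, month, day = key
        return f"date/{year:04d}/{month:02d}/{day:02d}.html"
    if view in ("author", "thread", "lurker"):
        # These views are keyed by an address or a message id.
        return f"{view}/{key}"
    raise ValueError(f"unknown view: {view}")


day_page = view_path("date", (2004, 6, 1))
author_page = view_path("author", "simon@example.org")
```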


There are bugs to do with threads, it looks like ass and there's no 
paging, but I think it's an interesting base to have a look at working 
from. Or to use to decide that Email::Store is not a good fit for 
Mariachi.

Please check out the TODO list here

	http://www.thegestalt.org/simon/mariachi/Mariachi-0.1/TODO

Simon



[0] It doesn't help that messages get slung around as Email::Simple, 
Mariachi::Message and Email::Store::Message in various bits of the code. 
This is obviously suboptimal. And ugly.

Generated at 09:00 on 03 Aug 2004 by mariachi 0.52