On Thu, Jul 01, 2004 at 09:53:06AM +0100, Simon Wistow said:
> I've actually been playing around with a solution to the 'slow to
> generate massive archives' problem over the last couple of days.
>
> I'll let you know how I get on.
I got a quick dose of tuits over the weekend and finished off what I'd
been doing.
Since Richard is away this was spectacularly bad timing, since I don't
get to rely on the ultimate debugging tool, which is his Cluebat. However
I was in full-on revision avoidance mode so I did it anyway. Whilst this
borrows code and concepts from the original Mariachi and cheekily steals
the name, it's little more than a proof of concept (it's not much more
than a few hours' work, all told). With that said, and with a host of
other disclaimers hinted at, here we go.
The good news is that this 'interpretation' is much easier to extend -
it uses plugins to generate each of the different views on the data. At
the moment there are plugins for

    date
    author
    thread
    lurker
    message (which displays individual messages)
    atom (based on Ben's code)

and that regenerating pages is really quick once the messages are in the
store.
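
To make the "one plugin per view" idea concrete, here is a minimal
sketch. It is in Python rather than Perl and is not the code in the
tarball: the registry, the decorator, the message fields and the sample
data are all made up for illustration. Only the view names and the
output path shapes are borrowed from this mail.

```python
from collections import defaultdict

# Hypothetical registry: view name -> function that turns messages into pages.
VIEW_PLUGINS = {}

def view_plugin(name):
    """Register a function as the page generator for one view."""
    def register(func):
        VIEW_PLUGINS[name] = func
        return func
    return register

@view_plugin("date")
def date_view(messages):
    # One page per day, keyed by an output path such as date/2004/06/01.html
    pages = defaultdict(list)
    for msg in messages:
        path = "date/%s.html" % msg["date"].replace("-", "/")
        pages[path].append(msg["subject"])
    return dict(pages)

@view_plugin("author")
def author_view(messages):
    # One page per sender address.
    pages = defaultdict(list)
    for msg in messages:
        pages["author/%s" % msg["from"]].append(msg["subject"])
    return dict(pages)

def generate_all(messages):
    """Run every registered plugin over the stored messages."""
    site = {}
    for plugin in VIEW_PLUGINS.values():
        site.update(plugin(messages))
    return site

# Made-up sample data standing in for the message store.
STORE = [
    {"from": "alice@example.org", "date": "2004-06-01", "subject": "Hello"},
    {"from": "bob@example.org", "date": "2004-06-01", "subject": "Re: Hello"},
]
SITE = generate_all(STORE)
```

Adding a new view in this sketch means writing one more decorated
function; nothing else has to know about it, which is the property that
makes the plugin approach easy to extend.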
The bad news is that getting stuff into the store is very slow at the
moment: about 5 or 6 times slower than the original Mariachi. Whilst in
some ways a slowdown is to be expected, this is not really acceptable.
However the other good news is that I think it should be ripe for
speeding up. My code is very naive in places (no memoize and I'm sure I
work stuff out twice in different places of the code [0]) and also I
have lots and lots of development stuff for Email::Store installed so,
for example, at the moment, every time we import a message we

    Extract a summary
    Work out what thread it's in
    Extract all the Named Entities from it
    Do relationship mapping (http://blog.simon-cozens.org/6744.html)
    Index a load of stuff into Plucene
    Work out what mailing list the mail was sent to
    Extract all the attachments
    Store all this stuff in a DB

which is all well and good but is far more than Original Mariachi is
doing.
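
One way to find out which of those import steps is eating the time is to
make the list of steps explicit, time each one, and run only the ones
you need. The following is a rough Python sketch of that idea, not how
Email::Store does it: the step functions are crude placeholders and
every name in it is an assumption made for illustration.

```python
import time

# Hypothetical stand-ins for import-time steps; each annotates a message dict.
def extract_summary(msg):
    msg["summary"] = msg["body"][:40]

def find_thread(msg):
    # A reply belongs to its parent's thread; otherwise it starts its own.
    msg["thread"] = msg.get("in_reply_to") or msg["id"]

def extract_entities(msg):
    # Crude placeholder for named-entity extraction: capitalised words.
    msg["entities"] = [w for w in msg["body"].split() if w.istitle()]

def import_message(msg, steps, timings):
    """Run only the chosen steps, accumulating seconds spent in each."""
    for step in steps:
        start = time.perf_counter()
        step(msg)
        elapsed = time.perf_counter() - start
        timings[step.__name__] = timings.get(step.__name__, 0.0) + elapsed
    return msg

timings = {}
msg = {"id": "1@example.org", "body": "Richard is away so Simon did it anyway"}
# Skip the entity extraction for this run.
import_message(msg, [extract_summary, find_thread], timings)
```

After importing a batch, the `timings` dict shows where the seconds
went, and any step the archive pages don't need can simply be left out
of the list.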
The code, if anybody wants to have a look, is here
http://www.thegestalt.org/simon/mariachi/Mariachi-0.1.tar.gz
http://www.thegestalt.org/simon/mariachi/Mariachi-0.1/
Which also has examples of output from a couple of hundred messages.
You can, for example, find all the mails from me
http://www.thegestalt.org/simon/mariachi/author/simon@xxxxxxxxxx.xxx.xxxx
and then easily surf to a mail and hence the thread it was part of in
either normal or lurker form
http://www.thegestalt.org/simon/mariachi/lurker/723192D023274188007249BD17EBC478.MAI@xxxxxxxxxxx.xxx.xxxx
http://www.thegestalt.org/simon/mariachi/thread/723192D023274188007249BD17EBC478.MAI@xxxxxxxxxxx.xxx.xxxx
or see everything on a particular day
http://www.thegestalt.org/simon/mariachi/date/2004/06/01.html
There are bugs to do with threads, it looks like ass and there's no
paging but I think it's an interesting base to have a look at working
from. Or using to decide that Email::Store is not a good fit for
Mariachi.
Please check out the TODO list here
http://www.thegestalt.org/simon/mariachi/Mariachi-0.1/TODO
Simon
[0] It's not helped that Messages get slung around as Email::Simple,
Mariachi::Message and Email::Store::Message in various bits of the code.
This is obviously suboptimal. And ugly.
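
For what it's worth, here is a small Python sketch of the kind of
memoizing I mean: cache the derived value against the message-id string,
so it doesn't matter which of the three classes the message happens to
be wrapped in at the time. The two classes, the parent table and the
function name are hypothetical stand-ins, not the real Email::Simple or
Email::Store::Message interfaces.

```python
from functools import lru_cache

CALLS = {"count": 0}

# Hypothetical lookup standing in for the store: message-id -> parent id.
PARENTS = {"c@example.org": "b@example.org", "b@example.org": "a@example.org"}

@lru_cache(maxsize=None)
def thread_root(message_id):
    """Walk up the reply chain; results are cached per message-id string."""
    CALLS["count"] += 1
    parent = PARENTS.get(message_id)
    return message_id if parent is None else thread_root(parent)

# Two unrelated wrapper classes carrying the same message-id.
class SimpleMessage:
    def __init__(self, message_id):
        self.message_id = message_id

class StoreMessage:
    def __init__(self, message_id):
        self.message_id = message_id

first = thread_root(SimpleMessage("c@example.org").message_id)
calls_after_first = CALLS["count"]
# Same id, different wrapper: answered from the cache, no extra work.
second = thread_root(StoreMessage("c@example.org").message_id)
```

The first lookup walks the whole chain; the second does no work at all,
even though it arrives via a different class.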
Generated at 09:00 on 03 Aug 2004 by mariachi 0.52