By Ingeniweb. A Django site.
December 14, 2008
» Looking for beta testers for Atomisator


I am looking for beta testers interested in experimenting with customized RSS feeds or email alerts.

Here’s a list of services Atomisator can provide:

  • You run a project and you would like to receive a daily summary in your mailbox of what is being said about it in blogs, tweets, etc.
  • You have a list of feeds you want to aggregate with specific filters, and you can’t manage it with Yahoo Pipes or any other tool out there because your need is too specific.
  • You want to annotate entries in a feed with extra information.
  • etc.

What you get as a beta-tester:

  • a custom Atomisator configuration that fits your needs
  • I am hosting the service, and you get
    • either a URL on my server to an XML file you can read in your aggregator
    • or one mail per day

What you are not getting as a beta tester:

  • you don’t get any guarantee on the output or the reliability; these are just experiments.
  • if it’s down, I can’t promise when it will be up again

Let me know by mail if you are interested.

      

December 7, 2008
» A PostRank plugin for Atomisator


Yesterday, I bumped into PostRank. This system collects data from various social systems like Twitter and provides a service where you can type in the URL of a blog post or an entire blog. You get a PostRank that depends on the popularity of the URL.

I wrote a plugin for Atomisator and ran it on my own blog. Here’s the result:  http://ziade.org/afpy/

And the Atomisator configuration for this is:

[atomisator]
sources =
    rss http://tarekziade.wordpress.com/feed/atom/

database = sqlite:///carpet.db

outputs =
    rss  public/rss.xml "http://tarekziade.wordpress.com/feed/atom/" "Carpet Python with PR" "Powered by Atomisator"

enhancers =
    postrank

How PostRank works

PostRank works with URLs you provide, on their web interface or through their web services.

As long as these URLs are present in their big cloud-computing based system, they provide a rank calculated from the number of comments related to the blog, the number of tweets that refer to it, and so on. The complete algorithm they use is secret, but this is not the point. I have secret algorithms too ;).

The point is that they are trying to categorize blog entries using social networks as indicators, and that they have a huge database.

Social indicators in Atomisator

This is one of the approaches I take with Atomisator when it is used to build a planet. For instance, I have a Digg plugin that injects into each entry the comments found on Digg if the entry was dugg. It also presents the number of Diggs. Of course this is done live, because I don’t have a cloud-computing based system where I store data: I use the Digg web service on the fly. (On the fly here doesn’t mean that Atomisator makes the calls to Digg from the Planet application, of course. It means Atomisator calls them when it creates the merged feed on the system.)

The benefit of this approach is that I can provide a social indicator on a post immediately. Systems like PostRank will not work on entries that are too recent because their spiders have a lag of one week or so.

The pitfall of my approach is that I am unable to calculate trends because I don’t store the indicators as they vary.

But if someone wanted to build a BtoC application using Atomisator, they could implement a set of plugins based on Amazon tools to store data in a more scalable way and over time.

Next steps

So I have this new PostRank plugin, and this is awesome because I have added a threshold parameter to it. Basically, if a post has a high PostRank value, it will appear in the Planet. If it’s low, it can be automatically removed. The fact that PostRanks are lagging for new entries is not a problem: interesting posts will eventually pop up in the Planet after a few days.

This is perfect to reduce the number of entries in an aggregator.
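
The threshold idea can be sketched in a few lines. This is only an illustration, not the plugin's actual code: the entry dictionaries, the `postrank` key, the default threshold value and the `keep_popular` name are all assumptions made for the example.

```python
# Hypothetical sketch: entries are plain dicts and 'postrank' is an
# assumed key holding the score fetched earlier (None when the service
# does not know the URL yet).
def keep_popular(entries, threshold=5.0):
    """Return only the entries whose rank reaches the threshold."""
    kept = []
    for entry in entries:
        rank = entry.get("postrank")
        # Entries with no rank yet are dropped for now; they can come
        # back on a later run once a rank is available.
        if rank is not None and rank >= threshold:
            kept.append(entry)
    return kept


entries = [
    {"title": "Popular post", "postrank": 8.2},
    {"title": "Quiet post", "postrank": 1.3},
    {"title": "Brand new post", "postrank": None},
]
print([entry["title"] for entry in keep_popular(entries)])
```

Lowering the threshold lets more entries through, which is the knob to turn when the Planet gets too quiet.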

But I do want to write my own PostRank that works live, with no storage at all, because the whole point of Atomisator is to provide a framework where anyone can try out various filtering combinations.

So to be able to provide this power, it needs to work just by collecting data directly from the social services, like the PostRank plugin does with this PostRank “meta-service”. The next step is therefore to see if I can query services like Twitter to list the tweets related to a URL, without having to store the Twitter feed myself.

In any case, if my talk on Atomisator at PyCon 2009 is selected, the PostRank plugin will be shown alongside the Digg plugin.

      

November 9, 2008
» How to receive email alerts when someone talks about something - 6 steps tutorial using Atomisator


I like Google Alerts: receiving a mail every day that summarizes all articles related to a given topic is really helpful when you need to focus on a specific subject for a while.

But this is not enough. I also want to receive a mail that points me to any mailing list, planet feed or blog out there that talks about the topic.

You can’t do it with Google Alerts as far as I know.

Let’s take an example:

I want to receive a daily mail that points me to any mail thread or blog entry related to the word “buildout” or to the word “pycon”.

Basically, to do it manually, I need to read Planet Python, Planet Zope, then take a look at the Python, Zope and Plone mailing lists. It takes at least 10 minutes, and more if you want to read all entries to make sure you won’t miss anything.

Since online systems like Nabble provide RSS feeds for mailing lists (can’t find yours? just add it there!), it is easy to read them as if they were regular feeds.

From there, a script that reads all the selected feeds and sends a mail pointing to the entries that match the selected words is simple to write as well, and fills the need.

But don’t code it: Atomisator will let you do this with a few lines of configuration.

Here’s a step-by-step tutorial.

Step 1 - install easy_install

Step 2 - install Atomisator and SQLite

Step 3 - create an “atomisator.cfg” file

The content of the file has to be:

[atomisator]
store-entries = false

sources =
  rss http://www.nabble.com/Python---python-list-f2962.xml
  rss http://n2.nabble.com/Plone-f293351.xml
  rss http://www.nabble.com/Zope---General-f6715.xml
  rss http://planet.python.org/rss10.xml
  rss http://www.zope.org/Planet/planet_rss10.xml
filters =
  buzzwords words.txt
outputs =
  email email.cfg

This configuration reads Planet Python, Planet Zope and various mailing lists (Python, Plone, Zope). Of course you can add or remove feeds in the sources option.

Step 4 - Create the words.txt file

This file contains regular expressions, one per line, that will be used to match the entries. The file has to be saved next to atomisator.cfg.

For our example:

buildout
pycon

You can put any expression you want in this file, as long as you have one matching expression per line.
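
To give an idea of what such a filter does with the file, here is a minimal sketch. It is not the buzzwords plugin itself: the function names, the entry dictionaries with `title` and `summary` keys, and the case-insensitive matching are assumptions made for the example.

```python
import re


def load_patterns(lines):
    """Compile one regular expression per non-empty line."""
    # Case-insensitive matching is an assumption of this sketch.
    return [re.compile(line.strip(), re.IGNORECASE)
            for line in lines if line.strip()]


def keep_matching(entries, patterns):
    """Keep the entries whose title or summary matches any pattern."""
    kept = []
    for entry in entries:
        text = entry.get("title", "") + "\n" + entry.get("summary", "")
        if any(pattern.search(text) for pattern in patterns):
            kept.append(entry)
    return kept


# Same content as the words.txt example above, one expression per line.
patterns = load_patterns(["buildout\n", "pycon\n"])
entries = [
    {"title": "PyCon 2009 talks", "summary": "The list is out."},
    {"title": "A zc.buildout recipe", "summary": "For Django."},
    {"title": "Unrelated post", "summary": "Nothing to see here."},
]
matching = keep_matching(entries, patterns)
print([entry["title"] for entry in matching])
```

With a real file, `load_patterns(open("words.txt"))` would work the same way, since a file object yields its lines.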

Step 5 - add an email.cfg configuration file.

This is where you define the target emails that will receive the alerts (tos option). You can also specify the from email, or the SMTP server location. The file has to be saved next to atomisator.cfg.

In our case it can be:

[email]
tos = tarek@ziade.org
from = tarek@ziade.org
smtp_server = smtp.neuf.fr
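
For the curious, here is roughly what an email output has to do with these three options. This is a sketch using only the standard library, not Atomisator's code: the `build_alert` and `send` names, the subject line and the entry dictionaries are assumptions, and `tos` is treated as a single address.

```python
import configparser
import smtplib
from email.message import EmailMessage

# Same content as the email.cfg example above.
EMAIL_CFG = """\
[email]
tos = tarek@ziade.org
from = tarek@ziade.org
smtp_server = smtp.neuf.fr
"""


def build_alert(cfg_text, entries):
    """Build the alert message and return it with the SMTP host name."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["email"]
    msg = EmailMessage()
    msg["Subject"] = "Atomisator alert (%d)" % len(entries)
    msg["From"] = section["from"]
    msg["To"] = section["tos"]
    # One title and one link per matching entry.
    msg.set_content(
        "\n\n".join("%s\n%s" % (e["title"], e["link"]) for e in entries)
    )
    return msg, section["smtp_server"]


def send(msg, server):
    """Deliver the message. Not called here: it needs a reachable server."""
    with smtplib.SMTP(server) as smtp:
        smtp.send_message(msg)


msg, server = build_alert(
    EMAIL_CFG,
    [{"title": "A zc.buildout recipe", "link": "http://example.com/1"}],
)
print(msg["To"], server)
```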

Step 6 - Run it !

The command to be called is atomisator (installed by easy_install) followed by the configuration file:

$ atomisator atomisator.cfg
Reading data.
Launching worker for rss - ('http://www.nabble.com/Python---python-list-f2962.xml',)
Launching worker for rss - ('http://n2.nabble.com/Plone-f293351.xml',)
Launching worker for rss - ('http://www.nabble.com/Zope---General-f6715.xml',)
Launching worker for rss - ('http://planet.python.org/rss10.xml',)
Launching worker for rss - ('http://www.zope.org/Planet/planet_rss10.xml',)
Retrieving from rss - ('http://www.nabble.com/Python---python-list-f2962.xml',)
Retrieving from rss - ('http://www.nabble.com/Zope---General-f6715.xml',)
Retrieving from rss - ('http://n2.nabble.com/Plone-f293351.xml',)
Retrieving from rss - ('http://planet.python.org/rss10.xml',)
Retrieving from rss - ('http://www.zope.org/Planet/planet_rss10.xml',)
.................................................................................................................................................
Writing outputs.
Data ready.

Check your mail. This call can be put in a daily cron job.

Tested under Mac OS X and Linux.

      

September 4, 2008
» Yet another Planet


Atomisator is a framework so it is hard to get an idea of its features until a real application uses it.

That is why I wrote a small application in Pylons called Yap (Yet Another Planet), which basically displays the XML file produced by an Atomisator instance. Since Atomisator does all the work, the Pylons app is really small (one or two controllers, that’s it).

My first use case was to produce a nice, smart Planet for our user group Afpy.

Here’s a first draft: http://ziade.org/afpy/

You can play with ‘j’, ‘k’ and the arrow keys to open and close posts. I am still working on this, so that it also scrolls the window when you are on a post.

Anyways, it grabs various French sources for Python and uses these plugins from Atomisator:

  • filter : reddit
  • filter : delicious
  • filter : doublons
  • enhancer : related
  • enhancer : digg

The result basically follows reddit and delicious links to display an extract of the linked page, and displays digg comments as well. Duplicates are removed, and a list of related entries is added to each entry.

It is based on this configuration file, which Atomisator uses to generate an XML file for Yap from a cron job:

[atomisator]

sources =
    rss     http://del.icio.us/rss/tag/python+fr    Delicious
    rss     http://www.afpy.org/search_rss?portal_type=AFPYNews&sort_on=Date&sort_order=reverse&review_state=published Afpy News
    rss     http://feeds.feedburner.com/Baderlog/python Bader
    rss     http://www.biologeek.com/journal/rss.php?cat=Python Biologeek
    rss     http://www.gawel.org/weblog/rss/python/afpy/zope/zope3/rss.xml  Gawel
    rss     http://www.haypocalc.com/blog/rss.php?cat=Python    Haypo
    rss     http://jehaisleprintemps.net/blog/rss/  No
    rss     http://programmation-python.org/sections/blog/exportrss Tarek
    rss     http://api.blogmarks.net/rss/tag/python,fr  Blogmarks

# put here the database location
database = sqlite:///afpy.db

# this is the file that will be generated
file = /home/tarek/www/packages/Yap/trunk/yap/public/afpy.xml

# infos that will appear in the generated feed.
title = Planet Python Francophone
description = Le planet de l'Association Python Francophone, et des gens heureux.
link =  http://www.afpy.org/planet/

filters =
    reddit
    delicious
    doublons

enhancers =
    related
    digg

What’s Next ?

Until now, there has been no attempt to automatically classify entries. The next plugin I am working on will provide a Naive Bayesian filter to classify entries, together with a way to train it through the Yap web interface: basically a ‘keep’/‘ditch’ button.

I will also set up an English Planet Python to see how things go with more sources.

August 27, 2008
» Atomisator, visiting links


I am writing a plugin for Atomisator that detects when a post is a Reddit or a Delicious entry, and adds a sample from the page it links to.

On a Reddit feed for example, you will basically get a meaningful title, and a summary that looks like this:

[link] [comments]

This is not really useful in a feed…

So basically, the plugin I am writing detects this kind of entry and grabs a sample out of the linked page, so the entry is transformed like this:

Extract from the link:
    ... blablablba
    blablabla...
[link] [comments]

The extract is plain text, pulled out using BeautifulSoup and the SGML parser from the standard library.

This is quite useful as long as the beginning of the web page is meaningful, which is rarely the case… Most of the time the web page starts with gibberish, like the content of a menu bar for instance.

So I am trying to detect the “best” part of the web page the link points to.

To do so, I am using the title of the entry, which is supposed to make sense. Since there is a good chance the text will contain the words used in the title, I am looking for them in the page.

I have tried several combinations, even trying to find the smallest sample where I get the maximum number of words from the title by doing some Cartesian products. But this was too slow.

Instead, I am trying to detect the real beginning of the post by trying some common patterns: most of the time the body of the post is a div tag with a class attribute like body, content, article-content, etc.
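
Here is a minimal sketch of that pattern-based detection. To stay self-contained it uses `html.parser` from the standard library rather than BeautifulSoup, so it is not the plugin's code; the `BodyExtractor` and `extract_body` names and the sample page are made up for the example, and only the three class names quoted above are checked.

```python
from html.parser import HTMLParser

# Class names taken from the text above; real pages use many more.
BODY_CLASSES = {"body", "content", "article-content"}


class BodyExtractor(HTMLParser):
    """Collect the text of the first <div> whose class looks like a body."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # <div> nesting depth inside the matched div
        self.chunks = []     # text fragments found in the matched div
        self.done = False    # True once the matched div has been closed

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1  # nested div inside the post body
        else:
            classes = (dict(attrs).get("class") or "").split()
            if BODY_CLASSES.intersection(classes):
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_body(html):
    """Return the text of the detected post body, or '' if none is found."""
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = """
<html><body>
<div class="menu">Home | About | Contact</div>
<div class="post content"><p>Atomisator builds <b>custom</b> feeds.</p>
<div class="note">It is plugin based.</div></div>
<div class="footer">Copyright</div>
</body></html>
"""
print(extract_body(page))
```

The menu and footer text are skipped because their divs do not carry one of the expected class names. A page that uses none of them yields an empty string, so a fallback is still needed.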

I am running it now over the Python feed at Reddit, and the results are starting to look nice (that is, you can understand what the page talks about most of the time). See here: http://www.ziade.org/atomisator/filtered.xml

Now I will try to run it over a fair amount of entries and with various sources to try to tune up the extractor.

This code will also be a useful base to visit links of any kind of entry, but it needs a lot of cleanup: I have spotted some parts with quadratic complexity that I need to get rid of.

Try it yourself :

  1. make sure you have SQLite installed
  2. install atomisator with easy_install atomisator
  3. create your atomisator.cfg file with the content below
  4. then run it by calling ‘atomisator’ in the folder where atomisator.cfg lives

atomisator.cfg content :

[atomisator]
sources =
    rss http://www.reddit.com/r/Python.rss reddit

database = sqlite:///atomisator.db
file = atomisator.xml
title = Meta feed
description = Automatic feed created by Atomisator.
link =
filters =
    reddit

August 20, 2008
» Atomisator, a framework to build custom RSS feeds


We are all overwhelmed by the amount of data in our feed readers. While this problem is unavoidable if you keep on adding new feeds to them, the feeds could be automatically filtered and categorized to reduce the flow of data.

I have wanted for a long time to try out some custom filters over my feeds, for example to find related entries by trying to understand the meaning of the posts, using tools like NLTK.

So I needed a playground for this, where I could play with feeds.

I think the closest tool for this is Yahoo Pipes, but as far as I know, the only way to create custom filters is to run a web service and call it from Yahoo Pipes.

Anyways, I started to code a framework (at first it was an example for my latest book) that looks a lot like Yahoo Pipes in its principles. I don’t have any User Interface at this time of course, but a simple plugin-based tool that will let me combine my code snippets with feeds.

It is called Atomisator (see http://atomisator.ziade.org).

The big picture

The process is quite simple:

  1. Readers are plugins that know how to read a source and provide entries out of it.
  2. Filters are plugins that know how to remove unwanted entries, or enhance them (change their title, summary, etc.). They can be combined.
  3. The entries are then pushed into a database. This is useful to avoid duplicates and to keep track of past entries.
  4. To create the feed, the entries are read from the database.
  5. Enhancers are plugins that add extra info to entries, typically info that can’t be stored, like Digg comments if the entry is detected on Digg, or Google related searches, and so on.
  6. The feed is then generated.
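
The six steps can be sketched with plain functions. Nothing here is Atomisator's real plugin API: the function names are invented, a dict stands in for the database, the reader returns canned data instead of fetching a feed, and the last step just returns a list instead of writing XML.

```python
# Hypothetical sketch of the pipeline; plain functions stand in for plugins.

def read_source(url):
    """Reader: return the entries of one source (canned data here)."""
    return [
        {"link": url + "/1", "title": "Python news", "summary": "..."},
        {"link": url + "/1", "title": "Python news", "summary": "..."},
        {"link": url + "/2", "title": "Off topic", "summary": "..."},
    ]


def python_only(entry):
    """Filter: return the entry to keep it, or None to drop it."""
    return entry if "python" in entry["title"].lower() else None


def add_note(entry):
    """Enhancer: add information computed when the feed is generated."""
    return dict(entry, summary=entry["summary"] + " [enhanced]")


def run(sources, filters, enhancers):
    database = {}
    for url in sources:
        for entry in read_source(url):           # step 1: read
            for flt in filters:                  # step 2: filter
                entry = flt(entry)
                if entry is None:
                    break
            if entry is not None:
                # step 3: store; keying on the link avoids duplicates
                database.setdefault(entry["link"], entry)
    feed = []
    for entry in database.values():              # step 4: read back
        for enhance in enhancers:                # step 5: enhance
            entry = enhance(entry)
        feed.append(entry)
    return feed                                  # step 6: generate the feed


feed = run(["http://example.com/feed"], [python_only], [add_note])
print(feed)
```

Because enhancers run after the entries are read back, what they add never reaches the database, which matches the idea that they carry info that can't be stored.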

Right now I am focusing on making it fast, which is not simple because the plugins can play with all entries in the database.

It is at an early stage and undertested, but it kinda works. I pushed it to PyPI to see if it attracts interest. If it does, I will document the process of writing plugins.

Make sure you have SQLite installed, and give it a try:

$ easy_install atomisator.main
$ atomisator -c atomisator.cfg
$ atomisator

You will have an atomisator.xml feed created. You can add other feeds in atomisator.cfg as well and try them.

Now with this environment, I can start to try out custom algorithms over my feeds.

I’ve been told the name doesn’t sound right in English, but it does in French so I keep it ;)