And Now for Something Completely Different…

Posted July 4th, 2007 by

I’m doing some more Yahoo Pipes work: aggregating and filtering blog feeds. I’ve created a feed known as Chardonnay, which combines a whitelist with a highly filtered set of search results, and I’ll eventually make a less-filtered “2-Buck Chuck” version and a highly filtered Eiswein version.

My basic rule of thumb for the Chardonnay feed is that if a blog’s signal-to-noise ratio is less than 3:1 or so, I bump it into Tier 2. It’s not that those blogs don’t have any good content, but I’m trying to keep my feed at a signal-to-noise ratio of at least 8:2.

For the Eiswein feed, I’m aiming for a 9:1 signal-to-noise ratio. To do that, I have to filter everything, including myself. =)

As for 2-Buck Chuck, well, let’s say it’s so unfiltered that it has chunks^wpieces of sediment in it. It’s also hard to build something like this and then intentionally disable the quality controls you’ve built.

“Why the wine motif?” you ask. Well, I was looking for something that has a range of price and quality, so wine fit right in. I bought www.chateaublogsville.com, which will be the entry site for the three security blog feeds. It might take me a couple of weeks to put up a simple site, but in the meantime you’re free to subscribe to any of the feeds.

Here is one thing I’m finding out about blog feeds. For the Chardonnay, I had to look at a couple of approaches to feed aggregation. I started out with a list of people I link to and a desire for a catch-all Google and Technorati search that would find relevant information from little-known feeds. After a couple of days of data munging, I noticed that the source feeds fit into the following groups:

  • Tier 1: Feeds that I want to let through pretty much unfiltered (Mine, Matasano, Curphey, ISM-Community, Bejtlich, etc.)
  • Tier 2: Feeds that need to be filtered for relevancy (Security Bloggers Network members, news site aggregators that I haven’t whitelisted above)
  • Tier 3: Feeds that need to be filtered for spam and then filtered for relevancy whilst wearing lead gloves (Technorati and Google searches)
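
As a rough illustration of the three tiers above, here is a minimal Python sketch. It is not how the actual pipes are built (those live in Yahoo Pipes); the keyword lists, the `Item` class, and the sample titles are all made up for the example.

```python
from dataclasses import dataclass

# Hypothetical keyword lists; the real pipes use their own terms.
RELEVANT = ("risk management", "27001", "security")
SPAM = ("timeshare", "casino", "cheap pills")


@dataclass
class Item:
    title: str
    tier: int  # 1 = whitelist, 2 = needs relevancy filter, 3 = search results


def is_relevant(item):
    """True if the title mentions at least one relevant keyword."""
    text = item.title.lower()
    return any(word in text for word in RELEVANT)


def is_spam(item):
    """True if the title mentions at least one spam keyword."""
    text = item.title.lower()
    return any(word in text for word in SPAM)


def keep(item):
    """Apply the filtering that matches the item's tier."""
    if item.tier == 1:
        return True  # let through pretty much unfiltered
    if item.tier == 2:
        return is_relevant(item)  # relevancy only
    return not is_spam(item) and is_relevant(item)  # spam first, then relevancy


def build_feed(items):
    return [item for item in items if keep(item)]


sample = [
    Item("What I had for lunch", 1),
    Item("Risk management for small agencies", 2),
    Item("Conference travel tips", 2),
    Item("Security of your timeshare investment", 3),
    Item("ISO 27001 audit notes", 3),
]
feed_titles = [item.title for item in build_feed(sample)]
```

With this sample, the Tier 1 lunch post gets through untouched, the off-topic Tier 2 post is dropped, and the Tier 3 timeshare post is caught by the spam check even though it mentions security.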

Now that I write it all down, it sounds exactly like writing email filters or SIEM tuning or any one of a bazillion other uses for filtering, so I’ve once again recreated ideas that already exist. Of course, I probably could have saved some time by approaching the problem from this angle, but really I had to move the ideas around a dozen different ways until they fit in a way that made sense.

The funny thing is that I had the hardest time filtering on “privacy”. I was getting too much junk off the blog search feeds (privacy of timeshares, that kind of thing), so what I’m playing with is removing “privacy” from the main filter and then filtering the search feeds on “privacy” plus a second keyword.
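
Here is a small Python sketch of that “privacy plus a second keyword” rule. The companion keywords are invented for illustration, and the function only models this one rule, not the whole filter.

```python
# Hypothetical second keywords; the real filter may use different ones.
COMPANIONS = ("data", "breach", "encryption", "policy")


def privacy_rule_passes(title):
    """Keep a search-feed item only if it mentions "privacy"
    AND at least one companion keyword."""
    text = title.lower()
    if "privacy" not in text:
        return False
    return any(word in text for word in COMPANIONS)


junk = privacy_rule_passes("Privacy of timeshares in Florida")
good = privacy_rule_passes("Privacy policy changes after a data breach")
```

The timeshare item fails because “privacy” shows up alone, while the second title passes because it also mentions a companion keyword.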

The usual disclaimers apply here: I’m playing with content provided by other people, so I don’t even remotely pretend to have any control over it. A couple of pieces of junk will slip through the filters. Because the source of the filters is open for the world to see, you can cheat them by including the right words.




Posted in Odds-n-Sods, Technical | 7 Comments »

7 Responses

  1.  shrdlu Says:

    Love the wine analogy! I wonder whether the Eiswein version could be further refined into a Tokaji Aszú, too sweet for most people’s tastes. Then you have the Manischewitz feed, the Communion feed, and the Melbourne Old-and-Yellow (for trolls, of course, as it’s a fighting wine). 🙂

  2.  rybolov Says:

    OK, I put up a rough draft of the site. It will get bigger and better, probably by the time anybody gets around to visiting it. =)

    http://www.chateaublogsville.com/

  3.  Anton Chuvakin Says:

    Warning: ego question alert 🙂

    Which list am I in?

  4.  rybolov Says:

    Hi Anton

    Right now you’re in the greylist with the rest of the SBN. I take a direct feed for SBN off FeedBurner and filter based on keywords such as “risk management” or even “27001”.

    I’m selectively adding blogs from SBN into the whitelist as I evaluate their content ratio.

    You *would* make it into the A-list if you would quit saying “the C-word” (no, not that C-word, the “c*mpliance” word that I just can’t write in a comment on my own blog) as much. =)

    You can view the “source code” at the following url:
    http://pipes.yahoo.com/pipes/person.info?eyuid=UvQKiBgwp2fsy.eo2MBuxA–

    Right now I’m going through the output as blogs are added to the feed so that I can tune the end result. I always take suggestions if you have a tuning idea.

  5.  Marcin Says:

    I feel honored 🙂

    You said you wanted your whitelist security feeds to go unfiltered. However, you are filtering them, the whitelist AND greylist… shouldn’t the whitelist be connected via union after the permit filter?

  6.  rybolov Says:

    There is a tiny bit of filtering in Chardonnay right before I rack, stack, and truncate, but I did put the whitelist in the right spot: after the permit filter (what I call the “relevancy filter”).

    If you look at Eiswein, I did the union before the relevancy filter, so yeah, that’s intentional. It filters out my own zombie posts. =)

    Now the fun thing is that you can use my filters and fork them into your own personal feed.
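
The difference in ordering between the two feeds can be sketched in Python like this. It is only an illustration of union-then-filter versus filter-then-union; the keywords and sample titles are made up, and the real pipes do more (sorting, truncating, and so on).

```python
def relevancy_filter(titles, keywords=("risk", "audit")):
    """Keep titles that mention at least one keyword (hypothetical list)."""
    return [t for t in titles if any(k in t.lower() for k in keywords)]


def chardonnay(whitelist, greylist):
    # Filter the greylist first, then union with the unfiltered whitelist.
    return whitelist + relevancy_filter(greylist)


def eiswein(whitelist, greylist):
    # Union first, then filter everything, whitelist included.
    return relevancy_filter(whitelist + greylist)


whitelist = ["Zombie post about nothing", "Risk scoring basics"]
greylist = ["Audit season survival", "My cat pictures"]

chardonnay_feed = chardonnay(whitelist, greylist)
eiswein_feed = eiswein(whitelist, greylist)
```

In this sample the zombie post survives in the Chardonnay ordering but is dropped in the Eiswein ordering, because there the whitelist goes through the relevancy filter too.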

  7.  dre Says:

    OPML please.
