2 Comments

Using Complex Event Processing (CEP) with Microsoft StreamInsight to Analyze Twitter Tweets 7: The Sample Application 3

Note: This post is one of a series, the overview can be found here: Complex Event Processing with StreamInsight

Writing the StreamInsight Queries

The sample code to the application can be downloaded here:

Scenario 1: Tweets per Second

As we can see from our pass-through implementation from the last post, we receive a large amount of tweets. In the first query we would like to find out how many tweets we receive per second. This turns out to be pretty easy!

image

We define a tumbling window with a length of one second and just call the count property. StreamInsight uses three different types of windows:

  • TumblingWindow: A fixed length window. The next window begins when the current one ends.
  • HoppingWindow: A fixed length overlapping window. We can define the time interval in which we want the next window. Example: We have a 3 second window that moves in one second hops. So we get updated data every second for the last 3 seconds.
  • SlidingWindow: The sliding window reacts to the input stream and always moves when there is a change in the stream. The window size is defined by two adjacent events.

In order for the example to work, we have to do two more things. First, we need to adjust the consoleObserver. Since the query produces scalar count values (and not TweetItems anymore) we need to adjust the generic type parameter in the DefineObserver() method from TweetItem to long.

image

Next, we need to replace the twitterstream instance with the query for the Bind() call. We now have a query that operates on the data source and bind the observer to the query.

image

That’s it! The output should look somewhat like this:

image

Amazing stuff already, but wait, there’s more!

Tweets per Second grouped by Language

For the next query we are going to group the tweets by language. We first need a data object for the computed output. As a little twist we add a method call that resolves the culture info that we find in the Tweet.

image

Then we continue and write our query. Remember that we have to adjust the generic type parameter of the observer to use the new LanguageSummary type:

image

The query leads to output like this:

image

Nice.

Language with most Tweets per Second

Next, we improve the last query by adding another query that uses the output of the last example as input. We want to know the top 5 languages in terms of tweets per second. We use the SnapShot window here to react in changes in the stream, that is changes in the output of the last query.

image

image

The most popular Tweet every Second

Next, we want to find the most popular Tweet every three seconds. First, we create a new data class:

image

Then we write a query that groups all the tweets in every 3 second window by the number of followers that the user has:

image

image

Note: The numbers after the user name show (Followers/Friends).

In the next step we improve the last example by adding the Friendcount. We want to know the person in every three second window that has most followers AND most friends. We implement this by writing two different queries for followers and friends and then join them together using a StreamInsight join operation.

Let’s start by writing the most friend query first. It is similar to the followers query:

image

Now we use the join to join the queries on the user name:

image

And that is it. If we run the sample we find out who the most popular person on twitter is ever three seconds:

image

An interesting observation that we can make looking at the data is that in some windows, e.g. 06:21:57, there is no event. This means that at that point the person with most tweets and most friends was not the same person.

The join is a very powerful operation that can give us deep insight into the data.

If you want to play around and try out more complex queries, look at the resources in the first blog post of the series, especially at the LINQPad Samples and the Hitchhiker guide.

I hope you enjoyed this series on Microsoft StreamInsight. If it provided substantial value to you, feel free to donate. 🙂

  • John Tarbox

    This is an excellent series and the example works perfectly when run as an administrator.

    I would love to see a part 4 where you stored the tweets to a SQL Server database.

  • Manuel

    Hi John,
    Many thanks for your interest!
    Storing the tweets in to an SQL Server database should be a rather trivial task. I would suggest to use Entity Framework to create an entity model that maps your model data (the tweets) into the databased structures (the tables).

    An example can be found here: http://www.codeproject.com/Articles/498885/Introduction-to-Entity-Framework

    I am currently very busy, so I am not able to write a follow-up post right now.

    Best regards!
    MM