In the first part of the article we described the general problem, proposed an architecture and outlined an implementation of the solution.

Now, we are going to focus on playing with the application, provide some statistics regarding stream processing of tweets and create some real-time visualizations.

Geolocalized tweets

Before running the application, we were looking forward to seeing a heatmap evolving dynamically with the inflow of large numbers of geolocalized tweets. Unfortunately, we were sorely disappointed: only a few new tweets appeared on the world map every few seconds. When we investigated the issue more closely, it turned out that tweets rarely have geolocation information attached.

We prepared a test in which we processed incoming tweets and checked how many of them were geotagged. We used an unfiltered stream of randomly sampled tweets and then filtered out those that weren't geotagged.

 

|                   | All tweets | Tweets with exact geolocation | Tweets with Place[1] |
| Number of tweets  | 3 977 238  | 11 513                        | 27 213               |
| Rate (tweets/h)   | 165 718    | 480                           | 1 134                |
| Percentage of all | 100%       | 0.29%                         | 0.68%                |

 

These results confirmed that the overall number of tweets with geolocation is very small. Barely 1% of the processed tweets are geotagged (either with exact GPS coordinates or with the approximate location of a place). When investigating this issue more deeply, we stumbled upon an interesting paper by Japanese researchers who tried to use Twitter for earthquake detection and localization[2]. To determine a tweet's position, they used the location the Twitter user had registered with and assumed that the tweet was posted from that place. During the registration process, users are asked to provide their location as text. Therefore, to obtain the GPS coordinates of the location, the researchers sent a request to the Google Maps API to translate the name of the location into its approximate latitude and longitude. It is a clever approach, but it is not very reliable for users who tweet while travelling or who provided a bogus location during registration.
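The fallback described in that paper can be sketched roughly as follows. This is only an illustration of the idea, not the researchers' code: the `Tweet` fields and the `geocode` callable are hypothetical, and the stand-in geocoder is a hard-coded table with approximate city coordinates rather than a call to a real geocoding service.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


@dataclass
class Tweet:
    # Hypothetical, simplified view of a tweet; not the Twitter API model.
    text: str
    coordinates: Optional[Coordinates] = None  # exact GPS position, if any
    profile_location: Optional[str] = None     # free text from the user profile


def estimate_position(
    tweet: Tweet,
    geocode: Callable[[str], Optional[Coordinates]],
) -> Optional[Coordinates]:
    """Prefer the exact position; otherwise geocode the profile location."""
    if tweet.coordinates is not None:
        return tweet.coordinates
    if tweet.profile_location:
        return geocode(tweet.profile_location)
    return None


# Stand-in geocoder (assumption for the demo): approximate city centres only.
KNOWN_PLACES = {"tokyo": (35.68, 139.69), "warsaw": (52.23, 21.01)}


def fake_geocode(name: str) -> Optional[Coordinates]:
    return KNOWN_PLACES.get(name.strip().lower())
```

A bogus profile location simply yields no position here, while a traveller with an honest profile location yields a wrong one, which is exactly the weakness of this approach.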

On average, over 500 million tweets are sent per day[3], which is around 21 million per hour (of which about 210 000 should be geotagged, based on the above percentages). This huge volume is randomly sampled on Twitter's side before being returned by the API. The limits imposed by the Twitter Streaming API let us achieve an average rate of about 165 000 tweets per hour, which is less than the number of geotagged tweets generated every hour. If we could ask Twitter to prefilter tweets for geolocation information before sampling, we could process only geotagged tweets and still use the API up to its limits. And actually, we can.
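The back-of-the-envelope estimate above can be reproduced with the counts from our unfiltered test. The snippet below is just that arithmetic; with unrounded inputs the result comes out slightly over 200 000 geotagged tweets per hour rather than the rounded 210 000, which is still above the rate we reached through the API.

```python
TWEETS_PER_DAY = 500_000_000   # average quoted in the text
SAMPLED_TOTAL = 3_977_238      # tweets seen in the unfiltered test
SAMPLED_EXACT = 11_513         # tweets with exact coordinates
SAMPLED_PLACE = 27_213         # tweets with a Place attached
API_RATE_PER_HOUR = 165_000    # approximate rate reached in the test

# Share of geotagged tweets in the random sample (about 1%).
geotagged_share = (SAMPLED_EXACT + SAMPLED_PLACE) / SAMPLED_TOTAL

# Extrapolate the share to the whole of Twitter.
tweets_per_hour = TWEETS_PER_DAY / 24
geotagged_per_hour = tweets_per_hour * geotagged_share

print(f"geotagged share:     {geotagged_share:.2%}")
print(f"all tweets per hour: {tweets_per_hour:,.0f}")
print(f"geotagged per hour:  {geotagged_per_hour:,.0f}")
print(f"exceeds API rate:    {geotagged_per_hour > API_RATE_PER_HOUR}")
```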

The Twitter API allows requesting only geolocated tweets that fall within the given bounding boxes[4]. Unfortunately, Spark does not yet support creating a Twitter stream with geolocation filtering. Consequently, we provided our own implementation of ReceiverInputDStream to support that (TwitterGeoInputDStream).
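To illustrate what such a request carries, here is a small sketch of how bounding boxes might be serialized into the `locations` request parameter, which expects comma-separated longitude,latitude pairs with the south-west corner of each box first. The `BoundingBox` class and `locations_param` helper are hypothetical names for this example and are not part of TwitterGeoInputDStream; the San Francisco values are only approximate example coordinates.

```python
from typing import NamedTuple, Sequence


class BoundingBox(NamedTuple):
    # Corners in degrees; the south-west corner comes first.
    west: float   # minimum longitude
    south: float  # minimum latitude
    east: float   # maximum longitude
    north: float  # maximum latitude

    def contains(self, longitude: float, latitude: float) -> bool:
        # Simplification: exact coordinates only, and boxes that cross
        # the antimeridian (180 degrees) are not handled.
        return (self.west <= longitude <= self.east
                and self.south <= latitude <= self.north)


def locations_param(boxes: Sequence[BoundingBox]) -> str:
    """Serialize boxes as 'west,south,east,north,...' (longitude before latitude)."""
    return ",".join(f"{value}" for box in boxes for value in box)


WORLD = BoundingBox(-180, -90, 180, 90)
SAN_FRANCISCO = BoundingBox(-122.75, 36.8, -121.75, 37.8)  # approximate example box

print(locations_param([WORLD]))
print(SAN_FRANCISCO.contains(-122.42, 37.77))
```

Passing a single box that covers the whole world is one way to ask for every geolocated tweet regardless of where it was posted.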

 

|                   | Geolocalized tweets | Tweets with exact geolocation | Tweets with Place[1] |
| Number of tweets  | 3 901 877           | 520 510                       | 3 381 367            |
| Rate (tweets/h)   | 162 578             | 21 688                        | 140 890              |
| Percentage of all | 100%                | 13.34%                        | 86.66%               |

 

These results are great news for creating a real-time visualization of incoming tweets on a world map. However, we doubt that those prefiltered tweets could be used for drawing any far-reaching conclusions in social research, as they are sampled from only 1% of the total and their contents differ from general tendencies (as we noticed, most geotagged tweets involve travelling and food).

World Tweeting Tendencies – Heatmaps

In spite of the low percentage of geolocalized tweets among the randomly sampled ones, we made some attempts at visualizing them on a map. To that end, we ran the application for about twenty-four hours to gather a reasonable set of geolocalized tweets (about eight thousand) and visualized all of them on a map.

World heatmap generated after running application for 24h

Based on the geolocalized tweets, we can observe that the most active Twitter users are in the USA, Indonesia and Europe (especially the UK and Turkey).

As the greatest number of active users is located in the USA, we also generated a heatmap for tweets from this country.

Heatmap generated after running application for 24h for region of the US

Going further, we can roughly determine Twitter popularity across the continents.

Twitter’s popularity across the continents based on the number of geotagged tweets.

After the test with tweets collected from a randomly sampled stream, we ran another 24-hour test, this time using a stream with location prefiltering. The number of incoming tweets was too great to put them all on a Google Maps heatmap, so the map was dynamically updated with 5 000 of the newest tweets.
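Keeping only the newest tweets amounts to a fixed-size sliding window. The sketch below shows one possible way to do it; it is not the application's actual code, and the `NewestPoints` class is a hypothetical name for this example.

```python
from collections import deque
from typing import Deque, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


class NewestPoints:
    """Keeps only the most recent `capacity` points for the map layer."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen drops the oldest item automatically when full.
        self._points: Deque[Point] = deque(maxlen=capacity)

    def add(self, point: Point) -> None:
        self._points.append(point)

    def snapshot(self) -> List[Point]:
        """Points ordered from oldest to newest, ready to send to the map."""
        return list(self._points)


# Example: a window of three points fed with five.
window = NewestPoints(capacity=3)
for i in range(5):
    window.add((float(i), 0.0))
print(window.snapshot())
```

With a capacity of 5 000 the memory use stays constant no matter how long the stream runs, and each update only has to redraw the current window.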

Heatmap composed of the newest 10 000 tweets at the time

In the GIF above, we can observe how tweeting trends change with the time of day. We started the test in the afternoon (CET), so the greatest intensity is initially in Europe and the eastern part of the USA. With time, this wave moves to the west coast of the USA and to the Asian islands (it's nighttime in Europe). Finally, in the morning hours, we can observe great tweeting intensity in Europe and Asia, whereas America is asleep.

Marker maps

We performed a test similar to the one described in the previous section, using map markers instead of a heatmap. With this kind of map, we are limited to around 300 tweets, because more markers make the map both unreadable and slow.

Displaying the newest 300 geotagged tweets in real-time

Although the map becomes less readable with markers, it better visualizes how tweets flow in. Interestingly, we can clearly see how the USA is divided (in terms of the number of Twitter users). There are lots of active users across the entire eastern part, whereas in the west increased activity is visible only in California and Washington.

Tweet markers on a map of the USA after running the application for 24h

Keyword-based filtering

For a short test, we added basic keyword-based filtering of the incoming tweets. As an example, we tried to select tweets about Java, using keywords such as “java”, “spring”, “hibernate”, “vaadin”, “ejb”, etc.

As might be expected, keywords alone are not enough to classify tweets into categories. This applies in particular to tweets about programming, because programming frameworks are usually named with words from the “normal” world.
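A naive filter of this kind can be sketched in a few lines. The function below is a hypothetical illustration, not the filter used in the application, and the sample texts in the usage note are made up; it matches whole words case-insensitively against the keyword list.

```python
import re
from typing import Iterable

JAVA_KEYWORDS = ("java", "spring", "hibernate", "vaadin", "ejb")


def matches_keywords(text: str, keywords: Iterable[str] = JAVA_KEYWORDS) -> bool:
    """True if any keyword occurs in the text as a whole word, ignoring case."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return any(keyword in words for keyword in keywords)
```

A text such as "Migrating our EJB services to Spring Boot" passes the filter, but so does "Spring is finally here, time for a picnic!", which has nothing to do with programming. Whole-word matching at least keeps "JavaScript" from matching "java", yet it cannot tell the framework from the season.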

tweets with geolocation

Of course, one could argue that in the case shown in the picture, somebody could have in mind the new upcoming version of Spring Framework, but…we all know the truth.

Final word

Implementing and playing with this small application not only brought us tons of fun, but also gave us a lot of useful information about processing geolocalized tweets. We also hope this article has shown that an application with a well-designed architecture needs only a small amount of code to combine various technologies and remain functional.

References

1. http://twitter4j.org/javadoc/twitter4j/Place.html

2. Takeshi Sakaki, Makoto Okazaki, Yutaka Matsuo, “Earthquake shakes Twitter users: real-time event detection by social sensors”, Proceedings of the 19th International Conference on World Wide Web, 2010, USA
http://www.ymatsuo.com/papers/www2010.pdf

3. http://expandedramblings.com/index.php/march-2013-by-the-numbers-a-few-amazing-twitter-stats/10/

4. https://dev.twitter.com/streaming/overview/request-parameters#locations
