In the first part of the article we described the general problem, proposed an architecture and outlined an implementation of the solution.
Now, we are going to focus on playing with the application, provide some statistics regarding stream processing of tweets and create some real-time visualizations.
Before running the application, we were looking forward to seeing a heatmap evolving dynamically with the inflow of masses of geolocalized tweets. Unfortunately, we were sorely disappointed: only a few new tweets appeared on the world map every few seconds. When we investigated the issue more closely, it turned out that tweets rarely have geolocation information attached.
We prepared a test in which we processed incoming tweets and checked how many of them were geotagged. We used an unfiltered stream of randomly sampled tweets and then filtered out those that weren’t geotagged.
| | All tweets | Tweets with exact geolocation | Tweets with Place |
|---|---|---|---|
| Number of tweets | 3 977 238 | 11 513 | 27 213 |
| Rate (tweets/h) | 165 718 | 480 | 1 134 |
| Percentage of all | 100% | 0.29% | 0.68% |
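The split into the three columns above can be sketched roughly as follows. This is only an illustration, not the code used in the test: it assumes tweets arrive as decoded JSON objects with the `coordinates` and `place` fields of the Twitter tweet format, and it assumes (as the table suggests) that a tweet with exact coordinates is counted only once, in the "exact" group. The function names are hypothetical.

```python
from collections import Counter


def classify(tweet: dict) -> str:
    """Return 'exact', 'place' or 'none' for one decoded tweet object."""
    if tweet.get("coordinates"):   # GeoJSON point with exact GPS position
        return "exact"
    if tweet.get("place"):         # approximate location (city, venue, ...)
        return "place"
    return "none"


def geotag_stats(tweets):
    """Map each category to a (count, percentage of all tweets) pair."""
    counts = Counter(classify(t) for t in tweets)
    total = sum(counts.values())
    return {
        kind: (counts[kind], 100.0 * counts[kind] / total if total else 0.0)
        for kind in ("exact", "place", "none")
    }


# Tiny hand-made sample (assumption: real tweets carry many more fields).
sample = [
    {"text": "a", "coordinates": {"type": "Point", "coordinates": [21.01, 52.23]}},
    {"text": "b", "coordinates": None, "place": {"full_name": "Warsaw, Poland"}},
    {"text": "c", "coordinates": None, "place": None},
    {"text": "d"},
]
stats = geotag_stats(sample)
```

In a streaming setup the same `classify` function would be applied per tweet and the counters accumulated over time instead of over a list.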
The results of this test confirmed that the overall number of tweets with geolocation is very small. Barely 1% of the processed tweets are geotagged (either with exact GPS coordinates or with the approximate location of a place). While investigating this issue further, we stumbled upon an interesting paper by Japanese researchers who tried to use Twitter for earthquake detection and localization. To determine the geoposition of a tweet, they used the location that the Twitter user had registered with and assumed that the tweet was posted from that place. During the registration process, users are asked to provide their location as text. Therefore, to obtain GPS coordinates, the researchers sent a request to the Google Maps API to translate the name of the location into its approximate latitude and longitude. Although this is a clever approach, it is not very reliable for users who tweet while travelling or who provided a bogus location during registration.
On average, over 500 million tweets are sent per day, which is around 21 million per hour (of which about 210 000 should be geotagged, based on the above percentages). This huge volume is randomly sampled on Twitter’s side before being returned by the API. The limits imposed by the Twitter Streaming API let us achieve an average rate of about 165 000 tweets per hour, which is less than the number of geotagged tweets generated every hour. So if we could ask Twitter to prefilter tweets for geolocation information before sampling, we could process only geotagged tweets and still use the API up to its limits. And actually, we can.
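The back-of-the-envelope estimate above can be reproduced with a few lines of arithmetic. The sketch below uses only numbers quoted in this article (500 million tweets per day and the counts from the first test). Because it uses the measured share of about 0.97% instead of the rounded 1%, it arrives at roughly 203 000 geotagged tweets per hour rather than 210 000. The variable names are ours.

```python
# Figures quoted in the article.
TWEETS_PER_DAY = 500_000_000
SAMPLED_TOTAL = 3_977_238          # all tweets seen in the 24-hour test
SAMPLED_GEOTAGGED = 11_513 + 27_213  # exact geolocation + Place
STREAM_RATE_PER_HOUR = 165_718     # average rate allowed by the sampled stream

tweets_per_hour = TWEETS_PER_DAY / 24                  # about 20.8 million
geotagged_share = SAMPLED_GEOTAGGED / SAMPLED_TOTAL    # about 0.0097
geotagged_per_hour = tweets_per_hour * geotagged_share  # about 203 000

# Fraction of all tweets that the sampled stream delivers (under 1%).
sampled_fraction = STREAM_RATE_PER_HOUR / tweets_per_hour

print(f"all tweets per hour:       {tweets_per_hour:,.0f}")
print(f"geotagged tweets per hour: {geotagged_per_hour:,.0f}")
print(f"stream limit per hour:     {STREAM_RATE_PER_HOUR:,}")
print(f"share delivered by stream: {sampled_fraction:.2%}")
```

The key comparison is that the stream limit is below the estimated hourly volume of geotagged tweets, so a prefiltered stream can still saturate the limit.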
The Twitter API allows requesting only geolocated tweets that fall within given bounding boxes. Unfortunately, Spark does not yet provide a way to create a Twitter stream with geolocation filtering. Consequently, we provided our own implementation of ReceiverInputDStream to support that (TwitterGeoInputDStream).
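To illustrate what a bounding-box filter looks like, here is a small sketch that is independent of Spark and of our TwitterGeoInputDStream. It relies on the format of the Streaming API `locations` parameter (comma-separated longitude,latitude pairs, south-west corner first). The `BoundingBox` type, the helper names and the choice of a whole-world box are assumptions made for this example only.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Box given as south-west corner then north-east corner, longitude first."""
    west: float   # minimum longitude
    south: float  # minimum latitude
    east: float   # maximum longitude
    north: float  # maximum latitude

    def contains(self, lon: float, lat: float) -> bool:
        # Simplification: boxes crossing the antimeridian are not handled.
        return self.west <= lon <= self.east and self.south <= lat <= self.north


def locations_param(boxes) -> str:
    """Serialize boxes into the comma-separated 'locations' request value."""
    return ",".join(f"{value:g}" for box in boxes for value in box)


WORLD = BoundingBox(-180.0, -90.0, 180.0, 90.0)
SAN_FRANCISCO = BoundingBox(-122.75, 36.8, -121.75, 37.8)

print(locations_param([WORLD]))          # -180,-90,180,90
print(locations_param([SAN_FRANCISCO]))  # -122.75,36.8,-121.75,37.8
```

A single box covering the whole world asks for every geotagged tweet, while smaller boxes restrict the stream to a region of interest.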
| | Geolocalized tweets | Tweets with exact geolocation | Tweets with Place |
|---|---|---|---|
| Number of tweets | 3 901 877 | 520 510 | 3 381 367 |
| Rate (tweets/h) | 162 578 | 21 688 | 140 890 |
| Percentage of all | 100% | 13.34% | 86.66% |
These results are great news for creating a real-time visualization of incoming tweets on a world map. However, we doubt that the prefiltered tweets could be used to draw any far-reaching conclusions in social research, as they come from only about 1% of all tweets and their content differs considerably from general tendencies (as we noticed, most geotagged tweets are about travelling and food).
World Tweeting Tendencies – Heatmaps
In spite of the low percentage of geolocalized tweets among all randomly sampled ones, we attempted to visualize them on a map. To this end, we ran the application for about twenty-four hours to gather a reasonable set of geolocalized tweets (about eight thousand) and plotted all of them on a map.
Based on the geolocalized tweets, we can observe that the most active Twitter users are in the USA, Indonesia and Europe (especially the UK and Turkey).
As the greatest number of active users is located in the USA, we also took a closer look at a heatmap of tweets from this country.
Going further, we can roughly determine Twitter popularity across the continents.
After the test with tweets collected from a randomly sampled stream, we ran another 24-hour test, this time using a stream with location prefiltering. The number of incoming tweets was too great to put them all on a Google Maps heatmap, so the map was dynamically updated to show only the 5 000 newest tweets.
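Keeping only the newest tweets is essentially a fixed-size sliding window. The sketch below shows one simple way to do it with a bounded deque. It is an illustration under our own naming (`NewestTweets` is hypothetical), not the code that fed the map.

```python
from collections import deque


class NewestTweets:
    """Keeps the coordinates of the most recent tweets, dropping the oldest."""

    def __init__(self, capacity: int = 5000):
        # A deque with maxlen discards items from the left when it is full.
        self._points = deque(maxlen=capacity)

    def add(self, lat: float, lon: float) -> None:
        self._points.append((lat, lon))

    def snapshot(self) -> list:
        """Return the current window, oldest first, e.g. to redraw a heatmap."""
        return list(self._points)


# Small capacity so that the eviction is easy to see.
window = NewestTweets(capacity=3)
for i in range(5):
    window.add(float(i), float(-i))

print(window.snapshot())  # the three most recently added points
```

Appending to a bounded deque takes constant time, so the window keeps up with the stream regardless of how many tweets have already passed through it.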
In the GIF above, we can observe how tweeting trends change with the time of day. We started the test in the afternoon (CET), so at first the greatest intensity can be observed in Europe and the eastern part of the USA. With time, this wave moves to the west coast of the USA and to the Asian islands (it’s nighttime in Europe). Finally, in the morning hours, we can observe high tweeting intensity in Europe and Asia, whereas America is asleep.
We performed a test similar to the one described in the previous section, using map markers instead of a heatmap. With this kind of visualization we are limited to around 300 tweets, because more markers make the map both unreadable and slow.
Although the map becomes less readable with markers, it better visualizes how tweets flow in. Interestingly, we can clearly see how the USA is divided (in terms of the number of Twitter users). There are lots of active users across the entire eastern part, whereas in the west activity is higher only in California and Washington.
For a short test, we added basic keyword-based filtering of the incoming tweets. As an example, we tried to filter tweets that involve Java, so we used keywords such as “java”, “spring”, “hibernate”, “vaadin”, “ejb”, etc.
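A minimal sketch of this kind of filter is shown below. It assumes case-insensitive, whole-word matching against the keywords listed above. The exact matching rules of our application may differ, and the function names and sample tweets are invented for the example.

```python
import re

KEYWORDS = ("java", "spring", "hibernate", "vaadin", "ejb")

# One case-insensitive pattern that matches any keyword as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE
)


def matched_keywords(text: str) -> set:
    """Return the set of keywords (lower-cased) found in the tweet text."""
    return {match.group(1).lower() for match in _PATTERN.finditer(text)}


def involves_java(text: str) -> bool:
    return bool(matched_keywords(text))


tweets = [
    "Migrating our EJB services to Spring Boot",
    "Can't wait for spring, this winter is too long",
    "JavaScript is fun",
]
for text in tweets:
    print(involves_java(text), sorted(matched_keywords(text)), text)
```

Note that the first two sample tweets both pass the filter, while the third does not, because "JavaScript" is not the whole word "java".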
As one might expect, keywords alone are not enough to classify tweets into categories. This applies in particular to tweets about programming, because programming frameworks are often named with words from the “normal” world.
Of course, one could argue that in the case shown in the picture, somebody could have in mind the new upcoming version of Spring Framework, but…we all know the truth.
Implementing and playing with this small application not only brought us tons of fun, but also gave us a lot of useful information about processing geolocalized tweets. We also hope this article has shown that an application with a well-designed architecture needs only a small amount of code to combine various technologies and still be functional.
2. Takeshi Sakaki, Makoto Okazaki, Yutaka Matsuo, “Earthquake shakes Twitter users: real-time event detection by social sensors”, Proceedings of the 19th International Conference on World Wide Web, 2010, USA