Socializing
Building an Application to Consume and Filter Twitter Firehose Data
Consuming Twitter's Firehose, the complete real-time stream of public tweets, is a monumental task: at an estimated 110 million tweets per day, the data volume alone is a considerable challenge.
Challenges and Solutions
The sheer volume of data can be overwhelming, so a robust ingestion pipeline is essential. Apache Flume is often used to collect large volumes of data reliably and to scale horizontally. However, Flume is primarily designed for log collection, so several modifications were needed to adapt it to Twitter's stream. The Flume community, active and supportive, provided much-needed assistance during this process.
Third-Party Services and Fees
Directly accessing the Firehose requires significant financial investment, so alternative services such as GNIP provide options for filtering and accessing the stream. Power Track, offered by GNIP, allows remote filtering of the Twitter stream at a cost of a few thousand dollars per month. For those looking for a more cost-effective solution, Spritzer can be utilized, though it provides access to only a limited portion of the Firehose.
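Whether filtering happens remotely through a service like Power Track or locally in your own consumer, the core idea is the same: discard tweets that match none of your rules before they reach expensive processing. The sketch below is a minimal, hypothetical keyword filter; the rule format and the `text` field name are assumptions for illustration, not GNIP's actual API.

```python
import json

def matches_rules(tweet, tracked):
    """Return True if the tweet text contains any tracked keyword."""
    text = tweet.get("text", "").lower()
    return any(kw in text for kw in tracked)

def filter_stream(lines, keywords):
    """Yield only the tweets whose text matches at least one keyword.

    `lines` is an iterable of newline-delimited JSON payloads, the
    format Twitter's streaming endpoints use.
    """
    tracked = {kw.lower() for kw in keywords}
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip keep-alive blanks and malformed payloads
        if matches_rules(tweet, tracked):
            yield tweet
```

Filtering as early as possible keeps the downstream queue and processors sized for the tweets you care about, not the full stream.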
Setup and Infrastructure Requirements
To effectively consume and process the Twitter Firehose or one of its variants (e.g., Gardenhose, Spritzer), a dedicated server with ample bandwidth is essential. For smaller volumes, a Virtual Private Server (VPS) may suffice. As for software, any major server-side language capable of maintaining long-running connections without excessive memory consumption can be used. PHP, Java, Ruby, Perl, and others fit this requirement. Among them, we have opted for PHP for its versatility and ease of use.
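The "long-running connection without excessive memory" requirement comes down to processing the stream one tweet at a time rather than buffering it. Here is a minimal sketch of that pattern (shown in Python for brevity; in production `line_iter` would be a streaming HTTP response, e.g. one opened with `requests.get(..., stream=True)`, and `handle` is whatever your application does with a tweet):

```python
import json

def consume_stream(line_iter, handle):
    """Process a newline-delimited JSON stream one tweet at a time.

    Reading line by line keeps memory flat no matter how long the
    connection stays open, which is the key requirement for any
    language used to consume the Firehose.
    """
    count = 0
    for raw in line_iter:
        raw = raw.strip()
        if not raw:           # streaming endpoints send blank keep-alive lines
            continue
        handle(json.loads(raw))
        count += 1
    return count
```

Because the consumer never accumulates more than one tweet in memory, the same loop works whether the connection lasts a minute or a month.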
Queueing software like Beanstalkd, Gearman, Starling, RabbitMQ, or similar solutions is integral to the process. Queueing ensures that incoming data is managed and distributed effectively, preventing the system from being overwhelmed during large influxes of tweets. By queueing everything sent by Twitter and processing it through a separate thread or worker, you can maintain system stability, especially during peak activity.
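The queueing pattern described above can be sketched with an in-process bounded queue. This is an illustration of the shape of the pipeline, not a substitute for Beanstalkd or RabbitMQ: the reader side only enqueues what Twitter sends, and separate workers drain the queue at their own pace, so a burst of tweets backs up in the queue instead of overwhelming the processors.

```python
import json
import queue
import threading

def run_pipeline(raw_lines, process, num_workers=2):
    """Decouple ingestion from processing with a bounded queue."""
    q = queue.Queue(maxsize=10_000)  # bound provides back-pressure

    def worker():
        while True:
            item = q.get()
            if item is None:          # sentinel: shut this worker down
                return
            process(json.loads(item))

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for line in raw_lines:            # ingestion side: enqueue only
        q.put(line)
    for _ in workers:                 # one sentinel per worker
        q.put(None)
    for w in workers:
        w.join()
```

With a standalone broker the structure is identical; the producer and the workers simply become separate processes (or machines) talking to the queue server.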
Rate Limits and Privacy
Twitter's streaming APIs are subject to rate limiting, which prevents excessive use and protects Twitter's resources. Standard users do not receive the full Firehose; they are limited to a specific percentage of the overall volume. Companies with significant financial backing can occasionally acquire the full Firehose.
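A rate-limited or dropped streaming connection should be retried with exponential backoff rather than hammered in a tight loop. The sketch below follows the 5-second, doubling, capped-at-320-seconds schedule that Twitter's streaming documentation recommended for HTTP errors; `connect` is a placeholder for whatever function opens your stream.

```python
import time

def backoff_delays(initial=5.0, cap=320.0):
    """Generate exponential reconnect delays: 5s, 10s, ... capped at 320s."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay *= 2

def connect_with_backoff(connect, max_attempts=8, sleep=time.sleep):
    """Retry a flaky streaming connection with exponential backoff."""
    delays = backoff_delays()
    for _ in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            sleep(next(delays))
    raise RuntimeError("could not establish streaming connection")
```

Backing off aggressively matters: reconnecting too fast after a rate-limit response can get a consumer blocked outright.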
Our agreement with Twitter does not allow us to disclose the exact volume of data we can access. The only third-party service with authorized access to the Firehose is GNIP, which recently introduced a product called Power Track for filtering and accessing the stream.
Conclusion
Building an application to consume and filter the Twitter Firehose requires careful planning and robust infrastructure. Utilizing tools like Flume, customized third-party services, and queueing systems can help manage the vast amount of data efficiently. Understanding the rate limits and the available third-party services is crucial for maximizing the potential of real-time Twitter data.