Analytics/Kraken/Firehose

This is a place to brainstorm ways to reliably get data into Kraken HDFS from the udp2log multicast firehose stream in the short term. We want data nooowww!

This does not replace our desired final Kraken architecture, or the Request Logging proposal outlined at Request Logging. This is meant to be a place to list the ways we have tried to get data into Kraken reliably, and other ways we have yet to try.

Pipeline Components

Our current goal is to get large UDP webrequest stream(s) into HDFS. There are a bunch of components we can use to build a pipeline to do so.

Sources / Producers

Agents / Buffers / Brokers

Sinks / Consumers


Possible Pipelines

udp2log -> KafkaProducer shell -> KafkaBroker -> kafka-hadoop-consumer -> HDFS

This is our main solution; it works most of the time, but it drops data. udp2log and the producers are currently running on an03, an04, an05 and an06, and kafka-hadoop-consumer runs as a cron job on an02.
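
As a rough sketch of the producer side of this pipeline: read udp2log lines from stdin and send each one to a Kafka topic. The real setup pipes udp2log into a Kafka console producer from the shell; this illustration uses the Java client instead, and the broker address and topic name are assumptions.

  // Minimal sketch: read udp2log lines from stdin, one Kafka message per line.
  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.util.Properties;

  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class Udp2logKafkaProducer {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.put("bootstrap.servers", "analytics-broker:9092");   // assumed broker address
          props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

          try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
               BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
              String line;
              while ((line = in.readLine()) != null) {
                  // One webrequest log line per message; "webrequest" topic name is assumed.
                  producer.send(new ProducerRecord<>("webrequest", line));
              }
          }
      }
  }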

Flume UDPSource -> HDFS

udp2log -> files + logrotate -> Flume SpoolingFileSource -> HDFS

udp2log -> files + logrotate -> cron hadoop fs -put -> HDFS

UDPKafka -> KafkaBroker -> kafka-hadoop-consumer -> HDFS

Storm Pipeline

The ideal pipeline is probably still the originally proposed architecture, which involves modifying the frontend production nodes as well as using Storm.

Native KafkaProducers -> Loadbalancer (LVS?) -> KafkaBroker -> Storm Kafka Spout -> Storm ETL -> Storm HDFS writer -> HDFS
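
For reference, a rough sketch of how this topology might be wired together in Java, assuming the storm-kafka KafkaSpout linked below and the pre-Apache backtype.storm / storm.kafka package names of that era. The ZooKeeper host, topic, ZooKeeper root, consumer id, and component names are all assumptions, and the ETL bolt is only a stand-in for real parsing.

  import java.util.Map;

  import backtype.storm.Config;
  import backtype.storm.StormSubmitter;
  import backtype.storm.spout.SchemeAsMultiScheme;
  import backtype.storm.task.OutputCollector;
  import backtype.storm.task.TopologyContext;
  import backtype.storm.topology.OutputFieldsDeclarer;
  import backtype.storm.topology.TopologyBuilder;
  import backtype.storm.topology.base.BaseRichBolt;
  import backtype.storm.tuple.Fields;
  import backtype.storm.tuple.Tuple;
  import backtype.storm.tuple.Values;

  import storm.kafka.KafkaSpout;
  import storm.kafka.SpoutConfig;
  import storm.kafka.StringScheme;
  import storm.kafka.ZkHosts;

  public class WebrequestTopology {

      // ETL bolt: would parse each raw udp2log line into fields; here it just
      // trims the line and passes it on, standing in for real ETL work.
      public static class EtlBolt extends BaseRichBolt {
          private OutputCollector collector;

          @Override
          public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
              this.collector = collector;
          }

          @Override
          public void execute(Tuple tuple) {
              String rawLine = tuple.getString(0);
              collector.emit(tuple, new Values(rawLine.trim()));
              collector.ack(tuple);
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("logline"));
          }
      }

      public static void main(String[] args) throws Exception {
          // Kafka spout: consumes the webrequest topic and emits one tuple per log line.
          SpoutConfig spoutConfig = new SpoutConfig(
                  new ZkHosts("zookeeper-host:2181"),  // assumed ZooKeeper ensemble
                  "webrequest",                        // assumed topic
                  "/kafka-storm",                      // assumed ZK root for offsets
                  "webrequest-reader");                // assumed consumer id
          spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

          TopologyBuilder builder = new TopologyBuilder();
          builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 4);
          builder.setBolt("etl", new EtlBolt(), 4).shuffleGrouping("kafka-spout");
          // An "hdfs-writer" bolt would subscribe to "etl" and append lines to HDFS
          // (see the FileSystem.append() sketch under Links).

          StormSubmitter.submitTopology("webrequest-firehose", new Config(),
                  builder.createTopology());
      }
  }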

Links

  • Kafka Spout - a Kafka consumer that emits Storm tuples.
  • Storm State - Storm libraries for continually persisting a collection (map or list) to HDFS. I don't think this would fit our needs for writing to Hadoop, as I believe it wants to save serialized Java objects.
  • HDFS API Docs for FileSystem.append() - see the usage sketch below.
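
A minimal sketch of what using FileSystem.append() could look like, assuming the cluster has append support enabled; the file path is made up.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsAppendExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);

          // Append a single log line to an existing HDFS file (path is hypothetical).
          Path logFile = new Path("/wmf/data/webrequest/current.log");
          try (FSDataOutputStream out = fs.append(logFile)) {
              out.write("one webrequest log line\n".getBytes("UTF-8"));
              out.hsync();  // flush to the datanodes so readers can see the data
          }
      }
  }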