Analytics/Kraken/Firehose

This page is a place to brainstorm ways to reliably get data into Kraken HDFS from the udp2log multicast firehose stream in the short term. We want data nooowww!

This does not replace our desired final Kraken architecture, or the proposal outlined at Request Logging. It is meant to be a place to list the ways we have tried to get data into Kraken reliably, and the other ways we have yet to try.

Pipeline Components

Our current goal is to get the large UDP webrequest stream(s) into HDFS. There are several components we can combine to build a pipeline that does this.

Sources / Producers

Agents / Buffers / Brokers

Sinks / Consumers


Possible Pipelines

udp2log -> KafkaProducer shell -> KafkaBroker -> kafka-hadoop-consumer -> HDFS

This is our main solution; it works most of the time, but drops data. udp2log and the producers are currently running on an03, an04, an05, and an06, and kafka-hadoop-consumer runs as a cron job on an02.
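
For illustration, here is a minimal sketch of what the producer stage does: read udp2log lines from stdin and send each one to a Kafka topic. The actual producer in this pipeline is the shell producer mentioned above; this sketch just shows the equivalent logic using the Java Kafka client, and the broker address and topic name are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Reads udp2log lines from stdin and forwards each one to a Kafka topic.
public class StdinKafkaProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker address and serializers; substitute the real broker list.
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Fire-and-forget send; silent delivery failures here are one way
                // a pipeline like this can drop data. "webrequest" is a placeholder topic.
                producer.send(new ProducerRecord<>("webrequest", line));
            }
        }
    }
}
```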

Flume UDPSource -> HDFS

udp2log -> files + logrotate -> Flume SpoolingFileSource -> HDFS

udp2log -> files + logrotate -> cron hadoop fs -put -> HDFS
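
The cron job here would just invoke hadoop fs -put on each newly rotated file. As a sketch, the same operation through the Hadoop FileSystem Java API looks roughly like this; the local and HDFS paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copies a rotated udp2log file into HDFS, equivalent to `hadoop fs -put <src> <dst>`.
public class PutRotatedLog {
    public static void main(String[] args) throws Exception {
        // Placeholder paths; a cron wrapper script would pass in the latest rotated file.
        Path localFile = new Path("/var/log/udp2log/webrequest.log.1");
        Path hdfsDir = new Path("hdfs://namenode/wmf/raw/webrequest/");

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(hdfsDir.toUri(), conf);
        // copyFromLocalFile(delSrc, overwrite, src, dst)
        fs.copyFromLocalFile(false, true, localFile, hdfsDir);
        fs.close();
    }
}
```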

UDPKafka -> KafkaBroker -> kafka-hadoop-consumer -> HDFS
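
UDPKafka is not described above; presumably it is a small daemon that joins the multicast firehose directly and produces each datagram to Kafka, taking udp2log and the shell pipe out of the path. A minimal sketch of that idea follows; the multicast group, port, broker, and topic are all placeholders.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Joins the webrequest multicast group and produces each datagram straight to Kafka.
public class UdpToKafka {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");  // placeholder broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        byte[] buf = new byte[65535];
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             MulticastSocket socket = new MulticastSocket(8420)) {        // placeholder port
            socket.joinGroup(InetAddress.getByName("233.0.0.1"));         // placeholder group
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                String lines = new String(packet.getData(), 0, packet.getLength(),
                                          StandardCharsets.UTF_8);
                producer.send(new ProducerRecord<>("webrequest", lines)); // placeholder topic
            }
        }
    }
}
```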

Storm Pipeline

The ideal pipeline is probably still the originally proposed architecture, which involves modifying the frontend production nodes and using Storm.

Native KafkaProducers -> Loadbalancer (LVS?) -> KafkaBroker -> Storm Kafka Spout -> Storm ETL -> Storm HDFS writer -> HDFS

  • Kafka Spout - a Kafka consumer that emits Storm tuples.
  • Storm State - Storm libraries for continually persisting a collection (map or list) to HDFS. I don't think this would fit our needs for writing to Hadoop, as I believe it wants to save serialized Java objects.
  • HDFS API Docs for FileSystem.append() (see the append sketch below)
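
As a concrete sketch of the last stage, here is roughly what a Storm HDFS writer bolt could look like using FileSystem.append(). This assumes one string field per tuple coming out of the ETL bolt, a single hard-coded output path, and a cluster with HDFS append support enabled; none of that is decided here, and the backtype.storm packages shown are the pre-Apache Storm names.

```java
import java.net.URI;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

// A terminal bolt that appends each incoming log line to a file in HDFS.
public class HdfsAppendBolt extends BaseRichBolt {
    // Placeholder output path; a real bolt would bucket output by time/partition.
    private static final String HDFS_FILE = "hdfs://namenode/wmf/raw/webrequest/current.log";

    private OutputCollector collector;
    private FSDataOutputStream out;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            Path path = new Path(HDFS_FILE);
            FileSystem fs = FileSystem.get(URI.create(HDFS_FILE), new Configuration());
            out = fs.exists(path) ? fs.append(path) : fs.create(path);
        } catch (Exception e) {
            throw new RuntimeException("could not open HDFS output file", e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            // Assumes the upstream ETL bolt emits a single string field per tuple.
            // Data only becomes durable on flush/close, so a real bolt would also
            // hflush() periodically and handle file rotation.
            out.write((tuple.getString(0) + "\n").getBytes("UTF-8"));
            collector.ack(tuple);
        } catch (Exception e) {
            collector.fail(tuple);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: nothing to emit downstream.
    }
}
```

The time bucketing, file rotation, and flushing that this sketch skips are part of why the Storm pipeline is more work than the cron-based ones above.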