Analytics/Kraken/Firehose
This is a place to brainstorm ways to reliably get data from the udp2log multicast firehose stream into Kraken HDFS in the short term. We want data nooowww!
This does not replace our desired final Kraken architecture, or the proposal outlined at Request Logging. It is meant to be a place to list the ways we have tried to get data into Kraken reliably, and other ways we have yet to try.
Pipeline Components
Our current goal is to get the large UDP webrequest stream(s) into HDFS. There are a number of components we can combine into a pipeline to do so.
Sources / Producers
- udp2log
- Flume UDPSource (custom)
- Flume Spooling Directory Source
- KafkaProducer Shell (kafka-console-producer)
- Ori's UDPKafka
Agents / Buffers / Brokers
- Flume Memory Channel (volatile)
- Flume File Channel
- KafkaBroker
- plain old files
Sinks / Consumers
- Flume HDFS Sink
- kafka-hadoop-consumer (3rd party, has Zookeeper support)
- Kafka HadoopConsumer (ships with Kafka, no Zookeeper support)
- plain old cron jobs + hadoop fs -put
Possible Pipelines
udp2log -> KafkaProducer shell -> KafkaBroker -> kafka-hadoop-consumer -> HDFS
This is our main solution; it works most of the time, but drops data. udp2log and the Kafka producers are currently running on an03, an04, an05, and an06, and kafka-hadoop-consumer runs as a cron job on an02.
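As a sketch of what the consumer end of this pipeline does, here is roughly the consume-from-Kafka-and-write-to-HDFS step in Java. This is not kafka-hadoop-consumer's actual code (which handles partitions and tracks offsets in ZooKeeper), it uses the current Kafka Java client rather than the 0.7-era API we actually run, and the broker, consumer group, topic, and HDFS path are made-up placeholders:

    import java.nio.charset.StandardCharsets;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // Sketch: pull webrequest lines out of a Kafka topic and write them to one HDFS file.
    public class KafkaToHdfs {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-broker:9092");   // hypothetical broker
            props.put("group.id", "webrequest-hdfs");              // hypothetical consumer group
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            FileSystem fs = FileSystem.get(new Configuration());
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 // hypothetical output path; one file per run, like one cron invocation
                 FSDataOutputStream out = fs.create(new Path("/wmf/raw/webrequest/batch-0001"))) {
                consumer.subscribe(Collections.singletonList("webrequest"));   // hypothetical topic
                for (int i = 0; i < 100; i++) {   // bounded run instead of looping forever
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> r : records) {
                        out.write((r.value() + "\n").getBytes(StandardCharsets.UTF_8));
                    }
                }
            }
            fs.close();
        }
    }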
Flume UDPSource -> HDFS
udp2log -> files + logrotate -> Flume Spooling Directory Source -> HDFS
udp2log -> files + logrotate -> cron hadoop fs -put -> HDFS
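The cron + hadoop fs -put variant needs nothing beyond the stock Hadoop CLI, but for reference, the same step via the Hadoop FileSystem Java API looks roughly like this (the local and HDFS paths are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: copy one rotated udp2log file into HDFS,
    // roughly what `hadoop fs -put <local> <dest>` does.
    public class PutLogFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path local = new Path("/var/log/webrequest/webrequest.log.1");  // hypothetical rotated file
            Path dest  = new Path("/wmf/raw/webrequest/webrequest.log.1");  // hypothetical HDFS location

            fs.copyFromLocalFile(local, dest);
            fs.close();
        }
    }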
UDPKafka -> KafkaBroker -> kafka-hadoop-consumer -> HDFS
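The following is only a guess at the shape of UDPKafka, not its actual code: a small bridge that reads datagrams off a socket and produces each one to Kafka. The port, broker, and topic are made up, the Kafka client shown is the current Java API rather than the 0.7-era one, and the real firehose is multicast, which would need a MulticastSocket and joinGroup() instead of the plain socket shown here:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Sketch: receive udp2log-style datagrams and forward each one to a Kafka topic.
    public class UdpToKafka {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-broker:9092");   // hypothetical broker
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
                 DatagramSocket socket = new DatagramSocket(8420)) {   // hypothetical port
                byte[] buf = new byte[65536];
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                while (true) {
                    packet.setLength(buf.length);   // reset before each receive
                    socket.receive(packet);
                    // copy out just the bytes of this datagram and send it as one message
                    byte[] payload = Arrays.copyOf(packet.getData(), packet.getLength());
                    producer.send(new ProducerRecord<>("webrequest", payload));  // hypothetical topic
                }
            }
        }
    }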
Storm Pipeline
The ideal pipeline is probably still the originally proposed architecture, which involves modifying the frontend production nodes as well as using Storm.
Native KafkaProducers -> Load balancer (LVS?) -> KafkaBroker -> Storm Kafka Spout -> Storm ETL -> Storm HDFS writer -> HDFS
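A rough sketch of the Storm half of that pipeline, wiring a Kafka spout to placeholder ETL and HDFS writer bolts. This uses the storm-kafka module's KafkaSpout/SpoutConfig API under the org.apache.storm packages (at the time the packages were still backtype.storm, so names would differ slightly); the bolt bodies, ZooKeeper address, and topic name are all hypothetical:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.kafka.KafkaSpout;
    import org.apache.storm.kafka.SpoutConfig;
    import org.apache.storm.kafka.StringScheme;
    import org.apache.storm.kafka.ZkHosts;
    import org.apache.storm.spout.SchemeAsMultiScheme;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    // Sketch: Kafka spout -> ETL bolt -> HDFS writer bolt.
    public class WebrequestTopology {

        // Placeholder ETL bolt: would parse/filter/geocode a raw log line.
        public static class EtlBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                String line = input.getString(0);
                collector.emit(new Values(line));   // pass through for now
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("line"));
            }
        }

        // Placeholder writer bolt: would batch lines and flush them to HDFS.
        public static class HdfsWriterBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                // TODO: buffer input.getString(0) and write batches to HDFS
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) { }
        }

        public static void main(String[] args) throws Exception {
            SpoutConfig spoutConfig = new SpoutConfig(
                    new ZkHosts("zookeeper:2181"),   // hypothetical ZooKeeper ensemble
                    "webrequest",                    // hypothetical topic
                    "/kafka-spout",                  // ZK root for spout offsets
                    "webrequest-topology");          // consumer id
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());  // emit log lines as Strings

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka", new KafkaSpout(spoutConfig), 4);
            builder.setBolt("etl", new EtlBolt(), 4).shuffleGrouping("kafka");
            builder.setBolt("hdfs", new HdfsWriterBolt(), 2).shuffleGrouping("etl");

            StormSubmitter.submitTopology("webrequest", new Config(), builder.createTopology());
        }
    }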
Links
- Kafka Spout - a Kafka consumer that emits Storm tuples.
- Storm State - Storm libraries for continually persisting a collection (map or list) to HDFS. I don't think this would fit our needs for writing to Hadoop, as I believe it wants to save serialized Java objects.
- HDFS API Docs for FileSystem.append()
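Since FileSystem.append() comes up as a possible way to keep adding lines to an existing HDFS file, here is a minimal sketch of the call (the path is hypothetical, and append support has to be enabled on the cluster):

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: append one line to an existing HDFS file.
    public class AppendToHdfs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/wmf/raw/webrequest/current.log");   // hypothetical existing file

            try (FSDataOutputStream out = fs.append(file)) {
                out.write("one more webrequest log line\n".getBytes(StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }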