This is a place to brainstorm ways to reliably get data into Kraken HDFS from the udp2log multicast firehose stream in the short term. We want data now!
This does not replace our desired final Kraken architecture, or the proposal outlined in Request Logging. This is meant to be a place to list the ways we have tried to get data into Kraken reliably, and other ways we have yet to try.
Pipeline Components
Our current goal is to get large UDP webrequest stream(s) into HDFS. There are a bunch of components we can use to build a pipeline to do so.
Sources / Producers
- udp2log
- Flume UDPSource (custom)
- Flume Spooling Directory Source
- KafkaProducer Shell (kafka-console-producer)
- Ori's UDPKafka
Agents / Buffers / Brokers
- Flume Memory Channel (volatile)
- Flume File Channel
- KafkaBroker
- plain old files
Sinks / Consumers
- Flume HDFS Sink
- kafka-hadoop-consumer (3rd party, has Zookeeper support)
- Kafka HadoopConsumer (ships with Kafka, no Zookeeper support)
- plain old cron jobs + hadoop fs -put
Possible Pipelines
udp2log -> KafkaProducer shell -> KafkaBroker -> kafka-hadoop-consumer -> HDFS
This is our main solution; it works most of the time, but drops data. udp2log and the producers are currently running on an03, an04, an05, and an06, and kafka-hadoop-consumer runs as a cron job on an02.
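As a sketch, the producer side of this pipeline can be expressed as a udp2log filter entry that pipes the stream into the console producer. The config path, ZooKeeper host, topic name, and producer flags below are illustrative only (flag names differ across Kafka versions):

```
# Sketch of a udp2log filter config entry (path is an example).
# "pipe 1" sends the unsampled stream to the given command's stdin.
# Older kafka-console-producer versions take --zookeeper; newer ones
# take --broker-list or --bootstrap-server instead.
pipe 1 /usr/bin/kafka-console-producer.sh --zookeeper zk1.example.org:2181 --topic webrequest
```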
Flume UDPSource -> HDFS
udp2log -> files + logrotate -> Flume SpoolingFileSource -> HDFS
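A minimal sketch of what the Flume NG agent configuration for this pipeline might look like, using the spooling directory source and a durable file channel. Agent and component names, local paths, and the HDFS URL are placeholders; note that the spooling directory source requires immutable files, so logrotate would have to move closed files into the spool directory rather than rotate them in place:

```
# flume.conf sketch: spooldir source -> file channel -> HDFS sink
agent.sources  = spool
agent.channels = fc
agent.sinks    = hdfssink

agent.sources.spool.type     = spooldir
agent.sources.spool.spoolDir = /var/spool/udp2log
agent.sources.spool.channels = fc

# File channel survives agent restarts, unlike the memory channel.
agent.channels.fc.type          = file
agent.channels.fc.checkpointDir = /var/lib/flume/checkpoint
agent.channels.fc.dataDirs      = /var/lib/flume/data

agent.sinks.hdfssink.type          = hdfs
agent.sinks.hdfssink.channel       = fc
agent.sinks.hdfssink.hdfs.path     = hdfs://namenode/wmf/raw/webrequest/%Y-%m-%d
agent.sinks.hdfssink.hdfs.fileType = DataStream
```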
udp2log -> files + logrotate -> cron hadoop fs -put -> HDFS
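A minimal sketch of the cron-job half of this pipeline, assuming logrotate's default numbered-suffix naming. File names, directories, and the HDFS destination are hypothetical examples; the point is to upload only rotated (closed) files and move them aside so the next run does not re-upload them:

```shell
#!/bin/sh
# Sketch of an hourly cron job: upload rotated udp2log files to HDFS,
# then archive them locally.  Paths and HDFS destination are examples.

# List rotated (closed) log files; the live webrequest.log is skipped
# because it is still being appended to by udp2log.
list_rotated() {
    for f in "$1"/webrequest.log.*; do
        [ -e "$f" ] && echo "$f"
    done
}

# Upload each rotated file, then move it into a done/ directory so it
# is not uploaded twice.
upload_rotated() {
    log_dir="$1"; done_dir="$2"; hdfs_dest="$3"
    mkdir -p "$done_dir"
    list_rotated "$log_dir" | while read -r f; do
        hadoop fs -put "$f" "$hdfs_dest/" && mv "$f" "$done_dir/"
    done
}

# Example crontab entry (sketch):
#   5 * * * *  /usr/local/bin/upload_rotated.sh
```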
UDPKafka -> KafkaBroker -> kafka-hadoop-consumer -> HDFS
Storm Pipeline
The ideal pipeline is probably still the originally proposed architecture, which involves modifying the frontend production nodes and using Storm.
Native KafkaProducers -> Loadbalancer (LVS?) -> KafkaBroker -> Storm Kafka Spout -> Storm ETL -> Storm HDFS writer -> HDFS
Links
- Kafka Spout - a Kafka consumer that emits Storm tuples.
- Storm State - Storm libraries for continually persisting a collection (map or list) to HDFS. I don't think this would fit our needs for writing to Hadoop, as I believe it wants to save serialized Java objects.
- HDFS API Docs for FileSystem.append()