Analytics/Kraken/Hadoop Tools

This page is meant as a bucket for tips and notes on using the Hadoop toolchain.


Pig

Pig is a dataflow language well suited to analyzing unstructured data and converting it into a regular structure.

Best Practices

  • Push filters up even if you have to reprocess fields.
  • Drop all unneeded fields as soon as you can.
  • Syntax-check your script locally; use DESCRIBE to understand data shape.
  • Don't explicitly set PARALLEL unless you have a reason. It only affects reduce operations, and Pig already picks a reducer count using heuristics based on data size, so you are unlikely to help performance. Forcing PARALLEL higher than necessary uses up more slots on the cluster and increases job overhead.
  • Don't Be Afraid To:
    • ...Run stupid, half-broken jobs from a grunt shell just to see the data shape or test your UDF-tinkertoys. (Just pls save to your home dir.)
    • ...DUMP after a complex statement to check output shape.
    • ...Use Hive to check your data; its aggregation and filtering features are way more advanced and SQL is easy.
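
For context on the PARALLEL advice above, here is a rough sketch of the size-based estimate Pig falls back on when PARALLEL is not set. It assumes the default estimator and the settings `pig.exec.reducers.bytes.per.reducer` (about 1 GB) and `pig.exec.reducers.max` (999); the function name is hypothetical and the defaults may differ in your Pig version.

```python
import math

# Assumed defaults; check your Pig version's documentation before relying on them.
BYTES_PER_REDUCER = 1_000_000_000  # pig.exec.reducers.bytes.per.reducer
MAX_REDUCERS = 999                 # pig.exec.reducers.max


def estimate_reducers(input_bytes, bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Approximate the number of reducers Pig would pick without PARALLEL."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    # At least one reducer, never more than the configured cap.
    return max(1, min(max_reducers, wanted))
```

Under these assumptions a 2.5 GB input gets 3 reducers, so a hand-set PARALLEL 50 on that job would mostly add overhead.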

Reuse & Metaprogramming

  • Macros aren't really all that interesting as they're severely limited in scope.
  • Be careful with exec and run -- they're more flexible and powerful, but very expensive, as every call spawns more MR jobs!
  • Parameters are inserted via literal string substitution, which lets you do some pretty wacky metaprogramming via workflows. This means you basically have meta-macros. See rollup.pig as an example.
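
To make "literal string substitution" concrete, here is a small Python sketch of the idea. It is an analogy, not Pig's actual preprocessor: the script text, the parameter names, and the `render` helper are all hypothetical.

```python
from string import Template

# Hypothetical Pig script template; $-names stand in for Pig parameters.
SCRIPT = Template(
    "rows = LOAD '$input' AS ($fields);\n"
    "out = FOREACH rows GENERATE $projection;\n"
    "STORE out INTO '$output';\n"
)


def render(params):
    """Splice parameter values into the script text, verbatim."""
    # substitute() raises KeyError on a missing parameter, so mistakes fail early.
    return SCRIPT.substitute(params)


print(render({
    "input": "/tmp/in",
    "fields": "a:int, b:chararray",
    "projection": "a, UPPER(b)",
    "output": "/tmp/out",
}))
```

Because the values are pasted in as plain text, a single parameter can carry a whole field list or expression. That is what makes the meta-macro trick work, and also why quoting inside parameter values gets painful.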

Gotchas

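  • A MATCHES regular expression must match the whole input string, not just part of it. To search within a field, wrap the pattern in .* on both sides.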
  • Don't expect relations to stay sorted -- derived relations after a reduce-op (namely, GROUP) need re-ordering!
  • Implicit coercion from bag of single-tuple to scalar ONLY works for relations! -- it does NOT work for grouped-records or other actual bags
  • Escaping quotes in a parameter is impossible. I swear it. It's worse than several layers of `ssh box -- bash -lc "eval '$WAT'"`. I gave up; no combination of backslashes and quotes made any difference.
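
The MATCHES gotcha above is the difference between a whole-string match and a substring search. The sketch below shows that difference in Python rather than Pig. The helper names are hypothetical, and Java's regex dialect (which Pig uses) differs from Python's in some details.

```python
import re


def pig_style_matches(value, pattern):
    """True only if the pattern covers the entire string, like Pig's MATCHES."""
    return re.fullmatch(pattern, value) is not None


def substring_matches(value, pattern):
    """True if the pattern occurs anywhere, which is what people often expect."""
    return re.search(pattern, value) is not None


url = "/wiki/Main_Page"
print(pig_style_matches(url, "wiki"))      # whole-string match fails
print(substring_matches(url, "wiki"))      # substring search succeeds
print(pig_style_matches(url, ".*wiki.*"))  # padded pattern matches the whole string
```

In Pig terms, a filter on `url MATCHES 'wiki'` keeps only rows whose value is exactly `wiki`, while `url MATCHES '.*wiki.*'` keeps any row containing it.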


Oozie

Best Practices

  • Test each layer, work outward; I always make a $JOB-wf.properties to test the workflow alone before moving on to the coord (with a test-$JOB-coord.properties and a $JOB-coord.properties).
  • Every workflow and coordinator can formally declare its parameters using a <parameters> block at the beginning. DO IT, and avoid pointless defaults -- better to fail early.
  • Check the xmlns on your root element!
    • Coordinators: xmlns="uri:oozie:coordinator:0.4"
    • Workflows: xmlns="uri:oozie:workflow:0.4"
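
The xmlns check above is easy to automate before submitting a job. This is a minimal sketch using only the Python standard library. It assumes the root elements are named `workflow-app` and `coordinator-app`, and the `check_root_xmlns` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces quoted in the list above, keyed by root element name.
EXPECTED_XMLNS = {
    "workflow-app": "uri:oozie:workflow:0.4",
    "coordinator-app": "uri:oozie:coordinator:0.4",
}


def check_root_xmlns(xml_text):
    """Return (ok, local_name, namespace) for the document's root element."""
    root = ET.fromstring(xml_text)
    # ElementTree renders a namespaced tag as "{namespace}local-name".
    if root.tag.startswith("{"):
        namespace, local_name = root.tag[1:].split("}", 1)
    else:
        namespace, local_name = "", root.tag
    return EXPECTED_XMLNS.get(local_name) == namespace, local_name, namespace
```

A coordinator that was copy-pasted from a workflow and still carries the workflow namespace comes back with `ok` set to False, which is much easier to read than the error Oozie gives you.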

Workflows

  • Know your <action> options: control flow, sub-workflow, fs, shell, java, streaming all have uses!
  • Sub-workflows are like functions -- compose and reuse!
  • <prepare> should probably be in all our jobs -- delete the output dir before starting work to ensure you don't pointlessly fail due to temporary cruft.
  • <global> allows you to set properties for all actions. All jobs should set <job-tracker> and <name-node> here.
  • job.xml(s) -- they cascade -- will be useful once we start profiling and tuning jobs. Save those tweaks together as job-confs for similarly structured jobs to reuse!

Coordinators

  • <dataset> initial instances should always predate the job. This only restricts the possible valid results; it doesn't dictate anything about where the job starts.
  • Always create coordinator parameters for jobStart, jobEnd, jobName, jobQueue! This lets you easily fire off the job as a backfill in a different queue, or as one-off instances of the job, etc.
  • datasets.xml lets you share <dataset> definitions. It's worth investigating as the number of jobs grows.
  • Chaining datasets between coordinators is fussy. I haven't seen it worth the energy so far.
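
As a small illustration of why the four coordinator parameters pay off, the sketch below renders two properties files that differ only in those values. The dates, job names, and queue names are made up, and `render_properties` is a hypothetical helper.

```python
def render_properties(job_start, job_end, job_name, job_queue):
    """Render the four coordinator parameters as .properties text."""
    values = {
        "jobStart": job_start,
        "jobEnd": job_end,
        "jobName": job_name,
        "jobQueue": job_queue,
    }
    return "".join(f"{key}={value}\n" for key, value in values.items())


# Hypothetical regular run and backfill; only the parameter values differ.
regular = render_properties(
    "2013-06-01T00:00Z", "2014-06-01T00:00Z", "rollup", "default")
backfill = render_properties(
    "2013-01-01T00:00Z", "2013-06-01T00:00Z", "rollup-backfill", "backfill")
print(backfill)
```

The coordinator XML stays untouched; a backfill is just another properties file.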

Gotchas

  • Some workflow action elements are order-sensitive (!!). Ex: <configuration> must come before <script> in a <pig> action, and yes, the error message is oblique and unhelpful.
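
Since the error message for the ordering problem above is unhelpful, a quick local check can save a round trip. This sketch only covers the one case mentioned here. It assumes the workflow 0.4 namespace and that the Pig action is written as a `<pig>` element inside `<action>`; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

WORKFLOW_NS = "uri:oozie:workflow:0.4"


def misordered_pig_actions(xml_text):
    """Return names of actions whose <pig> has <script> before <configuration>."""
    root = ET.fromstring(xml_text)
    bad = []
    for action in root.iter(f"{{{WORKFLOW_NS}}}action"):
        pig = action.find(f"{{{WORKFLOW_NS}}}pig")
        if pig is None:
            continue
        # Child tags in document order, with the namespace prefix stripped.
        children = [child.tag.split("}", 1)[-1] for child in pig]
        if "configuration" in children and "script" in children:
            if children.index("script") < children.index("configuration"):
                bad.append(action.get("name"))
    return bad
```

An empty list means no `<pig>` block in the file has the two elements swapped. It says nothing about other ordering rules in the schema.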


Hadoop

  • All jobs keep performance counters and stats. These can be extremely helpful for finding where a job spends its time and improving its speed.
  • Be familiar with the hdfs shell tool -- it's a lot more expressive than you might expect.


Tutorials and Guides
