Analytics/Archive/Infrastructure/Access

This page is obsolete and has been moved to wikitech at https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Access


How to access Kraken and crunch your wildest numbers

As of December 2012, Hadoop is up and running on 10 fresh and clean Analytics nodes. Come on over and start counting beans!

If you have a shell account, you can ssh into analytics1001.wikimedia.org and use the Hadoop CLI. But did you know? There is a web interface!
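
If you go the shell route, a minimal session looks something like this sketch (yourusername is a placeholder for your own shell account name, and the HDFS path is just an example):

 # SSH to the Hadoop access node.
 ssh yourusername@analytics1001.wikimedia.org
 # Once logged in, the standard Hadoop CLI is on the path, e.g. to list an HDFS directory:
 hadoop fs -ls /user/yourusername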

Hadoop Web UI

All of the Kraken web interfaces are hosted on hosts that are only reachable internally. analytics1001 is set up as a reverse proxy to allow access to these hosts. You will have to modify your /etc/hosts file so that you can address each of the services by name (we don't have any public DNS set up yet).

You will be prompted for HTTP authentication credentials if you are not in the WMF office. Ask otto if you need access and don't have this information.

NOTE: The following access instructions are subject to change at any time.


Name Based Proxy

Open up your /etc/hosts file and add this line:

 208.80.154.154 analytics.wikimedia.org namenode.analytics.wikimedia.org jobs.analytics.wikimedia.org history.analytics.wikimedia.org oozie.analytics.wikimedia.org hue.analytics.wikimedia.org storm.analytics.wikimedia.org

This aliases a bunch of hostnames to analytics1001's IP so that the internal proxy rules can figure out which host and port you are actually trying to reach.
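
If you want to double-check that the new aliases resolve before opening a browser, a quick sanity check from a terminal looks like this (getent is Linux-only; ping works anywhere):

 # Each alias should resolve to analytics1001's IP (208.80.154.154).
 getent hosts hue.analytics.wikimedia.org
 ping -c 1 namenode.analytics.wikimedia.org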

Now head on over to http://analytics1001.wikimedia.org/. If you are not in the WMF office, you will be prompted for an HTTP auth password. Ask otto for the password. The links there should guide you to the webservices you are looking for. Hue will probably be most useful for you at first.
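
If you would rather test the proxy from the command line first, a curl request like the one below should come back with an HTTP response once your credentials are accepted (the username is a placeholder; curl will prompt for the password):

 # -I fetches only the response headers; -u sends the HTTP auth username.
 curl -I -u yourusername http://analytics1001.wikimedia.org/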


Browser Configured Proxy

NOTE: This method is disabled due to security concerns.

Open up your browser preferences and configure these HTTP proxy settings:

 Host: analytics1001.wikimedia.org
 Port: 8085

If you are using FoxyProxy, this set of whitelist regexes will treat you nicely:

 ^https?://analytics.*\.eqiad\.wmnet.*
 ^https?://analytics10\d\d(:\d+)?(/.+)?$
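
If you are curious which URLs those patterns will actually catch, you can run candidates through grep as a quick sanity check (this assumes GNU grep, since the patterns use Perl-style \d):

 # Should match (grep prints the URL):
 echo 'http://analytics1010:8088/cluster' | grep -P '^https?://analytics10\d\d(:\d+)?(/.+)?$'
 # Should not match (no output):
 echo 'http://example.org/' | grep -P '^https?://analytics10\d\d(:\d+)?(/.+)?$'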

Once that's done, navigate on over to http://analytics1001.wikimedia.org. The links there should guide you to the webservices you are looking for. Hue will probably be most useful for you at first.

Oh, and use the (internal) links on that page, not the main ones. The main links point directly to the internal machine names.


Hue

Hue is a general-purpose web interface built for the Hadoop ecosystem. Use Hue if you want to easily run and schedule Pig and Hive jobs.

Hue is currently configured to use the Labs LDAP instance. You should be able to log in with your LabsConsole credentials.

Tutorial

There's a great Pig starter tutorial over at Analytics/Kraken/Tutorial. That's a good place to start if you want to try your hand at crunching data using Kraken. We'll add more tutorials there as we gain more experience.