Tuesday, October 22, 2013

Introducing Teuta, a laughably simple dependency injection container in Clojure

It was the year 2002 when I tried my first dependency injection container in Java (these were mostly called Inversion-of-Control containers back then). It was one of the Apache Avalon subprojects, namely Fortress (alongside ECM, Merlin and some others). Before that, I designed my applications in whatever custom way I saw fit, and sometimes there wasn't much design at all, so that moment really felt enlightening. I know it sounds silly now because these containers are so common in all mainstream languages, but back then it really took the quality of my apps to a whole new level, and I felt I could comprehend my code much more easily.

Now it's 2013, and destiny has taken me to the Clojure language. I'm still fresh to it, but I have noticed there isn't much information around about structuring applications, as if namespaces and the vars contained in them were sufficient for everything. If there weren't a few presentations from Stuart Sierra or the Prismatic team, I would probably go on thinking it must be an issue with my OO legacy. Fortunately, after those talks, I could see there is a real need for some kind of componentization, and although there are libraries out there such as Prismatic Graph or Jig, they are somewhat different from what Java programmers are used to, so I decided to write my own, especially because it's such a dead-simple idea. The final result is a small GitHub project called Teuta.

Library Dependencies

Add the necessary dependency to your Leiningen project.clj and require the library in your ns:
[vmarcinko/teuta "0.1.0"] ; project.clj

(ns my-app (:require [vmarcinko.teuta :as teuta])) ; ns

Container Specification

Anyway, to create a component container, we have to start by defining a specification, which is simply a map of [component-id component-specification] entries. A component ID is usually a keyword, though a string or some other value can be used. A component specification is a vector of [component-factory-fn & args], so the component can be constructed later, during container construction, by evaluating the factory function with the given arguments. So you see, this is just an ordinary function, and a component can be constructed in any arbitrary way, though perhaps the most usual way is to use records and their map factory functions, which are very descriptive. If a component depends upon some other component, it should be configured to use it. Referring to other components is done via
(teuta/comp-ref some-comp-id)
If components form circular dependencies, an exception will be thrown during container construction. Similarly, if we want to parametrize some piece of component configuration, we simply do that via:
(teuta/param-ref some-param-id-path)
So, a specification would look something like:
{:my-comp-1 [mycompany.myapp/map->MyComp1Record 
             {:my-prop-1  "Some string"
              :my-prop-2  334
              :my-prop-3  (teuta/param-ref :comp-1-settings :some-remote-URL)
              :comp2-prop (teuta/comp-ref :my-comp-2)}]
 :my-comp-2 [mycompany.myapp/map->MyComp2Record 
             {:my-prop-1 6161
              :my-prop-2 (atom nil)
              :my-prop-3 (teuta/param-ref :comp-2-settings :admin-email)}]}
Since the whole specification is simply a regular map, it is useful to have a common map containing the always-present components, plus separate profile-specific maps with components for production, test, development... That way you simply merge those maps together to construct the desired final specification.

Container Construction

Once we have our specification, we can simply create a container by calling
(def my-container (teuta/create-container my-specification my-parameters))
The container is just a sorted map of [component-id component] entries. To make the printed container map a bit clearer, referred components are printed as << component some-comp-id >>.

Since the whole application state is also contained in this container map, it plays nicely with Stuart Sierra's "reloaded" workflow.

Component Lifecycle

If a component's functions depend upon some side-effecting logic being executed prior to using them, the component can implement the vmarcinko.teuta/Lifecycle protocol. The protocol defines start and stop functions, which get called when the container is started and stopped.
(defprotocol Lifecycle
  (start [this] "Starts the component. Returns nil.")
  (stop [this] "Stops the component. Returns nil."))
The container is started by:
(teuta/start-container my-container)
Components are started in dependency order. If any component throws an exception during startup, the container will automatically stop all components that have already been started and rethrow the exception afterwards. Likewise, stopping the container is done via:
(teuta/stop-container my-container)
If any component throws an exception during this process, the exception will be logged and the process will continue with the remaining components.

Example

Here we define two components - a divider and an alarmer.

The divider takes two numbers and returns the result of their division. Let's define the working interface of the component as a protocol, so we can allow many implementations.
(ns vmarcinko.teutaexample.divider)

(defprotocol Divider
  (divide [this n1 n2] 
  "Divides 2 numbers and returns vector [:ok result]. 
  In case of error, [:error \"Some error description\"] will be returned"))
Unlike this example, component interfaces will mostly contain multiple related functions. Request-handler components, such as web handlers, usually don't have a working interface since we don't "pull" them for functionality; they just need to be started and stopped by the container, and thus implement the Lifecycle protocol. The default implementation of our divider component will naturally return the result of dividing the numbers, but in case of division by zero, it will also send a notification to the alarmer component (by calling vmarcinko.teutaexample.alarmer/raise-alarm). Placing the component implementation in a separate namespace is just a nice way of separating the component interface from its implementation.
(ns vmarcinko.teutaexample.divider-impl
  (:require [vmarcinko.teutaexample.alarmer :as alarmer]
            [vmarcinko.teutaexample.divider :as divider]
            [vmarcinko.teuta :as teuta]))

(defrecord DefaultDividerImpl [alarmer division-by-zero-alarm-text]
  divider/Divider
  (divide [_ n1 n2]
    (if (= n2 0)
      (do
        (alarmer/raise-alarm alarmer division-by-zero-alarm-text)
        [:error "Division by zero error"])
      [:ok (/ n1 n2)])))
The alarmer is defined as follows:
(ns vmarcinko.teutaexample.alarmer)

(defprotocol Alarmer
  (raise-alarm [this description] "Raise alarm about some issue. Returns nil."))
The default implementation of the alarmer "sends" alarm notifications to preconfigured email addresses. For this example, sending an email just prints the message to stdout. It also prints the alarm count, which is the mutable state of this component, held in an atom passed to it during construction. The atom's state is initialized and cleaned up during the lifecycle phases - start and stop.
(ns vmarcinko.teutaexample.alarmer-impl
  (:require [vmarcinko.teutaexample.alarmer :as alarmer]
            [vmarcinko.teuta :as teuta]))

(defrecord DefaultAlarmerImpl [notification-emails alarm-count]
  alarmer/Alarmer
  (raise-alarm [_ description]
    (let [new-alarm-count (swap! alarm-count inc)]
      (println (str "Alarm Nr." new-alarm-count " raised: '" description "'; notifying emails: " notification-emails))))
  teuta/Lifecycle
  (start [_]
    (reset! alarm-count 0))
  (stop [_]
    (reset! alarm-count nil)))
So let's finally create the container specification and wire these two components together. We will also extract the alarmer email addresses as application parameters.
(def my-parameters {:alarmer-settings {:emails ["admin1@mycompany.com" "admin2@mycompany.com"]}})

(def my-specification
  {:my-divider [vmarcinko.teutaexample.divider-impl/map->DefaultDividerImpl
                {:alarmer                       (teuta/comp-ref :my-alarmer)
                 :division-by-zero-alarm-text   "Arghhh, somebody tried to divide with zero!"}]

   :my-alarmer [vmarcinko.teutaexample.alarmer-impl/map->DefaultAlarmerImpl
                {:notification-emails   (teuta/param-ref :alarmer-settings :emails)
                 :alarm-count           (atom nil)}]})
Now we can construct the container, start it and try dividing two numbers via the divider component.
(def my-container (teuta/create-container my-specification my-parameters))

(teuta/start-container my-container)

(vmarcinko.teutaexample.divider/divide (:my-divider my-container) 3 44)
=> [:ok 3/44]

(vmarcinko.teutaexample.divider/divide (:my-divider my-container) 3 0)
=> Alarm Nr.1 raised: 'Arghhh, somebody tried to divide with zero!'; notifying emails: ["admin1@mycompany.com" "admin2@mycompany.com"]
=> [:error "Division by zero error"]
In order to call the vmarcinko.teutaexample.divider/divide function "from outside", we first needed to pick the divider component from the container. But if the request-handling piece of the application is also a component in the container, as could be the case with a web handler serving HTTP requests to our vmarcinko.teutaexample.divider/divide function, then the container specification takes care of wiring in the specified divider component. Let's create such a web handler component using the popular Jetty web server:
(ns vmarcinko.teutaexample.web-handler
  (:require [ring.adapter.jetty :as jetty]
            [vmarcinko.teuta :as teuta]
            [ring.middleware.params :as ring-params]
            [vmarcinko.teutaexample.divider :as divider]))

(defn- create-handler [divider]
  (fn [request]
    (let [num1 (Integer/parseInt ((:params request) "arg1"))
          num2 (Integer/parseInt ((:params request) "arg2"))
          result (nth (divider/divide divider num1 num2) 1)]
      {:status 200
       :headers {"Content-Type" "text/html"}
       :body (str "<h1>Result of dividing " num1 " with " num2 " is: " result " </h1>")})))

(defn- ignore-favicon [handler]
  (fn [request]
    (when-not (= (:uri request) "/favicon.ico")
      (handler request))))

(defrecord DefaultWebHandler [port divider server]
  teuta/Lifecycle
  (start [this]
    (reset! server
      (let [handler (->> (create-handler divider)
                         ring-params/wrap-params
                         ignore-favicon)]
        (jetty/run-jetty handler {:port port :join? false}))))
  (stop [this]
    (.stop @server)
    (reset! server nil)))
The Jetty server is held in an atom and is started on the configured port during the lifecycle start phase. As can be seen, the divider component is the only dependency of this component, and the request URL parameters "arg1" and "arg2" are passed as arguments to the vmarcinko.teutaexample.divider/divide function. We also added a handler that ignores favicon requests, to simplify testing via a browser. This component requires the popular Ring library, so one needs to add it to project.clj as:
:dependencies [[ring/ring-core "1.2.0"]
               [ring/ring-jetty-adapter "1.2.0"]
               ...
Let's expand our specification to wire this new component.
(def my-parameters { ...previous parameters ...
                    :web-handler-settings {:port 3500}})

(def my-specification
  { ....previous components ....
   :my-web-handler [vmarcinko.teutaexample.web-handler/map->DefaultWebHandler
                    {:port (teuta/param-ref :web-handler-settings :port)
                     :divider (teuta/comp-ref :my-divider)
                     :server (atom nil)}]})
Now, after the container has been started, we can try out an HTTP request:
http://localhost:3500?arg1=3&arg2=44
The division result should be returned as an HTML response. Division by zero should print an alarming message to the REPL output.

Wednesday, October 2, 2013

Neo4j model for (SQL) dummies

In general, one characteristic of the mind is that it has a hard time grasping new concepts if they are presented without comparison to familiar ones. I experienced that when trying to explain the Neo4j data model to people who were stumbling upon it for the first time. Mostly they are confused by the lack of schema, because when visualized, those scattered graph nodes, connected into some kind of spider web, bring confusion to minds long accustomed to the nicely ordered rectangular SQL world.

So what seemed to work better in this case is to describe it using the all too familiar RDBMS/SQL model and its elements: tables, columns, records, foreign keys... In other words, let's try to describe the Neo4j graph model as it would look if it were built on top of the SQL model.

Actually, this is quite easy to do. We just need two tables; let's call them "NODES" and "RELATIONSHIPS". They reflect the two main elements of the Neo4j model - graph nodes and the relationships between them.

"NODES" table

This one is where entities are stored, and it contains two columns - "ID" and "PROPERTIES".

ID   PROPERTIES
334  {"name": "John Doe", "age": 31, "salary": 80000}
335  {"name": "ACME Inc.", "address": "Broadway 345, New York City, NY"}
336  {"manufacturer": "Toyota", "model": "Corolla", "year": 2005}
337  {"name": "Annie Doe", "age": 30, "salary": 82000}

The PROPERTIES column stores a map-like data structure containing arbitrary properties with their values. Just for the purpose of presentation, I picked JSON serialization here. So you see, due to this schema-less design, there are no constraints upon which properties are contained in the PROPERTIES column - which is actually the only practical/possible way, since all entity types (department, company, employee, vehicle...) are stored in this single table.

"RELATIONSHIPS" table

This table contains the "ID", "NAME", "SOURCE_NODE_ID", "TARGET_NODE_ID" and "PROPERTIES" columns, and its purpose is to store associations between nodes. We can say that the records stored here represent a schema-less version of SQL foreign keys.

ID   NAME        SOURCE_NODE_ID  TARGET_NODE_ID  PROPERTIES
191  MARRIED_TO  334             337             {"wedding_date": "20070213"}
192  OWNS        337             336
193  WORKS_FOR   337             335             {"job-position": "IT manager"}

A relationship's NAME marks its "type", and we can add new association "types" to the system dynamically, just by storing new relationship records with previously non-existing names, whereas in an SQL database we need to pre-define the available foreign keys upfront.

Since relationships usually have a direction (though in Neo4j they can also be traversed as if they were bidirectional), we have the "SOURCE_NODE_ID" and "TARGET_NODE_ID" foreign keys, pointing to the respective NODES records. The direction is mainly valuable for its semantic purpose.

Similar to the NODES table, here we also have a PROPERTIES column to store additional information about the association - in the SQL world, we would need to introduce a "link" table to store this kind of data.

Recap

Having no schema brings a well-known trade-off to the table. On one hand, the structure of such a system is less obvious, and special care has to be taken not to corrupt the data; on the other hand, the flexibility gained can be exploited in domains that are rich and rapidly changing. And of course, since no constraints are imposed by the database here, the application is now solely responsible for the correctness of the stored data.

Monday, September 16, 2013

Referencing non-indexed Neo4j entities in service layer

"Idiomatic" way to index entities in Neo4j is to do that only on few types of them, usually the ones that are most often used, or for some reason are the most practical to be accessed directly. Of course, top level entities (such as Company or User in some business domains) just have to be indexed since they cannot be fetched via some other entity.

So, let's say we have two types of entities - Company and Department - and they are in a one-to-many relationship. Company would have to be indexed, but Department would not, because it can be reached by traversing from its parent Company. This fetching via traversal is actually one of the best selling points of Neo4j, because the speed of that operation generally doesn't depend upon the size of the whole dataset, unlike SQL databases, which have to JOIN different tables - an operation that involves their indexes and whose performance ultimately depends upon table size.

Anyway, all seems good, but it can have some impact on your service layer.

Until now, when you had your Department entities indexed, you had a service layer operation with only one argument needed to reference the entity:

 public interface DepartmentManager {  
  void activateDepartment(UUID departmentUuid);  
 ...  
 }  

And now we must introduce another argument to identify the parent Company, so we can traverse the graph to the Department in question.

 public interface DepartmentManager {  
  void activateDepartment(UUID companyUuid, UUID departmentUuid);  
 ...  
 }  
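
To make that concrete, here is a rough sketch of how such an implementation could traverse from the indexed Company to its non-indexed Departments using Neo4j's embedded Java API. The index name "companies", the "uuid" and "active" properties and the HAS_DEPARTMENT relationship type are assumptions made up for illustration, not something prescribed by Neo4j:

 import java.util.UUID;
 import org.neo4j.graphdb.Direction;
 import org.neo4j.graphdb.DynamicRelationshipType;
 import org.neo4j.graphdb.GraphDatabaseService;
 import org.neo4j.graphdb.Node;
 import org.neo4j.graphdb.Relationship;
 import org.neo4j.graphdb.Transaction;

 public class DefaultDepartmentManager implements DepartmentManager {

   private final GraphDatabaseService graphDb;

   public DefaultDepartmentManager(GraphDatabaseService graphDb) {
     this.graphDb = graphDb;
   }

   public void activateDepartment(UUID companyUuid, UUID departmentUuid) {
     Transaction tx = graphDb.beginTx();
     try {
       // look up the indexed Company node first ...
       Node company = graphDb.index().forNodes("companies")
           .get("uuid", companyUuid.toString()).getSingle();
       // ... then traverse to its (non-indexed) Department nodes
       for (Relationship rel : company.getRelationships(
           Direction.OUTGOING, DynamicRelationshipType.withName("HAS_DEPARTMENT"))) {
         Node department = rel.getEndNode();
         if (departmentUuid.toString().equals(department.getProperty("uuid"))) {
           department.setProperty("active", true);
         }
       }
       tx.success();
     } finally {
       tx.finish();
     }
   }
 }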

Of course, one can argue that we could index Department entities as well to simplify accessing them, but that same reasoning can lead us to index almost all the entity types we want to operate on at the service layer, and we surely want to avoid that, for the reasons described at the beginning of this post.

Friday, September 6, 2013

Neo4j and beauty of role-based entity referencing

Polyglot persistence is all the rage now, and one of the more exotic types of databases around is the graph DB, so we decided to give it a shot for a part of a larger system. We picked Neo4j. Even if we were not a Java shop, we would probably have stumbled upon it anyway, since it definitely looks like the most popular graph database right now.

After working with it for some time, I noticed a thing I really like - the object-graph mismatch is much lower than the object-relational one. Although there are numerous places where the object-relational mismatch shows its face, one of the things that bothered me the most is that I always had to take good care of which type/role I would reference some object with.

In Java land, even with as poor a meta-model as it has (compared to some other more exotic languages out there), we can reference one entity from another in many ways - using a class, a subclass or an interface. And we all know that one of the great principles of good OO design is to reference objects by their role, which can be expressed in any of the mentioned language constructs. An interface, if sufficient, is usually the preferred way to express an object's role.

Here's an example ...

Let's say we have a User class that holds a reference to its owner entity, described by the UserOwner interface. This UserOwner interface is the role that the owner entity plays in this case.

 public class User {  
   private String name;  
   private UserOwner owner;  
 ....  
 }  

And let's say this UserOwner role can be played by multiple different entities - a company and a department. Let's even say that users themselves can be owners of other users. If we were to express this in Java, we would implement the UserOwner interface in many classes:

 public class User implements UserOwner {  
 ...  
 }  
 public class Company implements UserOwner {  
 ...  
 }  
 public class Department implements UserOwner {  
 ...  
 }  

So how would we map this case to the SQL world? We would have a USERS table, but also COMPANIES and DEPARTMENTS tables.

And to express the reference from a user to its owner, we would need a foreign key that points from the USERS table to... to... to what? We don't have a concept in the SQL world that would "mark" the USERS, DEPARTMENTS and COMPANIES tables as belonging to some "USER_OWNERS" type, so that we could define the foreign key against that target. The problem is that the SQL meta-model is still much poorer than the OO meta-model, and it doesn't have a concept of supertables (for hierarchies of tables), or some other concept that would mark records as being of multiple types.

In Neo4j it is straightforward - unlike an SQL database, it is schema-less, so we don't burden ourselves with types; we just have a special relationship type (named, let's say, "BELONGS_TO") that corresponds to the association between a User and its UserOwner.
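
For illustration, with the embedded Java API, wiring a user to its owner then looks the same regardless of what kind of node the owner is. This is just a sketch; the node variables are assumed to be obtained elsewhere:

 import org.neo4j.graphdb.DynamicRelationshipType;
 import org.neo4j.graphdb.GraphDatabaseService;
 import org.neo4j.graphdb.Node;
 import org.neo4j.graphdb.RelationshipType;
 import org.neo4j.graphdb.Transaction;

 public class UserOwnerLinker {

   private static final RelationshipType BELONGS_TO =
       DynamicRelationshipType.withName("BELONGS_TO");

   // ownerNode may represent a company, a department or another user -
   // the relationship is created the same way in all cases
   public void linkUserToOwner(GraphDatabaseService graphDb, Node userNode, Node ownerNode) {
     Transaction tx = graphDb.beginTx();
     try {
       userNode.createRelationshipTo(ownerNode, BELONGS_TO);
       tx.success();
     } finally {
       tx.finish();
     }
   }
 }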

[Diagram: several User nodes, each pointing to its owner - a Company, a Department or another User - via a BELONGS_TO relationship.]

You see how different users reference their owners via the BELONGS_TO relationship, regardless of whether that entity is a company, a department or another user. Now you can write simple Cypher queries, such as this one, which without any fuss fetches the owner of some entity:

 START user=node(<someUserId>) MATCH user-[:BELONGS_TO]->owner RETURN owner;  

In the application layer, we would cast the result of that query to a UserOwner object and do with it whatever that role allows us (via the methods on that interface).
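
As a rough sketch of that last step, using the Cypher ExecutionEngine from the embedded Java API: the toEntity helper below is purely a placeholder for whatever node-to-object mapping the application uses (manual mapping, Spring Data Neo4j, etc.):

 import java.util.Collections;
 import java.util.Iterator;
 import org.neo4j.cypher.javacompat.ExecutionEngine;
 import org.neo4j.cypher.javacompat.ExecutionResult;
 import org.neo4j.graphdb.GraphDatabaseService;
 import org.neo4j.graphdb.Node;

 public class UserOwnerFinder {

   private final ExecutionEngine engine;

   public UserOwnerFinder(GraphDatabaseService graphDb) {
     this.engine = new ExecutionEngine(graphDb);
   }

   public UserOwner findOwnerOfUser(long userNodeId) {
     ExecutionResult result = engine.execute(
         "START user=node({id}) MATCH user-[:BELONGS_TO]->owner RETURN owner",
         Collections.<String, Object>singletonMap("id", userNodeId));
     Iterator<Node> owners = result.columnAs("owner");
     // the caller only ever sees the UserOwner role, never the concrete class
     return (UserOwner) toEntity(owners.next());
   }

   private Object toEntity(Node ownerNode) {
     // placeholder - the node-to-object mapping would go here
     throw new UnsupportedOperationException("mapping not shown");
   }
 }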

Sweet!

Tuesday, July 2, 2013

Gradle thinks overloading is so cool

In Java, it all started with Ant. Then came Maven, but strangely, somehow I always ended up working on projects that didn't use it. And finally we have Gradle.

I am by no means an expert in Java build tools, so maybe there are some additional ones out there, but I'm pretty sure these are the three crucial ones. Anyway, my coworkers and I heard many praises of Gradle, so after using Ant (+Ivy) for many years, we picked it as our next build tool.

We've been using it for some time throughout the company, and although I haven't been primarily dedicated to it, every now and then I had to change something in the scripts. Gradle is definitely a big improvement over Ant, which is too low-level and doesn't provide any of the higher-level concepts so frequently needed in (Java) development.

But, I immediately noticed something which bothered me...

I don't know if this is somehow caused by the Groovy language itself, which I never studied, but Gradle has an overly large number of ways to do the same thing. A naive person would think this is a good thing, because if you have objections to one way of doing things, you can always pick another. Really, my friend, is that so?

The point is to expose the smallest possible number of paths to a goal, because a developer cannot remember and master too many of them, and at the same time to give maximum possible power through them. Some ingenuity is required to develop such a technology, of course, and none is perfect, so I don't mind compromises, but I have noticed that people often like to talk only about the large number of features, as if there were no drawback in that.

One could say - well, nobody forces you to remember every way of doing something. That's simply not true. Over any proper span of time, you will have to work on a coworker's code sooner or later, and if he has chosen some other style, you are forced to learn his way. Eventually, you will have to learn most of the approaches to feel comfortable working with the tool. It's not just about your coworkers - when we learn new stuff, or stumble upon some problem, the preferred way of doing things is to google the problem and copy-paste some solution from the web. And of course, there is no guarantee that that piece of code has picked your way of doing things.

Monday, June 17, 2013

Submitting Hadoop jobs programmatically from anywhere

If you are a newcomer to Hadoop land, have you noticed that most of the examples for submitting Hadoop MapReduce jobs out there use the shell command - "hadoop jar ..."?

Although this way comes in handy, maybe even quite often, I personally would prefer if the default, most visible way were always the programmatic one, because there are various environments one wants to submit jobs from, and if you know how to code it, then you're always good to go.

Take these situations for example:
  • the code-run-debug cycle is always best done in your IDE, but unfortunately various IDEs have their own ways of doing things, and they often don't package our code in appropriate JARs
  • you would like to submit jobs from your non-Hadoop client machine to a remote Hadoop cluster - in other words, you don't even have the "hadoop" shell command available on your client
To be honest, it's not that programmatic examples are rare if you try to google them; it's more that they are too simplistic for real-world scenarios - e.g. they often just give the "word count" example that doesn't use any 3rd party libraries, which, if you really think about it, is not that useful at all, since any serious Hadoop developer uses some additional library/tool that offers a higher-level abstraction for job construction.

The problem

The thing is that submitting MapReduce jobs involves somehow providing the classes that constitute the job logic to the remote Hadoop cluster. These classes also have to be available on the client side when submitting the job, and the way we package them for the client application can often be totally unsuitable for submission to the Hadoop cluster - for example, these classes can be deployed in non-packaged form to the /WEB-INF/classes/ directory if the client application is a standard web application.

There are a couple of ways to provide the required classes to an MR job:
  1. Specify the path to a local JAR file - JobConf.setJar(String path)
  2. Specify an example class and let the library figure out which local JAR file is in play - JobConf.setJarByClass(Class exampleClass)
  3. Specify local JAR files in the "tmpjars" property of the job configuration, and each time, for each individual job, the Hadoop job client will automatically copy them to remote HDFS and add them to the distributed cache (this is the programmatic equivalent of what the -libjars option of the hadoop shell command does)
  4. Explicitly copy the required JAR files to HDFS during client application boot time, and after that add them to the distributed cache using DistributedCache.addFileToClassPath(hdfsJarPath, jobConfiguration) prior to any job submission; this is similar to the 3rd option, but we do the steps manually
All seems well, but let's dive into these options...

The 1st and 2nd options only submit a single JAR file, which is not enough if you use some higher-level library for constructing Hadoop MR jobs (such as Cascading), since then you need many JAR files as well as your custom job classes. That's the reason a lot of people package all of these together into a single "fat jar" and provide it to Hadoop. Your custom classes can go directly into this JAR file, whereas 3rd party libs can go into its /lib directory.

<some fat jar>
    my.company.myapp.MyClass1
    my.company.myapp.MyClass2
    ....
    lib/
        lib1.jar
        lib2.jar
        ....
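
For reference, options 1 and 2 boil down to a single call on the job configuration; in practice you would use one or the other, and MyJobDriver and the jar path below are just placeholders:

JobConf jobConf = new JobConf(new Configuration());
// Option 1: point the job at a concrete local JAR, e.g. the "fat jar" described above
jobConf.setJar("/path/to/myapp-fat.jar");
// Option 2: let Hadoop figure out which JAR on the client classpath contains the given class
jobConf.setJarByClass(MyJobDriver.class);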

The 2nd option can be problematic, since the Hadoop client will try to resolve the local JAR by introspecting the client application classpath, and as I said before, the client application can require different deployment and packaging that cannot use the mentioned fat jar. For example:
  • IDEs often run applications with custom classes in plain, non-archived form, so we don't even have them as JARs
  • a web application builds its classpath from the JARs residing in its /WEB-INF/lib directory, and if we just place the "fat jar" there, the web app will not find the 3rd party JARs contained in the fat jar's internal /lib directory
Using "fat jar" has some slight overhead in that we always send whole jar to Hadoop for each job we have, and you usually have a lot of jobs during any slightly more complex workflow.

The 3rd and 4th options use the distributed cache mechanism, which takes care of making 3rd party libraries available to submitted jobs during execution. This is also the way recommended by Cloudera. If you provide all (even your custom) classes as these 3rd party JARs, you can even omit specifying your own JAR via JobConf.setJar(...) or JobConf.setJarByClass(...), although the Hadoop client will still log a warning because of it.
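
For completeness, the 3rd option is roughly just setting a comma-separated list of local JAR URIs on the job configuration before submission; the paths here are placeholders:

Configuration configuration = new Configuration();
// programmatic counterpart of "hadoop jar ... -libjars lib1.jar,lib2.jar"
configuration.set("tmpjars", "file:///home/me/mapreducelib/lib1.jar,file:///home/me/mapreducelib/lib2.jar");
JobConf jobConf = new JobConf(configuration);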

I somehow prefer the 4th option because it gives me the most control, so we'll primarily focus on it.

The solution

All of this leads us to the idea that we should maybe have the job logic packaged in two separate ways: one for the client application and one for the Hadoop jobs. For example, if the client is a web application, you still place your classes in WEB-INF/classes or WEB-INF/lib as you usually do, but you also keep a separate directory containing these classes packaged in a way suitable for deployment to the Hadoop cluster. It seems redundant, but at least it doesn't bring any collisions between these two worlds.

Preparing map reduce dependencies

Since nowadays the standard way of using 3rd party libs in Java applications is via Maven repositories, one would usually like to isolate the job dependencies and prepare them for deployment to Hadoop. I use the Gradle build tool, so I'll show how you can achieve that with it.

First, we define a separate dependency configuration for the map/reduce dependencies. The default "compile" configuration should extend it.

 configurations {  
   mapreduce {  
     description = 'Map reduce jobs dependencies'  
   }  
   compile {  
     extendsFrom mapreduce  
   }  
 }  

Now you can define dependencies inside this new configuration. For example, if you're using the Cascading library, you can specify it like:

   mapreduce (  
       "cascading:cascading-core:${cascadingVersion}",  
       "cascading:cascading-local:${cascadingVersion}",  
   )  
   mapreduce ("cascading:cascading-hadoop:${cascadingVersion}") {  
     exclude group: "org.apache.hadoop", module: "hadoop-core"  
   }  
   compile (  
       "org.slf4j:slf4j-api:${slf4jVersion}",  
       "org.slf4j:jcl-over-slf4j:${slf4jVersion}",  
       "org.slf4j:log4j-over-slf4j:${slf4jVersion}",  
       "ch.qos.logback:logback-classic:1.0.+",  
       "org.apache.hadoop:hadoop-core:${hadoopVersion}",  
       "commons-io:commons-io:2.1"  
   )  
   
As you can see, I excluded "hadoop-core" from the "mapreduce" configuration since it is provided on the Hadoop cluster, but I had to include it again in "compile" since I need it on the client application classpath. This is somewhat the equivalent of the "provided" scope that some IDEs offer as a way to flag such libraries.

Now, if you don't like the "fat jar" and prefer separate JARs, we need to copy these "mapreduce" dependencies' JARs to a designated directory:

 task prepareMapReduceLibsInDir(type: Sync, dependsOn: jar) {  
   from jar.outputs.files  
   from configurations.mapreduce.files  
   into 'mapreducelib'  
 }  

Now we have these dependencies available in a separate directory ("mapreducelib"). Our custom classes (jar.outputs.files) are also packaged there, as a JAR file named by the archivesBaseName build script property.

All modern IDEs have a way to include a build task in their development cycle, so you should call this Gradle task whenever you change your custom map/reduce classes or the collection of 3rd party libraries.

Job submission helpers

The general description of job submission, as outlined previously in the 4th option, is:
  • first, copy all required local libraries to some specific HDFS directory
  • prior to any job submission, add these HDFS libraries to the distributed cache, which ultimately results in a Configuration instance prepared for the jobs
  • use that Configuration in your job client
All the utility methods for job submission are contained in a JobHelper class, available as a Gist.

Let's go through each of mentioned steps.

Copy all local libraries to HDFS

The local directory containing the JARs is the one we constructed via the Gradle prepareMapReduceLibsInDir task:

String localJarsDir = "./mapreducelib";
String hdfsJarsDir = "/temp/hadoop/myjobs/mylibs";
JobHelper.copyLocalJarsToHdfs(localJarsDir, hdfsJarsDir, new Configuration());

As said, this should be done only once, at the client application's boot time.
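
The JobHelper internals aren't shown in this post, but this first step amounts to roughly the following (a simplified sketch, not the actual Gist code):

// uses java.io.File, org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.Path
FileSystem fs = FileSystem.get(new Configuration());
fs.mkdirs(new Path(hdfsJarsDir));
for (File localJar : new File(localJarsDir).listFiles()) {
    if (localJar.getName().endsWith(".jar")) {
        fs.copyFromLocalFile(new Path(localJar.getAbsolutePath()), new Path(hdfsJarsDir, localJar.getName()));
    }
}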

Add HDFS libraries to distributed cache

Now we need to add the copied JAR files to the distributed cache, so they will be available to the Hadoop nodes that run the job tasks.

Configuration configuration = new Configuration();
JobHelper.hackHadoopStagingOnWin();
JobHelper.addHdfsJarsToDistributedCache(hdfsJarsDir, configuration);
I also called an additional method that is required for job submission to work correctly on Windows. This piece of code mostly came from the Spring Data for Hadoop library. If your client application is running on *nix, you can freely omit it.
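
As for addHdfsJarsToDistributedCache itself, it boils down to roughly this (a sketch, not the actual JobHelper code):

// uses org.apache.hadoop.fs.FileStatus and org.apache.hadoop.filecache.DistributedCache
FileSystem fs = FileSystem.get(configuration);
for (FileStatus jarStatus : fs.listStatus(new Path(hdfsJarsDir))) {
    DistributedCache.addFileToClassPath(jarStatus.getPath(), configuration);
}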

Use prepared Configuration

Now we can finally use the prepared Configuration instance from the prior step in our JobConf/JobClient:
JobConf jobConf = new JobConf(configuration);
....
JobClient.runJob(jobConf);

If you use some higher-level library that doesn't take a Configuration instance directly, but exposes some other way to pass Hadoop configuration properties to it, just extract these properties from the Configuration and pass them that way. For example, the Cascading library allows a Map of properties to be passed to it, so you can use the following method to construct that Map:
Map configurationProperties = JobHelper.convertConfigurationToMap(configuration);
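
Since Configuration is itself iterable over its entries, such a conversion can be as simple as this (a sketch of the idea, not necessarily the exact JobHelper implementation):

// uses java.util.HashMap and java.util.Map
Map<Object, Object> properties = new HashMap<Object, Object>();
for (Map.Entry<String, String> entry : configuration) {
    properties.put(entry.getKey(), entry.getValue());
}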

And that's pretty much it. Although there can be many more little things that will give you headaches when submitting jobs, I hope this post has expanded your arsenal of possible ways to tackle them.