Posted by yoavl
on December 16, 2010 at 3:16 AM PST
Tracking Artifact Licenses - Why is this Hard?
Tracking licenses of third-party artifacts is not one of those tasks that get developers excited. With more interesting problems to solve than legal issues, it is not usually high on the priority list for most teams to deal with licenses during active development, so more often than not, this is left as one of the final steps before preparing a release.
Even when you do try to exercise due diligence and track those third party licenses, making sure that all developers verify each dependency and its transitive dependencies for compatibility with your company’s license usage policy is not a trivial thing to do. Eventually this results in manually digging through each and every dependency in the project and attempting to accurately keep track of the license that each dependency uses.
Now, if you are only developing in-house projects, then this may not seem like a big deal, but once you begin distributing your software, even as a cloud service, the risk of using a third party dependency that uses an unwanted license is a reality.
License Information is Out There - Module Info to the Rescue!
Getting the initial license information for third party dependencies doesn’t have to be a manual process - with modular dependencies there is already good information out there that we can leverage!
Maven, Ivy (+Ant), and Gradle (which uses Ivy) all describe artifacts and dependencies in terms of reusable declarative modules. Both Maven POM files and Ivy descriptor files are designed to contain license information as part of the module metadata. And, in fact, many open source libraries already include valuable license information in their descriptors. Potentially, that means that extracting license information from module metadata can be fully automated!
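To make this concrete, here is a minimal sketch of reading the license block out of a Maven POM with Python's standard library. The sample POM (org.example:demo-lib) and the function name extract_licenses are made up for illustration; only the `<licenses>`/`<license>`/`<name>`/`<url>` elements and the POM 4.0.0 namespace come from the Maven POM format itself.

```python
import xml.etree.ElementTree as ET

# Namespace used by Maven POM 4.0.0 files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical POM for illustration; not taken from a real project.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>demo-lib</artifactId>
  <version>1.0</version>
  <licenses>
    <license>
      <name>The Apache Software License, Version 2.0</name>
      <url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
    </license>
  </licenses>
</project>
"""


def extract_licenses(pom_text):
    """Return (name, url) pairs declared directly in a POM's <licenses> block."""
    root = ET.fromstring(pom_text)
    found = []
    for lic in root.findall("m:licenses/m:license", POM_NS):
        name = lic.findtext("m:name", default="", namespaces=POM_NS).strip()
        url = lic.findtext("m:url", default="", namespaces=POM_NS).strip()
        found.append((name, url))
    return found


for name, url in extract_licenses(SAMPLE_POM):
    print(name, url)
```

Note that this only sees licenses declared in the POM itself. A POM with no `<licenses>` block yields an empty list.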
Relying on Module Metadata - Not Quite There Yet...
In practice, there are a couple of issues with purely relying on license information from Java modules:
- License naming zoo - Current Java module systems (POMs/Ivy files) define license information as free-form text with no specific standard - unlike, for example, Python PyPI modules, which use a closed list of OSI licenses.
For example, the Apache 2 license may appear under its full name, ‘The Apache Software License, Version 2.0’, as ‘Apache License, Version 2.0’, or as ‘Apache License V2.0’, etc.
This makes identifying the correct license hard and requires the use of heuristics.
- Missing license data - this requires the ability to manually determine a license for an artifact, rather than relying on auto-discovery.
- Wrong license data - like missing license data, this requires the ability to manually and permanently override an auto-discovered license for an artifact.
- License data may be implicit - for example, the license may reside in a parent POM or in a mixin descriptor. This requires traversal of the module inheritance chain to discover the real license.
- Multiple licenses - it is not uncommon for artifacts to have more than a single license (e.g. CDDL v1.0 and GPL v2). In this case, you need to decide which license is the applicable active one.
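The implicit-license case in the list above can be sketched as a walk up the module inheritance chain. This is only an illustration under stated assumptions: the MODULES table, its module IDs and the resolve_licenses helper are hypothetical, and a real implementation would load parent POMs or Ivy descriptors from a repository rather than from a dictionary.

```python
# Hypothetical in-memory module metadata keyed by "group:artifact:version".
# In practice these entries would come from parsed POM or Ivy descriptors.
MODULES = {
    "org.example:parent:1": {
        "parent": None,
        "licenses": ["Apache License, Version 2.0"],
    },
    "org.example:lib-base:1": {"parent": "org.example:parent:1", "licenses": []},
    "org.example:lib:1": {"parent": "org.example:lib-base:1", "licenses": []},
    "org.example:orphan:1": {"parent": None, "licenses": []},
}


def resolve_licenses(module_id, modules):
    """Walk up the parent chain until some module declares licenses."""
    seen = set()
    current = module_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles in the metadata
        module = modules.get(current)
        if module is None:
            break  # parent metadata is not available
        if module["licenses"]:
            return list(module["licenses"])
        current = module["parent"]
    return []  # nothing found: this artifact needs a manual decision


print(resolve_licenses("org.example:lib:1", MODULES))
print(resolve_licenses("org.example:orphan:1", MODULES))
```

An empty result corresponds to the "missing license data" case, where someone has to set the license by hand.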
Managing Licenses with an Artifact Repository
Many organizations already manage their published artifacts and dependencies in a central Artifact Repository, such as JFrog's Artifactory. The repository keeps all the organization’s binaries, which are used by the developers and by the build system.
Apart from managing the binary data itself, Artifactory also manages metadata about artifacts.
Managing license information about artifacts as part of this metadata just seems the natural thing to do:
By using the artifact repository we can tag our artifacts with license information managed in a central place. Adding this license metadata can be fully automated and can also be controlled by users!
This is, in fact, exactly the approach taken by the License Control feature in Artifactory Pro, and it solves all the previously mentioned issues related to license information extraction.
This is how it works:
- Artifactory maintains a list of all well-known licenses. Users can extend this list with custom licenses. Each license can be approved or unapproved.
- Each license contains a regular expression used to match against free-form license information inside module files in the repository (the many licenses bundled with Artifactory have been tested and fine-tuned against numerous open source projects).
- Artifacts can be tagged with license information - manually or automatically. Once tagged, this information is reusable. Automatically discovered license information can be overridden by users. Multiple licenses per artifact are supported.
- License discovery and reporting merges automatic license extraction with user-defined license data to initiate license violation alerts.
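The idea of a license list where each entry carries a regular expression and an approval flag can be sketched roughly as follows. The patterns, the keys (Apache-2.0, GPL-2.0, CDDL-1.0) and the approved/unapproved choices are assumptions made up for this example. They are not the expressions bundled with Artifactory, and they are far less thoroughly tuned.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    key: str              # short identifier for the license
    pattern: re.Pattern   # matches free-form names found in module files
    approved: bool        # hypothetical company policy


VERSION_2 = r"\bv(ersion)?\s*2(\.0)?\b"
VERSION_1 = r"\bv(ersion)?\s*1(\.0)?\b"

# Illustrative registry; users could append custom entries to this list.
REGISTRY = [
    License("Apache-2.0",
            re.compile(r"\bapache\b.*\blicen[cs]e\b.*" + VERSION_2, re.I), True),
    License("GPL-2.0",
            re.compile(r"\b(gpl|gnu general public licen[cs]e)\b.*" + VERSION_2, re.I), False),
    License("CDDL-1.0",
            re.compile(r"\b(cddl|common development and distribution licen[cs]e)\b.*"
                       + VERSION_1, re.I), True),
]


def identify(free_text):
    """Return the first registry entry whose pattern matches, or None."""
    for lic in REGISTRY:
        if lic.pattern.search(free_text):
            return lic
    return None


for name in ("The Apache Software License, Version 2.0",
             "Apache License, Version 2.0",
             "Apache License V2.0",
             "GPL v2",
             "Some Custom License"):
    match = identify(name)
    print(name, "->", match.key if match else "unknown")
```

All three spellings of the Apache 2 license mentioned earlier resolve to the same entry, while a name that matches nothing is reported as unknown so that a user can tag it manually.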
Discovering Licenses Automatically - Build Servers Never Lie
Automatic license discovery and notification of possible violations are done as an integral part of the Continuous Integration process.
Whenever a developer introduces a new dependency, it gets picked up on the next build, which triggers automated license analysis. If the dependency uses an unknown or unapproved license, an email notification is sent to specified users.
This is all possible using Artifactory’s comprehensive build integration with Jenkins (formerly Hudson), JetBrains TeamCity and Atlassian Bamboo, and works for Maven 2 & 3 builds on each build server.
When installing the Jenkins Artifactory plugin, for example, you will get the option to run license checks as part of the build (identical functionality exists for TeamCity and Bamboo).
The Full Cycle - From Modules to Automated License Checks
Here is how it all works together to automatically extract and apply licensing information and conduct license violation checks on the fly:
A developer declares new dependencies in pom.xml files or ivy.xml descriptors (1). Once the changes are declared the developer commits them to the Version Control System (2).
The CI build server monitors version controlled files, sees the changes and pulls them to its workspace (3), which triggers a build (4).
The build is run and intercepted by the Artifactory plugin (for the relevant CI server). The data intercepted is a complete BuildInfo for the build (acting as a bill of materials), including information about all resolved dependency artifacts (5).
Note: It is important to realize that the context of a build is the only reliable source of information for the actual dependencies used by your project, since dependency resolution can be dynamic and rely on dynamic aspects like version ranges, the state of the repository at the time of build, resolved properties, etc.
The Artifactory plugin publishes all modules with the captured BuildInfo to Artifactory (6). This is where things start to get interesting -
Artifactory looks at the dependencies and for each artifact attempts to figure out what licenses it uses (7). This is done by combining: license information from module metadata, previously found license information and user-set license information. It is even possible to tell Artifactory the exact build scopes/configuration for which dependencies need to be checked.
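The combination step can be pictured with a small sketch. Everything here is an assumption for illustration: the precedence order (user-set data first, then previously found data, then module metadata), the rule that an artifact passes if at least one of its licenses is approved, the APPROVED policy, and all artifact names. It is not Artifactory's actual algorithm.

```python
APPROVED = {"Apache-2.0", "CDDL-1.0"}  # hypothetical company policy


def effective_licenses(artifact, user_set, previously_found, from_metadata):
    """Pick the license list for one artifact; user-set data wins."""
    for source in (user_set, previously_found, from_metadata):
        licenses = source.get(artifact)
        if licenses:
            return licenses
    return []


def find_violations(dependencies, checked_scopes,
                    user_set, previously_found, from_metadata):
    """Return {artifact: reason} for dependencies in the checked scopes."""
    violations = {}
    for artifact, scope in dependencies:
        if scope not in checked_scopes:
            continue  # e.g. skip test-only dependencies
        licenses = effective_licenses(
            artifact, user_set, previously_found, from_metadata)
        if not licenses:
            violations[artifact] = "unknown license"
        elif not any(lic in APPROVED for lic in licenses):
            violations[artifact] = "unapproved license: " + ", ".join(licenses)
    return violations


# Hypothetical resolved dependencies of one build: (artifact, scope).
DEPENDENCIES = [
    ("org.example:lib-a:1.0", "compile"),
    ("org.example:lib-b:2.0", "compile"),
    ("org.example:lib-c:0.9", "test"),
    ("org.example:lib-d:3.1", "runtime"),
    ("org.example:lib-e:1.2", "compile"),
]
FROM_METADATA = {
    "org.example:lib-a:1.0": ["Apache-2.0"],
    "org.example:lib-b:2.0": ["GPL-2.0"],
    "org.example:lib-c:0.9": ["GPL-2.0"],
    "org.example:lib-e:1.2": ["GPL-2.0"],  # wrong data in the descriptor
}
PREVIOUSLY_FOUND = {}
USER_SET = {"org.example:lib-e:1.2": ["Apache-2.0"]}  # manual override

print(find_violations(DEPENDENCIES, {"compile", "runtime"},
                      USER_SET, PREVIOUSLY_FOUND, FROM_METADATA))
```

In this sketch lib-b is flagged as unapproved, lib-d as unknown, lib-c is skipped because its scope is not checked, and lib-e passes because the user override replaces the descriptor's license.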
At the end of the analysis an email with all license violations discovered is sent out to the configured recipient addresses (8). Normally this would be the development lead or the project lead and not someone from legal.
Although there may be license violations, the build will not fail. This approach allows development to move on naturally, while letting development leads discover possible licensing discrepancies immediately as they surface and deal with them before they become an issue. To pass the information beyond the development circle, you can generate license usage reports (9) to incorporate into the legal department’s favorite Excel template. Effectively, this means that you never have a single artifact in your project that was not verified for license information prior to submitting it!
Using the power of a central repository manager like Artifactory, we can extract important license information and combine it with user definitions in order to automate the process of license governance. This is done in the context of a project build executed automatically by the CI server upon changes in the version control system. This ensures that all possible license violations are handled immediately when new or modified dependency declarations are checked in.
The approach taken towards license control is developer-oriented - never stop the build, but let development leads decide per new dependency whether it can go into the project, before the information is generated and transferred for legal approval.
You can read more about the Artifactory License Control feature on the JFrog wiki, or watch this short video to see the full cycle described here in action.