Skip to main content

Inside source code crawler world: the evolution of searching, mining and retrieving of software components

Posted by mayworm on December 19, 2006 at 12:32 PM PST

Software developers are facing a big challenge in today's software development area, which is: developing and delivering, often complex, high quality running software using multiple object-oriented (OO) frameworks, libraries and application programming interfaces (APIs) in very short period of time. This large spectrum of possibilities results in a steep learning curve due to its huge amount of information – classes, methods, interfaces, relations, dependencies and so forth – to be assimilated. At the same time, companies are struggling to fit into their tight schedules while dealing with pressure to reach deadlines and deliver products to market earlier than its competitors. This creates a good opportunity to make use of mechanisms that might improve the quality and productivity of development by using existing software assets.

Fabiano Cruz and I decided to talk about the evolution of code search and how shareable is the source code through the world nowadays. We will split up this post in three or more essays. In this first essay, we are going to classify the evolution of source code search engines. So, please, feel free to drop us a line spreading your opinion about this topic.

It's an ordinary habit, developers ask for a source code sample to others colleagues. For instance: if you need to create a simple JTree, it's easier and faster to ask someone to send a sample code like Listing 1, instead of trying to learn from the API. Who has never asked for a code snippet to a friend?

DefaultMutableTreeNode root = new DefaultMutableTreeNode("Root"); 
DefaultMutableTreeNode child1 = new DefaultMutableTreeNode("Child 1");

DefaultMutableTreeNode child2 = new DefaultMutableTreeNode("Child 2");

JTree tree = new JTree(root);
someWindow.add(new JScrollPane(tree));

Listing 1 - JTree sample code snippet

We could roughly compare the Advanced Research Projects Agency Network (ARPANET) in the beginning of the WWW project without Google Search Engine with the Software Development area without a powerful mechanism to search, mine and retrieve source code samples. Indeed we hadn't 25 billion web pages to index as we did not have all those fancy languages and frameworks in the software development field as we have today.

So, there are several tools available to empower developers with an easy-to-use interface to search for source code snippets and share source code among other developers. For instance, developers can make their source code available to a lot of people through 'forge' repositories, blogs, articles and so on.

In order to classify the evolution of source code search engines, we could say that the First Generation of these tools were the popular Unix command: find (see Listing 2 below). The find command allows the Unix user to process a set of files and/or directories in a file subtree. It can be used to search certain code snippet in a project directory, providing support for coding by "copy-and-paste" or the "develop by example" practice, or you could even call it [reuse].

find . –name *.c -exec grep "hashtable" '{}' \; -print

Listing 2 - The find command

The basic idea is to run the find command on a directory in order to locate a set of files that match a shell pattern or a regular expression. Run 'grep' for each file and look for a certain pattern. Matched lines, including some context, are collected and displayed for visualization. The user may inspect an entire file if he/she wishes.

Analyzing the code search tools evolution, the Second Generation was the traditional search engines which are based on information retrieval technologies. There, the developers could find and retrieve source code samples. Nevertheless, these tools implement techniques such as text relevance and link analysis that can't be integrally applied to source code. So, while the developers are looking for a small source code piece, these tools show plain text as result. Clearly it doesn't apply different concepts and techniques for processing and searching for structured source code, determining the probable intent of a search, leaving to developers the result analysis job. Of course, this type of search has its drawbacks.

Some freshest code search tools (see Table 1 below), and here we could say Third Generation, appreciates the way of thinking to enrich the support and service which will definitely improve the search quality. Currently, the way to automatically retrieve such volumes of source code data is through the use of a focused Web crawler, which is a program that visits many repositories (public source code repositories like and sourceforge) and scans their content, visiting and indexing more source code.

Google Code Search

Table 1 – Source code search engines

There are others requirements to take in account while effectively digging for source code, like: Metadata terms, class name, imports, code block, comment block, method return and others. For instance, a Java source code provides a good way to get information to index and crawler, like anotations, code conventions (standard conventions that Sun follows and recommends - and so forth. We can't forget about Fingerprints, which are an interesting feature provided by some Code Search tools. It denotes both the presence and multiplicity of specific programming constructs within individual code entities. Some code search tools make available several ways of fingerprint search, for example, Micro Patterns in Java Code, which provides information as to whether or not simple design patterns are present within a code entity.

The other special feature available in some code source search engine is a kind of binder, where developers can save search results and share them with others, providing benefits not available in traditional search engines. We can imagine an automatic Agent notifying you when additions are made to your binder.

In the last couple of years source code specialized search have been emerged. These tools support different programming languages, and you can filter by license, besides many others interesting features.

When you ask developers the more excited users of source code search engines what is the benefit of using such tools, they point out "reuse", and a lot of them are Java developers. Indeed, this is explained due to the fact that the Java community is accomplished through the process of sharing and reuse.

Another thing that has been emerging is tools for code search directly integrated into the development environment, providing contextually search capabilities where, depending on what the developer is suppose to do on that part of the code, these tools can query a sample repository for code snippets that are relevant to the programming task at hand. For instance, if you are writing a MD5 method within a class, these tools can infer queries from context, so there is no need for the programmer to write queries once the most structurally relevant snippets are returned to the developer and they can be navigated.

Undoubtedly, tools like that can help large organizations to go beyond reuse barriers, empowering them with a straightforward approach for searching, mining and retrieving software components and source code snippets across its, source code repositories (usually distributed). Also, many software developers can quickly find helpful and useful information for building even more high-scale and pluggable software components, keeping them focused on the problem domain at hand.

Last but not least. We hope you enjoyed the reading, and possibly learned a few things. You are also welcome and encouraged to drop your comments here. In the next post, we’re planning to go deeply in the Third Generation source code crawler world, comparing the existing tools, its strengths and weakness, particularly regarding the Java language.