
Increasing Search Performance in NoSQL Databases

Posted by otaviojava on October 13, 2012 at 7:38 PM PDT


    Nowadays there are many NoSQL databases, with different architectures and data structures. Despite this variety, most of them share one limitation: they search information efficiently only by key. A good option is to use another service to help the NoSQL database. This post shows how Lucene can work together with a NoSQL database, joining the two worlds in one application.
    Apache Lucene is an API for indexing and searching documents, written in Java. It works in two steps:

First: given a text, Lucene indexes the document, transforming the original text into keywords (terms) so that it can be searched easily and quickly.

Last: find the documents from the keywords produced in the first step.
    The biggest advantage of using Lucene is its high level of abstraction: the developer does not need to know the indexing algorithms. A Lucene index can also live in a distributed system, which gives the tool more scalability and performance.
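To make the two steps concrete, here is a minimal sketch of the idea in plain Java. It is not how Lucene is implemented and it does not use the Lucene API; the class TinyInvertedIndex and its methods are hypothetical names chosen only for this illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyInvertedIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Step 1: split the text into lowercase terms and record the document id under each term.
    public void index(String docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Step 2: look the term up and return the ids of the matching documents.
    public Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.index("resume-1", "Java developer, speaks English");
        index.index("resume-2", "Python developer");
        System.out.println(index.search("developer")); // [resume-1, resume-2]
        System.out.println(index.search("java"));      // [resume-1]
    }
}
```

The search never scans the original texts: it is a single map lookup, which is why the second step is fast.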

    With Lucene introduced, the next step is to join the two worlds: the NoSQL database and Lucene. For this, a sample application was created whose purpose is to store resumes. It follows the sequence below:

1. A user registers his or her resume.
2. The human resources analyst searches for a professional based on the resume.
3. The professional can be found by any information in the resume: address, skills, soft skills, languages, etc.


    The application will be a web application, so I am going to use the Java EE platform in its latest version, 6.

    In Lucene, indexes are stored through the Directory abstraction, which has implementations for storage in RAM and on hard disk. To keep access to the index fast, the application indexes in RAM, and to avoid losing information it backs the index up to the hard disk.

This is very simple to do with the EJB 3.1 timer feature (@Schedule): every so often, whatever is in memory is copied to the hard disk.



// Runs every minute: copies the in-memory index to disk so that it survives a restart.
@Schedule(minute = "*/1", hour = "*")
public void reindex() {
    try {
        Directory disco = FSDirectory.open(new File(Constantes.getIndexDirectory()));
        luceneManager.backup(directory, disco);
    } catch (Exception e) {
        Logger.getLogger(ScheduleService.class.getName()).log(Level.SEVERE, null, e);
    }
}



    This way, when the application goes down and comes back up, it loads everything from the hard disk into memory.

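import java.io.File;
import java.io.IOException;
import java.io.Serializable;
import java.util.logging.Level;
import java.util.logging.Logger;

import javax.enterprise.context.ApplicationScoped;
import javax.enterprise.inject.Produces;
import javax.inject.Inject;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

@ApplicationScoped
public class LuceneManager implements Serializable {

    private static final long serialVersionUID = -8280220793266559394L;

    // The in-memory index, exposed to other beans through CDI.
    @Produces
    private Directory directory;

    // CDI initializer method: called once when the container creates this bean.
    @Inject
    public void init() {
        directory = new RAMDirectory();
        try {
            levantarServico();
        } catch (IOException e) {
            Logger.getLogger(LuceneManager.class.getName()).log(Level.SEVERE, null, e);
        }
    }

    // Starts the service: loads the index saved on disk into memory.
    public void levantarServico() throws IOException {
        Directory disco = FSDirectory.open(new File(Constantes.getIndexDirectory()));
        backup(disco, directory);
    }

    // Copies every index file from one Directory to the other, keeping the file names.
    public void backup(Directory deDiretorio, Directory paraDiretoria) throws IOException {
        for (String file : deDiretorio.listAll()) {
            deDiretorio.copy(paraDiretoria, file, file);
        }
    }
}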



The next step is to create the index, which is responsible for the communication between the application and the NoSQL database. Lucene has the hierarchy below:

An index consists of Documents, and a Document consists of Fields, which in turn hold the information. A field in Lucene can be:

  • Stored and not analyzed: the most likely case for the key of the NoSQL database, because it must be kept in Lucene in its original form so that, later, all the fields can be recovered from the database.
  • Not stored and not analyzed: information that must be matched exactly (like a document number) or that is very short (like sex, with F for female and M for male).
  • Not stored and analyzed: information such as large texts (a book, for example), or cases where the search is for a portion of the text. The analysis process in Lucene consists of removing accents, punctuation and connectives (period, comma, 'and', etc.), reducing words to their radical (engineering and engineer both have engineer as radical), and other steps that make the search faster. It is not necessary to store the information, because it is already in the database.
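The analysis described in the last item can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions, not Lucene's analyzer: the class SimpleAnalyzer, its short stop-word list and its suffix rule are hypothetical, and a real stemmer is far more careful.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SimpleAnalyzer {
    // A tiny, hypothetical stop-word list; real analyzers ship much longer ones.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "in"));

    // Turns raw text into search terms: strip accents, lowercase, drop punctuation
    // and stop words, then reduce each word with a very naive stemming rule.
    public static List<String> analyze(String text) {
        String noAccents = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
        List<String> terms = new ArrayList<>();
        for (String token : noAccents.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(stem(token));
            }
        }
        return terms;
    }

    // Naive stemming: remove a trailing "ing" from longer words. Illustration only.
    public static String stem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Engineering and Jos\u00e9, the engineer"));
        // [engineer, jose, engineer]
    }
}
```

Because the same analysis is applied to the query, a search for "engineer" also matches a resume that only says "engineering".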




  
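private Document criarDocumento(Pessoa pessoa) throws IOException {
    Document document = new Document();

    // State: stored and indexed as a single exact-match term.
    document.add(new Field(Constantes.ESTADO_INDICE, pessoa.getEndereco().getEstado(),
            Store.YES, Index.NOT_ANALYZED_NO_NORMS));
    // ID: the key of the record in the NoSQL database, stored so it can be read back from a hit.
    document.add(new Field(Constantes.ID_INDICE, pessoa.getNickName(),
            Store.YES, Index.NOT_ANALYZED_NO_NORMS, TermVector.WITH_POSITIONS_OFFSETS));
    // Full resume text: analyzed for searching, not stored because it lives in the database.
    document.add(new Field(Constantes.TUDO, getConteudoCurriculo(pessoa),
            Store.NO, Index.ANALYZED));

    return document;
}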



The lifecycle of the application follows a simple sequence:

  • The user fills in the information and sends it to the server.
  • The information is persisted in Cassandra.
  • The information is indexed in Lucene, which also stores the ID.
  • When the human resources analyst runs a search, Lucene finds a document, the key is read from it, and finally all the information is recovered from Cassandra.
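The sequence above can be sketched with two in-memory maps, one standing in for Cassandra and one for the Lucene index. Everything here is an assumption made for the illustration: the class ResumeSearchFlow and its methods are hypothetical, and the real application talks to Cassandra and Lucene instead of HashMaps.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ResumeSearchFlow {
    // Stand-in for the NoSQL database: the full resume is stored by key.
    private final Map<String, String> store = new HashMap<>();
    // Stand-in for the Lucene index: term -> keys of the resumes containing it.
    private final Map<String, Set<String>> index = new HashMap<>();

    // Persist the resume, then index its words together with its key.
    public void register(String key, String resume) {
        store.put(key, resume);
        for (String term : resume.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(key);
            }
        }
    }

    // Ask the index for the keys, then fetch each full resume from the store by key.
    public List<String> search(String term) {
        Set<String> keys = index.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<String>emptySet());
        List<String> resumes = new ArrayList<>();
        for (String key : keys) {
            resumes.add(store.get(key));
        }
        return resumes;
    }

    public static void main(String[] args) {
        ResumeSearchFlow flow = new ResumeSearchFlow();
        flow.register("ana", "Java developer in Salvador");
        flow.register("joe", "Python developer");
        System.out.println(flow.search("java")); // [Java developer in Salvador]
    }
}
```

The index only ever returns keys, so the database is always read the way it is fastest: by key.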

Done! This way both the insertion and the search for data will be faster. Cassandra has secondary indexes that can be used to search beyond the keys, but doing so makes the search slower than using only the key. Another way to get quick searches is to use a cache.
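As a rough idea of what such a cache could look like, here is a minimal least-recently-used cache built on java.util.LinkedHashMap. The sample application does not include it; the class SearchCache and the sizes used are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SearchCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    SearchCache(int maxEntries) {
        super(16, 0.75f, true); // true = keep entries ordered by most recent access
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion: drop the least recently used entry when full.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        SearchCache<String, String> cache = new SearchCache<>(2);
        cache.put("java", "ana");
        cache.put("python", "joe");
        cache.get("java");          // touch "java" so "python" becomes the eldest entry
        cache.put("scala", "bia");  // cache is full: "python" is evicted
        System.out.println(cache.keySet()); // [java, scala]
    }
}
```

A cache like this only helps for repeated searches, and its entries have to be discarded when a resume changes.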
  This article discussed a problem found in most NoSQL databases: searching on fields other than the key. To solve it, an approach that works together with Lucene was shown.

Source: http://softwarelivre.org/otagonsan/codigofonte/cassandra-lucene.rar