Friday, October 29, 2010

Natural language processing in Clojure, Go and Cython

I work in natural language processing, programming in C# 3.5 and Python. My work includes classification, named entity recognition, sentiment analysis and information extraction. Both C# and Python are great languages, but I do have some unmet needs. I investigated whether any newer languages would help. I only looked at minimal languages that would be simple to learn. The three top contenders were Clojure, Go and Cython. Both Clojure and Go have innovative approaches to non-locking concurrency. These are my first impressions of working with these languages.

For contrast let me start by listing the features of my current languages.

C# 3.5

C# is an advanced object-oriented / functional hybrid language and programming platform:
  • It is fast
  • Great development environment
  • You can do almost any task in it
  • Great database support with LINQ to SQL
  • Advanced web development with ASP.NET
  • Advanced GUI toolkit with WPF
  • Good concurrency with threading library
  • Good MongoDB library
Issues
  • Works best on Windows
  • Not well suited for rapid development
While many features of C# are not directly related to NLP, they are very convenient. C# has some NLP libraries: SharpNLP is a port of OpenNLP from Java, and Lucene has also been ported. The ports lag behind the Java implementations, but still give a good foundation.

Python

Python is an elegant scripting language, with a strong focus on simplicity.
  • NLTK is a great NLP library
  • Lots of open source math and science libraries
  • PyDev is a good development environment
  • Good MongoDB library
  • Great for rapid development
Issues
  • It is interpreted and not very fast
  • Problems with the GIL-based threading model

    C# vs. Python and unmet needs

    I was not sure which language I would prefer to work with. I suspected that C# would win out with all its advanced features. Due to demand for fast turnaround, I ended up doing more work in Python, and have been very happy with that choice. I have a lot of scripts that can be piped together to create new applications, with the help of the very fast and flexible MongoDB.
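As a rough illustration of the piping style, here is a minimal sketch of one such script in Python. The function names, the token pattern and the tab-separated output format are assumptions made for illustration, not the actual scripts, and the MongoDB step is left out.

```python
import re
import sys
from collections import Counter

# Assumed token pattern: runs of letters, digits and apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")


def tokenize(lines):
    """Yield lowercase tokens from an iterable of text lines."""
    for line in lines:
        for match in TOKEN_RE.finditer(line):
            yield match.group(0).lower()


def count_tokens(tokens):
    """Count token frequencies; a second script in the pipe could do this stage."""
    return Counter(tokens)


def main():
    # Read text from stdin and write "token<TAB>count" lines to stdout,
    # so the script can sit in the middle of a shell pipeline.
    for token, count in count_tokens(tokenize(sys.stdin)).most_common():
        sys.stdout.write(f"{token}\t{count}\n")
```

Saved under a hypothetical name such as count_tokens.py with a call to main() at the bottom, it could be used as `cat corpus.txt | python count_tokens.py | head`.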

    I do have some concerns about Python moving forward:
    • Will it scale if I get a really large amount of text
    • Will speed improve on multi core processors
    • Will it work with cloud computing
    • Part of speech tagging is slow


    Java

    Java is a modern object oriented language. Like C# it is a programming platform:
    • Has most NLP libraries: OpenNLP, Mahout, Lucene, WEKA
    • It is fast
    • Great development environment: Eclipse and NetBeans
    • You can do almost any task in it
    • Great database support with JDBC and Hibernate
    • Many web development frameworks
    • Good GUI toolkit: Swing and JavaFX
    • Good concurrency with threading library
    Issues
    • Functional style programming is clumsy
    • Working with MongoDB is clumsy
    • Java code is verbose

    I would not hesitate using Java for NLP, but my company is not a Java shop.

    Clojure

    Clojure was released in 2007. It is a right-sized LISP: not very big like Common LISP, nor very small like Scheme.
    • Gives easy access to Java libraries: OpenNLP, Mahout, Lucene, WEKA, OpinionFinder
    • Innovative non locking concurrency primitives
    • Good IDEs in Eclipse and NetBeans
    • Easy to work with
    • Code and data are unified
    • Interactive REPL
    • LISP is the classic artificial intelligence language
    • If you need speed you can write Java code
    • Good MongoDB library
     Issues
    • The IDE is not working as well as IDEs for Java or C#

      Clojure is minimal in the sense that it is built on around 10 primitive programming constructs. The rest of the language is constructed with macros written in Clojure.

      Once I got Clojure installed, it was easy to work with and program in. Most of the good features of Python also apply to Clojure: it is minimal and has batteries included. Still, I think that Python is a simpler language than Clojure.

      Use case
      Clojure is a good way to script the extensive Java libraries, for rapid development. It has more natural interaction with MongoDB than Java.

      Clojure OpenNLP

      The clojure-opennlp project is a thin Clojure wrapper around OpenNLP. It came with all the corpora used as training data for OpenNLP nicely packaged, and it works well. You can script OpenNLP about as tersely as NLTK, from an interactive REPL.

      I tried it in both Eclipse and NetBeans. They seem somewhat equal in number of features. I had a little better luck with the Eclipse version.

      clojure-opennlp uses a Maven build system but has a nontraditional directory layout. This caused problems for both Eclipse and NetBeans, and both took some configuration.

      Eclipse Counterclockwise
      The Counterclockwise instructions for labrepl mostly worked for installing clojure-opennlp.
      Once that was done, I had to add the example directory to the source directories under the project properties.

      NetBeans Enclojure
      I imported the project. I had to move the Clojure file from the example directory to a different location to get it to work.

      Maven plugins for Clojure
      The standard Maven directory layout has several advantages, e.g. if you want to mix Java and Clojure code. I created my own Maven POM configuration file, based on examples from other Clojure Maven projects. Those projects used Clojure plugins for Maven, which I could not get to work. Eventually I ripped the plugins out and was left with a very plain POM file that worked.

      Go / Golang

      Go was announced in November 2009. It was created at Google to deal with multicore and networked machines. It feels like a mixture of Python and C. It is a very clean and minimal language.
      • It is fast
      • Good standard library
      • Excellent support for concurrency
      • It is trivial to write your own load balancer
      Issues
      • The Eclipse IDE is in an early stage
      • Debugger is not working
      • The Windows port has just been released and is not finished
      It was hard to find the right Go Windows port; there are several Go Windows port projects with no code.

      Use cases
      I currently have a problem when downloading a lot of HTML pages and parsing them into a tree structure. This is not well supported in C#. I found a library that translates HTML to XHTML, and then I can use LINQ to process it. The library is not documented, not very fast and fails for some HTML files.

      Go comes with an HTML library that parses HTML5. It is simple to write a program in which some threads download the files and others parse them into a DOM tree structure.
      I would use Golang for loading large amounts of text in a cloud computing environment.
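The shape of such a program, a few downloaders feeding a parser through a queue, can be sketched as follows. This is written in Python rather than Go, purely to illustrate the structure; in Go the queues would be channels and the threads goroutines. The `fetch` parameter, `run_pipeline` and `TagCollector` are hypothetical names, and the parser only records start tags instead of building a full DOM tree.

```python
import queue
import threading
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect start tags in document order (a stand-in for building a tree)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def run_pipeline(urls, fetch, n_downloaders=2):
    """Download pages in worker threads and parse them in a separate thread.

    `fetch` is a caller-supplied function mapping a URL to an HTML string
    (hypothetical; in real use it could wrap urllib.request.urlopen).
    """
    url_q = queue.Queue()
    page_q = queue.Queue()
    results = {}

    def downloader():
        while True:
            url = url_q.get()
            if url is None:  # sentinel: no more URLs
                break
            page_q.put((url, fetch(url)))

    def parser():
        while True:
            item = page_q.get()
            if item is None:  # sentinel: all downloads are finished
                break
            url, html = item
            collector = TagCollector()
            collector.feed(html)
            results[url] = collector.tags

    workers = [threading.Thread(target=downloader) for _ in range(n_downloaders)]
    parse_thread = threading.Thread(target=parser)
    for t in workers + [parse_thread]:
        t.start()
    for url in urls:
        url_q.put(url)
    for _ in workers:
        url_q.put(None)  # one sentinel per downloader
    for t in workers:
        t.join()
    page_q.put(None)  # downloads are done, so the parser can stop
    parse_thread.join()
    return results
```

Passing `fetch` in as a parameter keeps the sketch testable without network access, for example by handing it a dictionary lookup over canned pages.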

      Cython

      Cython was released in July 2007. It is a static compiler for writing Python extension modules in a mixture of Python and C.

      Process for using Cython
      • Start by writing normal Python code
      • Find modules that are too slow
      • Add static types
      • Compile it with Cython using the setup tool
      • This produces compiled modules that can be used with normal Python
      Issues
      • It is still more complex than normal Python code
      • You need to know C to use it
      I was surprised how simple it was to get it working under both Windows and Linux. I did not have to mess with makefiles or configure the compilers. Cython integrates well with NumPy and SciPy. This substantially expands the programming tasks you can do with Python.
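The "find modules that are too slow" step in the process above can be sketched with the profiler from the Python standard library. The `slow_tag` function is a made-up toy stand-in for a slow tagger, not a real POS tagger, and `profile_call` is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def slow_tag(tokens):
    """Toy stand-in for a slow tagger: capitalized tokens get NNP, others NN."""
    tags = []
    for tok in tokens:
        # Deliberately wasteful inner loop so the function shows up in the profile.
        score = 0
        for ch in tok * 50:
            score += ord(ch)
        tags.append((tok, "NNP" if tok[:1].isupper() else "NN"))
    return tags


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text).

    The report lists the top entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()
```

Calling `profile_call(slow_tag, tokens)` and reading the report shows which functions dominate the run time. Those are the candidates for moving into a Cython module and annotating with static types.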

      Use cases
      Speed up slow POS tagging.

        My previous language experience

        Over the years I have experimented with a long list of non mainstream languages: functional, declarative, parallel, array, dynamic and hybrid languages. Many of these were frustrating experiences. I would read about a new language and get very excited. However this would often be the chain of events:
        • Download language
        • Install Cygwin
        • Find out how the language's build system works
        • Try to find a version of the GCC compiler that will compile it
        • Get the right version of Emacs installed
        • Try to get the debugger working under Emacs
        • Start programming from scratch since the libraries were sparse
        • Burn out

        You only have so much mental capacity, and if you do not use a language you forget it. Only Scala made it into my toolbox.

        Do Clojure, Go or Cython belong in your programmer's toolbox?

        Clojure, Go and Cython are all simple languages. They are easy to install and easy to learn, and they all have big standard libraries, so you can be productive in them right away. This is my first impression:
        • Clojure is a good way to script the extensive Java libraries, for rapid application development and for AI work.
        • Go is a great language but it is still rough around the edges. There are not any big NLP libraries written for Go yet. I would not try to use it for my main NLP tasks.
        • Cython was useful right away for my NLP work. It makes it possible to do fast numerical programming in Python without too much trouble.


        -Sami Badawi

        10 comments:

        Paul Cowan said...

        I am using opennlp from Jruby so that is another option. I do not have to create a wrapper and can just access it direct.

        Alex Ott said...

        just fyi: I added your blog (by Clojure label) into Planet Clojure, so if you'll write more about this language, all posts will be fetched into it

        Regarding language itself - I often use Clojure for prototyping, and very easy work with Java libraries is useful. I'm also interested in NLP-related topics, although I'm only starting my journey into this world, so it could be interesting to see more work in this branch in Clojure
        About project management - for Clojure it could be much easier to use Leiningen instead of Maven, especially for simple projects. Although, I personally use Maven to build complex projects

        Laurens Van Houtven said...

        I do not understand why you say you need to know C to use Cython. I use both (and know both reasonably well), but I have never had the impression that I really needed to know my C to get great effect with Cython.

        Anonymous said...

        I've just built my first app using Cython... I don't know any C and got a little speed up but I think I would have got more from it if I did know a bit more C.

        Paul Hobbs said...

        I agree with Alex; use Leiningen instead of Maven. The layout of clojure-opennlp is actually a pretty conventional clojure setup. If you have leiningen, you can use "lein uberjar" to make a standalone jar for use in eclipse or netbeans; this is much easier than mucking around with maven.

        Kamil Kisiel said...

        For multicore processing in Python you'd be well served to use the multiprocessing module instead of threading. This avoids all the GIL issues altogether.

        If you want to take things even further, you can implement a work-queue model using ZeroMQ and even go multi-machine as well as multi-process without a whole lot of additional effort.
