Author

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.

Entries in opensource (10)

Monday
Feb 16, 2015

Super-fast flat file parsing in C# and Java with a perfect hash function

At Lokad, (almost) all we do is crunch flat text files. It's not that we haven't tried anything else - we did - many times - and it went poorly. Flat files are ubiquitous, well understood, and they yield very good performance on both the write side and the read side when working under tight budgets.

Keep in mind that the files we crunch are frequently generated by our clients, so while ProtoBuf or Cap'n Proto are very cool, asking our clients to deliver such formats would be roughly equivalent to asking them to reimplement their in-house Java ERP in Haskell. To preserve the sanity of our clients, we keep it simple and we stick to flat files.

However, we decided to make flat file reads fast, really fast. Thus, one of us decided to tackle the challenge head-on, and came up with a very nice pattern: file parsing starts with a Perfect Hash Function preprocessing. Simply put, the flat file gets tokenized, and then each token gets replaced by an integer uniquely identifying this piece of string. Not only does this save a tremendous amount of string object instantiation, but afterward, all the complex parsing operations, such as parsing a date, can be performed only once per distinct token, even if the token is encountered hundreds of times in the file. Performance-wise, it works because flat files tend to be very denormalized and very redundant.
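To make the pattern concrete, here is a minimal Java sketch of the idea, not the actual Lokad.FlatFiles code or API. All names (`TokenInterningSketch`, `TokenTable`, `tokenize`, `parseDateColumn`) are hypothetical, and the sketch assumes a tab-separated file with ISO dates (yyyy-MM-dd). It uses a plain `HashMap` to give each distinct token a dense integer id, and it still allocates strings while splitting, so it only illustrates the "replace tokens by integers, then parse each distinct token once" part, not the allocation savings.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TokenInterningSketch {
    /** Maps each distinct token to a dense integer id (0, 1, 2, ...). */
    static final class TokenTable {
        private final Map<String, Integer> ids = new HashMap<>();
        private final List<String> tokens = new ArrayList<>();

        int idOf(String token) {
            Integer id = ids.get(token);
            if (id == null) {
                id = tokens.size();
                ids.put(token, id);
                tokens.add(token);
            }
            return id;
        }

        String tokenOf(int id) { return tokens.get(id); }

        int size() { return tokens.size(); }
    }

    /** Replaces every cell of a tab-separated file by its token id. */
    static int[][] tokenize(String content, TokenTable table) {
        String[] lines = content.split("\n");
        int[][] cells = new int[lines.length][];
        for (int i = 0; i < lines.length; i++) {
            String[] parts = lines[i].split("\t", -1);
            cells[i] = new int[parts.length];
            for (int j = 0; j < parts.length; j++) {
                cells[i][j] = table.idOf(parts[j]);
            }
        }
        return cells;
    }

    /** Parses a date column, calling LocalDate.parse once per distinct token. */
    static LocalDate[] parseDateColumn(int[][] cells, int column, TokenTable table) {
        LocalDate[] parsedById = new LocalDate[table.size()];
        LocalDate[] result = new LocalDate[cells.length];
        for (int i = 0; i < cells.length; i++) {
            int id = cells[i][column];
            if (parsedById[id] == null) {
                // Assumption: dates are in ISO format (yyyy-MM-dd).
                parsedById[id] = LocalDate.parse(table.tokenOf(id));
            }
            result[i] = parsedById[id];
        }
        return result;
    }

    public static void main(String[] args) {
        String content = "2015-02-16\tA\t3\n2015-02-16\tB\t5\n2015-02-17\tA\t3";
        TokenTable table = new TokenTable();
        int[][] cells = tokenize(content, table);
        LocalDate[] dates = parseDateColumn(cells, 0, table);
        System.out.println(table.size() + " distinct tokens"); // 6 distinct tokens
        System.out.println(dates[0] == dates[1]); // true: same token, parsed once
    }
}
```

The more redundant the file, the fewer distinct tokens there are compared to cells, and the fewer times the expensive parsing call runs.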

We have released a tiny open source package codenamed Lokad.FlatFiles for C#/.NET (and a Java version too) under the MIT license. This library takes care of generating the perfect hashes out of a flat file. Our (unfair) benchmarks indicate that we typically reach about 30MB/second on a single CPU. Then, when the subsequent parsing operations take advantage of the token hashing, the speed-up is so massive that this initial perfect hashing tends to completely dominate the total CPU cost - so we stay at roughly 30MB/second.

Saturday
May 15, 2010

Really Simple Monitoring

Moving toward cloud computing relieves us of (most) hardware downtime worries, yet cloud computing is no magic pill that guarantees that every single one of our apps is ready to serve users as expected.

You need a monitoring system to achieve this. In particular, OS uptime and simple HTTP responsiveness only scratch the surface as far as monitoring is concerned.

In order to go beyond plain uptime monitoring, Lokad has started a new Windows Azure open source project named Lokad.Monitoring. The project comes with several components:

  • A monitoring philosophy,
  • An XML format, Really Simple Monitoring (shamelessly inspired by RSS),
  • A web client for Windows Azure.

The beta version is already in production. Check the project introduction page.

Thursday
Jul 30, 2009

Lokad.Cloud - alpha version released

One of the major little-known weaknesses of cloud computing is development productivity. Indeed, developing over the cloud ain't easy, and as complexity grows, the management of a complex, fully-distributed app may become a nightmare. At Lokad, as we started migrating a fairly complex technology, we felt that we needed strong patterns and practices - tailored for the cloud - so that we don't get lost half-way in the migration process.

That's how Lokad.Cloud was born.

In short, Lokad.Cloud is a framework that can be used to rationalize and speed up the development of back-end apps over Windows Azure. Read more in the announcement made directly on the Windows Azure Forums.

Monday
Jul 28, 2008

Migrating from OnTime to Trac, a short review

I have been a long-time user of the project tracker OnTime provided by Axosoft. Yet, at Lokad, we have just migrated to Trac, an open source project tracker.

Although OnTime is a good product, there are quite a few elements definitely in favor of Trac:

  • low ceremony: Trac has no advanced workflow, no 10-field bug entry forms, no team reporting dashboard - but it just works. When it comes to web apps, less is more. If you can pinpoint a bug in one sentence, then filling in a 6-step bug replication form is just a waste of time.

  • pretty URLs: that one is very often neglected by ASP.NET developers. It's really nice to be able to copy a URL such as http://foo.com/trac/ticket/17 into a mail, a wiki or even to bookmark it. Every single view in Trac has its own URL ready to be shared. In this respect, I have felt that the AJAX upgrade of OnTime, one year ago, was a downgrade from the usability viewpoint, because with AJAX, you lose both URLs and the ability to hit "back" in your web browser.

  • emphasizing usability and not coolness: when I select an item in Trac, I get the complete view of the item in a simple webpage. Agreed, the page design is not super elegant, but scrolling up and down is a mechanical feature of my mouse, and it happens to be really efficient - especially compared to the tiny AJAX tabs of OnTime.

  • SVN integration: Trac lets you browse the SVN source, and SVN commits can be associated with Trac tickets. That one feature is a killer.

Disclaimer: OnTime is probably meant to be used through the Visual Studio add-in, yet, for some reason, I never managed to convince myself to actually install the add-in, and I stuck to the hosted edition of OnTime. Thus, the comparison might not be entirely fair.

Thursday
Jul 5, 2007

More ScrewTurn tips (feed2js plugin)

ScrewTurn is used as a CMS for Lokad.com. Along with the main website www.lokad.com, we also have blog.lokad.com and forums.lokad.com, which both provide RSS feeds. For a long time, I have been looking for a simple way to include RSS snippets directly in the web pages, without finding any satisfying solution. You can have a look at the right sidebar of the Lokad Products page to see what I mean by "Feed Snippet".

Feed2js for RSS snippet inclusion


For one month, I have been using Feed2js.org, which provides a JavaScript-based solution to include RSS snippets in your webpages. The Feed2js approach has one single major advantage: simplicity. Just cut-and-paste a sample of JavaScript that calls Feed2js.org and you're done. Unfortunately, this approach has two major drawbacks. First, it roughly doubles the latency to complete the webpage retrieval. With Lokad.com, the overhead delay was sufficient to be noticed even on a DSL connection. Second, it potentially exposes your website to cross-site scripting attacks (I am only saying potentially because Feed2js.org has proved to be very reliable in my experience).

Feed Snippet Plugin for ScrewTurn


For those reasons, I have settled on an improved (yet home-made) solution for ScrewTurn: a custom RSS feed snippet plugin. This piece of code relies on the ScrewTurn plugin framework. It retrieves the RSS feed on the server side and then uses the ASP.NET caching mechanism to keep a copy for 1h before re-downloading the RSS feed. With this plugin, you can include an RSS feed snippet in your ScrewTurn page with a single line:

<feed itemCount="4" dateFormat="yyyy-MMM-dd">http://myfeedurl</feed>

The attribute itemCount indicates the maximum number of RSS items to be displayed in the web page. The attribute dateFormat is the .NET DateTime format string used to display the post publication date.
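For readers who want to see the server-side caching idea in isolation, here is a minimal sketch, written in Java rather than ASP.NET and not taken from the plugin itself. The class name `FeedCacheSketch`, the injected `fetcher` (standing in for the HTTP download) and the injected `clock` are all assumptions made for illustration. The cache keeps each feed body for a fixed duration and only downloads again once that duration has elapsed.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Supplier;

class FeedCacheSketch {
    /** A downloaded feed body together with its download time. */
    private static final class Entry {
        final String body;
        final Instant fetchedAt;

        Entry(String body, Instant fetchedAt) {
            this.body = body;
            this.fetchedAt = fetchedAt;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final Duration timeToLive;
    private final Function<String, String> fetcher; // hypothetical: URL -> feed XML
    private final Supplier<Instant> clock;          // injected so expiry can be simulated

    FeedCacheSketch(Duration timeToLive, Function<String, String> fetcher, Supplier<Instant> clock) {
        this.timeToLive = timeToLive;
        this.fetcher = fetcher;
        this.clock = clock;
    }

    /** Returns the cached feed body, re-downloading it once the entry has expired. */
    String get(String url) {
        Instant now = clock.get();
        Entry entry = entries.get(url);
        boolean expired = entry == null
                || Duration.between(entry.fetchedAt, now).compareTo(timeToLive) >= 0;
        if (expired) {
            entry = new Entry(fetcher.apply(url), now);
            entries.put(url, entry);
        }
        return entry.body;
    }

    public static void main(String[] args) {
        int[] downloads = {0};
        Instant[] now = {Instant.parse("2007-07-05T10:00:00Z")};
        FeedCacheSketch cache = new FeedCacheSketch(
                Duration.ofHours(1),
                url -> "<rss>download #" + (++downloads[0]) + "</rss>", // fake download
                () -> now[0]);

        cache.get("http://myfeedurl");
        cache.get("http://myfeedurl");
        System.out.println(downloads[0]); // 1: the second call was served from the cache

        now[0] = now[0].plus(Duration.ofMinutes(61));
        cache.get("http://myfeedurl");
        System.out.println(downloads[0]); // 2: the entry expired after one hour
    }
}
```

Compared with the JavaScript inclusion approach, the visitor's browser never contacts the feed host, and the web server itself only does so about once per hour per feed.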

Download: FeedSnippetPlugin.cs.zip