I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.


When numerical precision can hurt you

The objective was to cure a very deadly disease and the drug was tested on mice. The results were impressive since 33% of the mice survived while only 33% died (the last mouse escaped and its outcome was unknown).

Numerical precision depends on the underlying number type. In .NET, there are three choices: float (32 bits), double (64 bits) and decimal (128 bits). Performance aside, more precision cannot hurt, right?

My answer is: it depends. If the only purpose of your number is to be processed by a machine, then fine, more precision never hurts. But what if a user is supposed to read that number? I actually encountered this issue while working on a project of mine, Re-Dox, reduced design of experiments (an online analytical software). In terms of usability, providing the maximal numerical precision to the user is definitely a very poor idea. Does adding twelve digits to the result of 10/3 = 3.333333333333 make it more readable? Definitely not.

A very interesting issue when designing analytical software (i.e. software performing some kind of data analysis) is choosing the right number of digits. Smart rounding can be defined as an approach that seeks to provide all significant, but only significant, digits to the user. However, the notion of "significant" digits is very dependent on the context and carries a lot of uncertainty. Therefore, for the software designer, smart rounding is more likely to be a tradeoff between usability and user requirements.

Providing general rules for smart rounding is hard. But here are the two heuristics that I am using. Both of them rely on user inputs to define the level of precision required. Key insight: since it's usually not possible to know the accuracy requirements beforehand, the only reliable source of information is the actual user inputs.

Heuristic 1 = the number of digits in your outputs must not exceed the number of digits of the user inputs by more than 1 or 2. Ex: if the user inputs 0.123, then provide a 4 or 5 digit rounding. Caution: do not take the user inputs "as such", because they can include a lot of dummy digits (ex: the user can cut and paste values that look like 10.0000, where the trailing digits are zeros and implicitly not significant). The underlying idea is "no algorithm ever creates any information, an algorithm only transforms the information".
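A minimal sketch of heuristic 1 in C#. The class and method names (SmartRounding, SignificantDigits, RoundToSignificant) are made up for this illustration, and the input is assumed to be a plain decimal string using "." as separator, without exponent. Trailing zeros after the decimal point are treated as dummy digits; trailing zeros in the integer part are kept, which is a judgment call.

```csharp
using System;

public static class SmartRounding
{
    // Counts the digits the user actually typed, ignoring the sign,
    // leading zeros and trailing zeros after the decimal point.
    public static int SignificantDigits(string userInput)
    {
        string s = userInput.Trim().TrimStart('+', '-');
        if (s.Contains("."))
        {
            s = s.TrimEnd('0');          // "10.0000" -> "10."
        }
        s = s.Replace(".", "").TrimStart('0');
        return Math.Max(1, s.Length);    // never report less than one digit
    }

    // Rounds a value to the given number of significant digits.
    public static double RoundToSignificant(double value, int digits)
    {
        if (value == 0.0 || double.IsNaN(value) || double.IsInfinity(value))
        {
            return value;
        }
        // Number of digits before the decimal point (can be zero or negative).
        int magnitude = (int)Math.Floor(Math.Log10(Math.Abs(value))) + 1;
        double scale = Math.Pow(10, digits - magnitude);
        return Math.Round(value * scale) / scale;
    }
}
```

With an input of 0.123 (3 digits) and one extra digit allowed, 10/3 would be displayed as RoundToSignificant(10.0 / 3.0, 4), that is 3.333 rather than 3.333333333333.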

Heuristic 2 = increase the number of digits of heuristic 1 by a number equal to Ceiling(log10(N)/2), where N is the number of data inputs. Actually, this formula is simply an interpretation of the Central Limit Theorem for the purpose of smart rounding. Why the need for such a bizarre heuristic? The underlying idea is slightly more complicated here. Basically, no matter how you combine the data inputs, the rate of accuracy improvement is bounded. The bound provided here corresponds (somehow) to an "optimistic" approach where the accuracy increases at the maximal possible speed.
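One way to read the formula: when N independent inputs are averaged, the error shrinks at best like 1/sqrt(N), which amounts to log10(sqrt(N)) = log10(N)/2 extra decimal digits. The sketch below computes that bonus; the names DigitBonus and ExtraDigits are hypothetical. It uses an integer loop instead of Math.Log10 so that exact powers of ten (N = 100, N = 10000) are not affected by floating-point rounding.

```csharp
public static class DigitBonus
{
    // Returns Ceiling(log10(n) / 2), i.e. the smallest k such that 100^k >= n.
    public static int ExtraDigits(int n)
    {
        if (n < 1)
        {
            throw new System.ArgumentOutOfRangeException("n", "At least one input is required.");
        }
        int k = 0;
        long bound = 1;              // bound == 100^k
        while (bound < n)
        {
            bound *= 100;
            k++;
        }
        return k;
    }
}
```

Under this reading, 100 inputs justify one extra digit and 10000 inputs two, so the bonus grows very slowly with the size of the data set.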


Refactoring and logistics ("L'intendance suivra!")

With Eclipse and VS2005, refactoring is now a standard feature of modern IDEs. A few minutes are now enough to drastically change the internal structure of a software library. Yet, if software logistics cannot keep pace, the productivity bottlenecks of software evolution remain unchanged. De Gaulle said L'intendance suivra! (which could be poorly translated as "Logistics always keeps up!"). Yet many European wars have been lost due to poor logistics, and, back to the discussion, I believe that logistics is no less important in software matters than it is in wars.

By software logistics I mean all the processes required to keep the whole thing running when changing the structure of a component and, in particular, the issues related to upstream and downstream dependencies (deployment issues are left to a later post).

Upstream dependencies include all the components that you are relying on. Well, changing your code does not impact the components you are relying on, right? Wrong. Just consider serialization as a simple counter-example (your code evolves and leaves all existing data unreadable). Upstream issues are not serialization-specific: the very same problem exists with structured database storage (yes, versioning SQL queries is a pain too). More generally, I am not aware of any refactoring tool providing any support to tackle data persistence issues. Version tolerance features do exist in .NET 2.0, but the framework is still lacking very simple features such as field renaming. I am really looking forward to the tools that will (one day) provide persistence-aware refactoring capabilities.
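To illustrate the version tolerance features mentioned above, here is a small sketch using the attributes introduced in .NET 2.0. The Customer type and its fields are invented for the example; only OptionalField and OnDeserializing come from the framework.

```csharp
using System;
using System.Runtime.Serialization;

// Version 2 of a hypothetical persisted type. Version 1 only had Name.
[Serializable]
public class Customer
{
    public string Name;

    // Added in version 2: data written by version 1 does not contain it.
    // [OptionalField] tells the formatter not to fail when it is missing.
    [OptionalField(VersionAdded = 2)]
    public string Country;

    // Called before the fields are populated, so old data gets a default.
    [OnDeserializing]
    private void SetDefaults(StreamingContext context)
    {
        Country = "unknown";
    }
}
```

This covers adding a field, but nothing here helps if Name is later renamed by a refactoring tool: the old data would simply no longer match.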

Downstream dependencies include all the components that rely on yours. It's clear that any change you make (at least to any publicly exposed method/object) can break the downstream dependencies. The don't break anything approach is a killer in terms of software evolution (the most interesting case being don't fix bugs, our customers rely on them). But on the other hand, assuming that L'intendance suivra! ("we don't care, RTFM!") is not realistic. IMO, the present best practice is to provide an awful lot of documentation to facilitate the upgrade (see Upgrading WordPress for a good example). Yet providing such documentation is expensive and the ROI is low (because, contrary to the code that has been improved, such documentation will not be leveraged in future developments). The solution, IMO again, lies in better refactoring tools that include some downstream-awareness. It would be possible to generate some "API signature patch" (the generation being as automated as possible) that clients could apply to their own code. Well, I am also really looking forward to such tools.


Hungarian notation and thread safety

Joel Spolsky wrote a very good article, Making Wrong Code Look Wrong, where he rehabilitates Hungarian notation for certain dedicated purposes (hint: no, Hungarian notation is not about cluttering your code with variable prefixes such as string, int and array). Joel Spolsky presents the idea of prefixing unsafe (i.e. user-provided) strings in web-based applications with us, standing for unsafe string. Such a practice makes a lot of sense in situations where things have to be right by design (security is a typical example, because security holes never show up with typical non-hostile users).
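As a rough illustration of the convention in C#: the Greeting class below is hypothetical, the us prefix marks a raw user string and the s prefix marks a string that has been HTML-encoded. WebUtility.HtmlEncode is the standard encoder from System.Net (available since .NET 4).

```csharp
using System.Net;

public static class Greeting
{
    // usName: unsafe string, straight from the user.
    public static string Render(string usName)
    {
        // sName: safe string, already HTML-encoded.
        string sName = WebUtility.HtmlEncode(usName);

        // Only "s" variables may be written into the page. A line such as
        //   return "<p>Hello, " + usName + "</p>";
        // would look wrong at first glance, which is the whole point.
        return "<p>Hello, " + sName + "</p>";
    }
}
```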

Thread safety (TS) issues

Besides security, thread safety in multithreaded applications is another area that has to be right by design, otherwise you are going to hit Heisenbugs. In .NET/C#, the simplest way to deal with thread safety is to rely on lock statements (some people would object that locking is unsafe by design and that transactions/agents/whatever should be used instead; that might be true, but it is certainly beyond the scope of this post). Actually, I have found that designing multithreaded applications is not that hard, but it requires a strong methodology to prevent things from going wrong by design.

TS level 1 : dedicated locks

A very common beginner mistake in multithreading is to re-use some random object for locking and, worst of all, to lock on the this reference.

    public class Bar {
        public void FooThreadSafe() {
            lock (this) { } // bravo, you just shot yourself in the foot
        }
    }

Although it may look like a 12-byte overhead, a lock requires a dedicated object field in the class.

    public class Bar {
        private readonly object fooLock = new object(); // dedicated lock instance
        public void FooThreadSafe() {
            lock (fooLock) { } // much better
        }
    }

The major advantage of the dedicated lock instance approach is documentation. The documentation attached to the dedicated lock field explains which objects should be protected and when (Intellisense-like IDE features make this approach even more practical). A secondary advantage is encapsulation. The library end-user (i.e. the intern next door) is not aware of the "dedicated lock instance" stuff and might re-use whatever lockYXZ variable is available for his own dark purposes. Don't offer him the chance to do that: mark the field as private.

TS level 2 : hungarian locks

Besides unprotected (concurrent) accesses, the second most important issue in multithreaded applications is deadlocks. Against deadlocks, there is only one known solution (to my knowledge): a complete and permanent ordering of the locks. In other words, locks should always be taken in the same order. You might think, "OK, let's just add a line in the dedicated lock object documentation to specify that." Well, I have tried, and it just does not work. Over time, your application evolves, the intern next door doesn't care about your documentation, new locks are added to fit some other requirements, and the initial complete ordering is no more (although it should be, but developers are just so-not-failproof humans). The problem with those deadlocks is that you have no way to detect them just by looking at a piece of code. You have to look at the whole application code to ensure consistency. Here comes Hungarian notation to the rescue: the Hungarian dedicated lock naming convention.

The locks should be taken in the order specified by the Hungarian prefixes.

Let's look at a small example. If you have two dedicated lock instances called lock1Bar and lock2Foo (lockX being the Hungarian prefix), then you know that lock1Bar should be taken before lock2Foo. Any exception to this rule is a guaranteed-or-reimbursed deadlock. Additionally, refactoring tools make it so easy to rename all your variables as many times as you need that there is really no practical obstacle to implementing such a policy.
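A short sketch of the convention, re-using the lock1Bar and lock2Foo names from the example. The Account class, its fields and its methods are invented for the illustration.

```csharp
public class Account
{
    // The numeric prefix encodes the global acquisition order.
    private readonly object lock1Bar = new object();
    private readonly object lock2Foo = new object();

    private int bar;
    private int foo;

    public void UpdateBoth()
    {
        lock (lock1Bar)         // lock1 first...
        {
            lock (lock2Foo)     // ...then lock2: prefixes only increase when nesting
            {
                bar++;
                foo++;
            }
        }
    }

    public int ReadFoo()
    {
        // Taking a single lock never violates the ordering.
        lock (lock2Foo)
        {
            return foo;
        }
    }
}
```

A method that nested lock (lock2Foo) around lock (lock1Bar) would look wrong locally, because the prefixes decrease, without any need to read the rest of the application.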
