I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.


Thinking Big Data for commerce

As of October 2013, Google Trends indicates that the buzz around Big Data is still growing. Based on my observations of many service companies (mostly retailers though), I believe that it's not all hype, and that Big Data is indeed going to deeply transform those businesses. However, I also believe that Big Data is wildly misunderstood by most, and that most Big Data vendors should not be trusted.

Search volume on the term "Big Data" as reported by Google Trends

A mechanization of the mind based on data

Big Data, like many technological viewpoints, is both quite new and very ancient. It's a movement deeply rooted in the perspective of the mechanization of the human mind, which started decades ago. The Big Data viewpoint states that data can be used as the source of knowledge to produce automated decisions.

This viewpoint is very different from Business Intelligence or Data Mining, because humans get kicked out of the loop altogether. Indeed, producing millions of numbers a day costs almost nothing; however, if you need people to read those numbers, the costs are fantastic. Big Data is mechanization: the numbers produced are decisions, and nobody is needed to interfere at the lowest level to get things done.

The archetype of the Big Data application is the spam filter. First, it's ambient software, taking important decisions all the time: deciding for me which message is not worth my time to read certainly is an important decision. Second, it has been built based on the analysis of a lot of data, mostly by establishing databases of messages labeled as spam or not-spam. Finally, it requires almost zero contribution from its end-user to deliver its benefits.

Not every decision is eligible for Big Data

Decisions that are eligible for Big Data processing are everywhere:

  • Choosing the next most profitable prospect to send the paper catalog.
  • Choosing the quantity of goods to replenish from the supplier.
  • Choosing the price of an item in a given store.

Yet, such a decision needs two key ingredients:

  1. A large number of very similar decisions are taken all the time.
  2. Relevant data exist to bootstrap an automated decision process.

Without No1, it's not cost-efficient to even tackle the problem from a Big Data viewpoint; Business Intelligence is better suited. Without No2, it's only rule-based automation.

Choosing the problem first

The worst mistake that a company can make when starting a Big Data project consists of choosing the solution before choosing the problem. As of fall 2013, Hadoop is the worst offender. I am not saying that there is anything wrong with Hadoop per se; however, choosing Hadoop without even checking that it's a relevant solution is a costly mistake. My own experience with service companies (commerce, hospitality, healthcare ...) indicates that, for those companies, you hardly ever need any kind of distributed framework.

The one thing you need to start a Big Data project is a business problem that would be highly profitable to solve:

  • Suffering from too high or too low stock levels.
  • Not offering the right deal to the right person.
  • Not having the right price for the right location.

The ROI is driven by the manpower saved by the automation, and by getting better decisions. Better decisions can be achieved because the machine can be made smarter than a human with less than 5s of brain-time per individual decision. Indeed, in service companies, productivity targets imply that decisions have to be made fast.

Your average supermarket typically has about 20,000 references, that is, about 330 hours of work if employees spend 1 minute per reordered quantity. In practice, when reorders are made manually, employees can only afford a few seconds per reference.
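The arithmetic behind those figures can be checked with a short sketch. The 20,000 references and the 1 minute per reference come from the paragraph above; the 5-second case reuses the "less than 5s of brain-time" figure quoted earlier, and the function name is only illustrative.

```python
REFERENCES = 20_000  # average supermarket assortment, as quoted in the text


def reorder_hours(seconds_per_reference: float) -> float:
    """Total hours needed to review every reference once."""
    return REFERENCES * seconds_per_reference / 3600


one_minute = reorder_hours(60)   # careful review: 1 minute per reference
five_seconds = reorder_hours(5)  # what manual reordering can afford in practice

print(f"{one_minute:.0f} hours at 1 minute per reference")
print(f"{five_seconds:.0f} hours at 5 seconds per reference")
```

The gap between the two numbers is the room a machine has to make a better decision than a rushed employee.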

Employees are not going to like Big Data

Big Data is a mechanization process. And it's usually easy to spot the real thing just by looking at the internal changes caused by the project. Big Data is not BI (Business Intelligence), where employees/managers are offered a new gizmo that changes nothing to the status quo. The first effect of a Big Data project is typically to reduce the number of employees (sometimes massively) required to process a certain type of decision.

Don't expect teams soon-to-be-replaced-by-machines to be overjoyed by such a perspective. At best, upper management can expect passive resistance from their organization. Across dozens of companies, I don't think I have observed, among service companies, any Big Data project succeeding without direct involvement from the CEO herself. That's unfortunately the steepest cost of Big Data.

The roots of Big Data

The concept of Big Data only emerged in mainstream media in 2012, but its roots are much older. Big Data results from 3 vectors of innovation:

  • Better computing infrastructures to move data around, to process data, to store data. The latest instance of this trend is cloud computing.
  • Algorithms and statistics. Over the last 15 years, the domain of statistical learning has exploded (driverless cars are a testimonial of that).
  • Enterprise digitalization. Most (large) companies have completed their digitalization process where each key business operation has its counterpart digital record.

Big Data is first a result of the mix of those 3 ingredients.

Big Data = Big Budget?

Enterprise vendors are now chanting their new motto big data = big budget, and you can't afford not to take the Big Data train, right? Hadoop, SAP Hana, Oracle Exalytics, to name a few, are all going to cost you a small fortune when looking at TCO (total cost of ownership).

For example, when I ask my clients how much it costs to store 24TB of data on disk, most of the answers come in above 10,000€ per month. Well, OVH (a hosting company) is offering 24TB servers from 110€ / month. Granted, this is not highly redundant storage, but at this price, you can afford a few spares.

Then, the situation looks even more absurd when one realizes that 1 year of receipts of Walmart - the largest retailer world-wide - can fit on a USB key. Unless you want to process images or videos, very few datasets cannot be made to fit on a USB key.

Here, there is a media-induced problem: there are about 50 companies world-wide that have web-scale computing requirements, such as Google, Microsoft, Facebook, Apple and Amazon, to name a few. Most Big Data frameworks originate from those companies: Hadoop comes from Yahoo, Cassandra comes from Facebook, Storm is now Twitter's, etc. However, all those companies have in common that they process roughly 1000x more data than the largest retailers.

The primary cost of a Big Data project is the focus that top management needs to invest. Indeed, focus comes with a strong opportunity cost: while the CEO is busy thinking about how to transform her company with Big Data, fewer decisions can be made on other pressing matters.

Iterations and productivity

A big data solution does not survive its first contact with data.

Big Data is a very iterative process. First, the qualification of data (see below) is an extremely iterative process. Second, tuning the logic to obtain acceptable results (the quality of the decisions) is also iterative. Third, tuning the performance of the system is, again, iterative. Expecting a success at first try is heading for failure, unless you tackle a commoditized problem, like spam filtering, with a commoditized solution, like Akismet.

Since many iterations are unavoidable, the cost of the Big Data project is strongly impacted by the productivity of the people executing the project. In my experience teaching distributed computing at the Computer Science Department of the Ecole normale supérieure, if the data can be kept on a regular 1000€ workstation, the productivity of any developer is tenfold higher compared to a situation where the data has to be distributed - no matter how many frameworks and toolkits are thrown at the problem.

That's why it's critical to keep things lean and simple for as long as possible. Doing so gives your company an edge against all your competitors who will tar pit themselves with Big Stuff.

Premature optimization is the root of all evil. Donald Knuth. 1974

Qualifying the data is hard

The most widely underestimated challenge in Big Data is the qualification of the data. Enterprise data has never been created with the intent to feed statistical decision-making processes. Data exists only as the by-product of the software operating the company. Thus, enterprise data is (nearly) always full of subtle artifacts that need to be carefully addressed or mitigated.

The primary purpose of a point of sale is to let people pay; NOT to produce historical sales records. If a barcode has become unreadable, then many cashiers might just scan twice another item that happens to have the same price as the non-scannable one. Such a practice is horrifying from a data analysis perspective, but looking at the primary goal (i.e. letting people pay), it's not unreasonable business-wise.

One of the most frequent antipatterns observed in Big Data projects is the lack of involvement of the non-IT teams with all the technicalities involved. This is a critical mistake. Non-IT teams need to tackle hands-on data problems because solutions only come from a deep understanding of the business.


Big Data is too important to be discarded as an IT problem; it's first and foremost a business modernization challenge.


A buyer’s guide for enterprise software

Through a series of Big Data consulting missions that overlapped with the entire IT landscape, data being all over the place, I have observed the software purchasing processes of many large companies. Being also an enterprise software vendor myself, I have been baffled countless times by broken buying processes that lead smart people to routinely choose about the worst price-quality ratio that the market has to offer.

In this post, I am trying to gather a survival kit for buying enterprise software. By enterprise, I refer to companies with over 1,000 employees.

Get rid of Requests for Quotes (RFQ)

The rationale goes like this: let’s write down all our requirements, then identify all suitable vendors, send them the RFQ, collect the quotes, review demos, and finally select the best option. Fairly reasonable, and yet, as far as software is concerned, deeply dysfunctional.

Why? Because writing an accurate specification for the software you need is harder than writing the software itself. It’s somehow the Heisenberg uncertainty principle transposed to software. Just try to write an RFQ for a webmail (aka GMail) if you’re not convinced. Specifying the fine print would already take hundreds of pages, and that’s only a webmail. Enterprise software tends to be a lot more complex than webmail.

As anecdotal evidence, writing and managing an RFQ is so time consuming and complex that many companies throw consultants at the problem, increasing costs even further.

Then, when the whole process is in place, guess what happens? All decent vendors walk away. Indeed, if you have a decent product that sells by itself (think Microsoft Excel), why would you bother paying an army of account managers to walk through broken RFQs? In the end, the only vendors left are either the ones so outrageously expensive that they can afford whatever it takes, or the ones selling crappy products that would never get sold without the active contribution of dysfunctional (yet widespread) processes.

My advice: Take your favorite search engine and form your own opinion. It’s easier than you think, and in less than 3h, you're likely to already have a convincing shortlist of vendors. A few tips:

  • If all you find about a piece of software is happy talk, the software is worthless or vaporware.
  • If you can’t find screenshots of the software, then it means that the user interface, and probably the user experience as well, is abysmal.
  • If the web documentation is awful or absent, then whatever private documentation exists, it won’t be better.
  • If there is no public pricing, then, you will face somebody who’s professionally trained in reverse engineering the exact budget you have. That's going to be a time sink.
  • If the company does open source, then bonus points. They are not afraid that other people might have a look at the software code they produce.

Customizing Off-the-shelf Software is deadly

Joel Spolsky stated that it takes 10 years to write good software, and looking at the development curve of my own company I tend to agree. Well, it took us 5 years to realize that we were not even solving the correct problem (but this will be the subject for another post).

Software is (mostly) a take it or leave it business. Yes, you can request tiny adjustments, but asking for anything substantial is like swimming in molasses, mostly because of hidden costs. Generic upgrades won't work, support staff will be incompetent, design balance will be thrown out of whack, etc. As a result, good software companies, where people truly care and dedicate a good portion of their lifespan to carefully crafting truly valuable products, will actually decline such requests.

In contrast, vendors who will gladly accept customization requests are the ones putting little value on the integrity of their product, which has already degenerated into some byzantine architecture through the acceptance of disparate customization requests over time.

My advice: choose your battles wisely. If a feature gives you an edge against the competition then internalize it and treat it as a core asset. Otherwise, it’s easier and cheaper to adjust your own organization to whatever software reflects the dominant practice - as long as the new practice offers a tangible improvement over the old one.

Bargaining over the price is against your interest

Bill Gates said that you don’t get what you deserve: you get what you negotiate. True enough, but you should wisely choose what you want to negotiate. Negotiating over the software pricing is enormously expensive for the vending party. It takes a highly talented workforce, with employees both technically savvy and yet having all the soft skills it takes to interact with large organizations. In most Western countries, the yearly total cost of a seasoned account manager is above $150k.

When you start bargaining with software companies, you also start an adverse selection process. Only the companies willing to afford an expensive sales force remain while the others walk away. The vendors who stay are the ones whose business model is geared toward an ever increasing sales force. However, you want to put your money on a vendor that invests the bulk of her revenue in developing good software, not in funding an army of people who specialize in reverse engineering budgets.

My advice: negotiating the price downward is usually a dead end. If it's out of budget, then discard the vendor and move to the next one.

Instead of focusing on price, you should try to capture as much attention as you can from the core development team. Since you are an enterprise (hence a large prospect), you already have an edge here. For example, you can negotiate a small case study against a series of meetings with whoever is in charge of the core product development.

By making sure that the people in charge of the product are familiar with your business, you are vastly improving the odds that future developments will be aligned with your company's needs. Furthermore, you won’t even need to fund or think about those enhancements; the people you’ve met will do this free of charge, because it’s a natural thing engineers do when learning about clients' problems.

More is less and stay clear of platforms

The inevitable corollary of RFQ is that the more advertised features the better. Furthermore, the impression is amplified by the vendors themselves who promote an ongoing stream of new features to sustain a form of recurring business (even for SaaS).

In the context of enterprise software, more features raise a very specific problem: soon enough supposedly distinct products start to overlap.

  • Both CRM and CMS (Content Management System) want to capture web leads.
  • Both the inventory management system and the accounting system want to manage suppliers.
  • Both BI and Web Analytics want to analyze the sales channels.
  • ...

Overlapping is deadly because data flows within the IT landscape start to look like the Tokyo subway map. Indeed, each time, on both ends of the overlap, divisions want to use their software, and consequently IT gets forced into moving the data around, struggling with inconsistent domain models.

Platforms are the worst offender here, because platforms are bound to overlap with practically everything else in the company. Worse, platforms create rampant dependencies making it quasi-impossible to get rid of the platform vendor later on.

My advice: favor highly focused apps over jack-of-all-trades apps. If you get it wrong, it will fail fast, and you will have ample possibility to try again. Most enterprises incorrectly think that managing many vendors is a problem, hence favoring Big Systems. However, dealing with hundreds of apps is quasi-painless - well, for a large company - as long as apps remain decoupled, and as long as you’re not bargaining with an army of vendors. In contrast, it only takes two overlapping platforms to create an IT integration nightmare.

Time is of the essence

In the world of enterprise software, vendor lock-in is all over the place; and yet, all enterprises I met had supposedly everything in place to avoid those lock-ins: contractual reversibility (to be able to revert back to the previous system), contractual migration support (to be able to move forward to a new system), favorable termination clauses, etc. On the surface, extensive safeguards were in place against any vendor lock-in.

How could the reality be so different from the theory? It’s because time is the strongest and most widely underestimated vendor lock-in mechanism. What’s the point of being able to terminate a contract any day if it takes years to de-entangle operations from the vendor? As a rule of thumb, phasing out an enterprise vendor takes about 3x as long as phasing the vendor in; and if it’s a platform, then you can reasonably consider your company locked on the platform forever.

Indeed, don’t expect vendors to be overly motivated by the prospect of having their products phased out. At best, the legacy vendor will be slow but ultimately responsive to whatever problems necessarily arise during the transition. But that's the best case ...

My advice: Internet was a revolution. SaaS was a revolution. Cloud computing was a revolution. Mobile internet was a revolution. Software is a fast-paced industry, even more than consumer electronics. Would you buy a tablet and consider it an asset to be amortized over the next decade? Certainly not, and it’s the same for enterprise software. No matter what software you are considering, if you can’t roll out in a matter of weeks, then move on to the next vendor. Otherwise, by the time your company is done with its acquisition, the software industry will have moved to its next revolution, and you will be left with a freshly obsolete technology.

Your data is your DNA, take ownership

There is one area where COTS (commercial off-the-shelf software) works poorly for large enterprises: the orchestration of the data. By orchestration of the data, I am not referring to databases, ESB (Enterprise Service Bus) or equivalent low-level generic data systems; I am referring to the layer on top that unifies the IT landscape. This layer typically involves a mix of people and some middleware to carry on with the changes.

Indeed, COTS necessarily carry strong domain models, that is, the abstract software representation adopted to model the business itself. However, there is (almost) not a single chance that any of those predesigned models would fit a large company made of mergers, acquisitions, restructurings and possibly somewhat heterogeneous branches. Some software, notoriously SAP, can be made to fit the domain of practically any large company, but there are so many developments involved that it hardly counts as off-the-shelf software.

My advice: if there is one area where every large company should have a small team of software developers, it’s its own private data platform: an entity dedicated to the collection and the servicing of all the data generated within the IT landscape. Here I strongly favor in-house software developers, because business data is always core business, no matter what your company is doing.

Furthermore, for most enterprises, a small tiger team of software developers in charge of the data would typically vastly reduce IT spending compared to vast teams of recycled IT workforce. Indeed, great control over your data grants you the capability to swiftly phase software in and out, and speed is decisive. In practice, it does not take many people, but it takes talented people.


8 tips to turn your Big Data into Small Data

Hectic times. Looking at the last entry, I realize it has been half a year already since my last post.

The more Big Data projects I do, the more I realize how irrelevant scalability aspects usually are for business projects, to the point that the quasi-totality of the valuable data crunching processes could actually be run on a smartphone if the proper approaches are taken. Obviously, there is no point in actually doing the analysis on a smartphone; this merely illustrates that it really does not take much computational power.

While all vendors boast of being able to crunch terabytes of data, it turns out that it's very rare to even face a dataset bigger than 100MB when properly represented in memory. The catch is that between a fine-tuned data representation and a verbose representation - say XML or SQL - there is typically a factor of 100x to 1000x involved as far as the data footprint is concerned.

The simplest way to deal with Big Data is to turn it into Small Data. Let's review a couple of handy tricks frequently used at Lokad to compress data.

1. Get rid of everything that is not required

While this might seem obvious, whenever we tackle a Big Data problem, we typically start by ditching the roughly 90% of the data that is not even required for the task at hand. Frequently, this covers unused fields and segments of the data that can be safely excluded from the analysis.

2. Turn dates into 16-bit integers when the time is not needed

A date-time is represented as an 8-byte data structure in most languages. Yet, a single unsigned 16-bit integer gives you 65,536 combinations, that is, enough to cover 179 years of daily increments, which usually proves sufficient. That's a 4x memory saving.
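As a minimal sketch of this trick, assuming Python and an arbitrarily chosen reference date (the `EPOCH` constant and the helper names below are illustrative, not something Lokad prescribes), dates can be stored as day offsets in a packed array of unsigned 16-bit integers:

```python
from array import array
from datetime import date, timedelta

# Assumption: an arbitrary reference date picked for this sketch. Any date
# works as long as all values fall within 65,536 days (about 179 years) of it.
EPOCH = date(2000, 1, 1)


def encode_day(d: date) -> int:
    """Number of days since EPOCH; rejects dates that do not fit in 16 bits."""
    offset = (d - EPOCH).days
    if not 0 <= offset <= 0xFFFF:
        raise ValueError(f"{d} does not fit in 16 bits relative to {EPOCH}")
    return offset


def decode_day(offset: int) -> date:
    """Inverse of encode_day."""
    return EPOCH + timedelta(days=offset)


dates = [date(2013, 10, 1), date(2013, 10, 2), date(2014, 1, 15)]

# Typecode "H" is an unsigned short: 2 bytes per date on common platforms.
packed = array("H", (encode_day(d) for d in dates))
restored = [decode_day(offset) for offset in packed]
```

The range check matters: a date outside the window would otherwise be silently unrepresentable, so the sketch fails loudly instead.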

3. Turn 8-byte floating point values into 4-byte or even 2-byte values

Whenever money is involved, businesses rely on 8-byte or even 16-byte floating point values. However, from a statistical viewpoint, such precision typically makes little sense; it's like computing everything in grams, only to round the final result up to the next ton. The 2-byte precision, aka the half-precision floating point format, is sufficient to accurately represent the price of most consumer goods for example. That's a 4x memory saving.
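A small sketch, assuming Python's standard `struct` module and a handful of made-up prices, shows the footprint difference and the rounding error that comes with it:

```python
import struct

prices = [0.99, 4.50, 19.99, 249.00]  # hypothetical sample prices

# Format "d" is an 8-byte double, "e" is a 2-byte IEEE 754 half-precision float.
as_doubles = struct.pack(f"<{len(prices)}d", *prices)
as_halves = struct.pack(f"<{len(prices)}e", *prices)

restored = struct.unpack(f"<{len(prices)}e", as_halves)

# Relative rounding error introduced by the 2-byte representation.
relative_errors = [abs(r - p) / p for p, r in zip(prices, restored)]

print(len(as_doubles), "bytes as doubles,", len(as_halves), "bytes as halves")
print("worst relative error:", max(relative_errors))
```

Half precision keeps roughly three significant decimal digits, so whether it is accurate enough depends on the price range and on what the downstream statistics need; the 4-byte format is the safer middle ground when in doubt.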

4. Replace strings with keys and lookup tables

Lookup tables are an extremely simple and fast data structure. Depending on the situation, you can typically use lookups to replace fields that contain strings with many repeated occurrences. Your mileage may vary (YMMV), but lookups, when applicable, frequently bring a 10x memory saving.
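One possible way to sketch this in Python is shown below. The function name and the sample store names are hypothetical, and the 16-bit key width is an assumption that only holds for fewer than 65,536 distinct strings.

```python
from array import array


def dictionary_encode(values):
    """Replace repeated strings with small integer keys plus a lookup table."""
    lookup = []        # key -> string
    index = {}         # string -> key, only needed while encoding
    keys = array("H")  # assumption: fewer than 65,536 distinct strings
    for value in values:
        key = index.get(value)
        if key is None:
            key = len(lookup)
            index[value] = key
            lookup.append(value)
        keys.append(key)  # raises OverflowError if the key exceeds 16 bits
    return keys, lookup


stores = ["Paris-Nord", "Lyon-Centre", "Paris-Nord", "Paris-Nord", "Lyon-Centre"]
keys, lookup = dictionary_encode(stores)
decoded = [lookup[k] for k in keys]
```

Each row now costs 2 bytes regardless of the string length, and the strings themselves are stored only once.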

5. Get rid of objects, use value types instead

Objects (as in C# objects or Java objects) are very handy, but unfortunately, they come with a significant memory overhead, typically 16 bytes per object in 64-bit environments, that is, the default situation nowadays. To avoid this overhead, you need to use value types (aka struct, unfortunately not available in Java). Value types usually bring a 2x memory saving.

6. Use plain arrays not "smart" collections

Most modern languages emphasize collections such as dynamic arrays; however, those collections are far from being as memory-efficient as plain old arrays. YMMV but arrays over collections frequently bring a 2x memory saving.
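The text refers to C# and Java, but the same effect can be observed in Python, which is used here purely as an illustration. The sketch below compares a list of float objects with a packed `array` of raw doubles. The exact ratio depends on the runtime, so no figure is claimed.

```python
import sys
from array import array

values = [float(i) for i in range(100_000)]

# A list holds one pointer per element, and each float is a separate object
# with its own header.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# A typed array stores raw 8-byte doubles contiguously, with no per-item object.
packed = array("d", values)
array_bytes = sys.getsizeof(packed)

print(f"list of objects: {list_bytes:,} bytes")
print(f"packed array:    {array_bytes:,} bytes")
```

The packed array combines both tips: no per-object overhead, and no collection bookkeeping beyond a single contiguous buffer.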

7. Use variable length encoding

Variable-length encoding is a simple compression pattern favoring small values over large values. This technique is especially useful when the original dataset is preprocessed to reassign the identifiers based on their usage frequency, i.e. allocating integers by decreasing frequency. YMMV depending on the actual distribution of identifiers in the dataset, but this typically grants a 4x memory saving.
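As a sketch, here is one common variable-length scheme (7 data bits per byte, with the high bit marking a continuation), together with a frequency-based reassignment of identifiers. The text does not say which encoding Lokad uses, so both the scheme and the helper names are assumptions.

```python
from collections import Counter


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(data: bytes) -> list:
    """Decode a concatenation of varints back into integers."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def rank_by_frequency(identifiers):
    """Map each identifier to an integer, most frequent first (0, 1, 2, ...)."""
    counts = Counter(identifiers)
    return {ident: rank for rank, (ident, _) in enumerate(counts.most_common())}


skus = ["sku-9", "sku-1", "sku-9", "sku-7", "sku-9", "sku-1"]
ranks = rank_by_frequency(skus)
stream = b"".join(encode_varint(ranks[s]) for s in skus)
decoded = decode_varints(stream)
```

Because the most frequent identifiers receive the smallest integers, most entries end up taking a single byte.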

8. Vectorize listings when possible

Many data represented as listings in their original relational representation can be vectorized somehow. For example, if I am interested in the analysis of the return frequency of a web visitor over the last 6 months on a given website, a bit array of 184 bits (aka 23 bytes) can already provide boolean flags of visits for any given day over the last 6 months. When applicable, this typically grants a 10x memory saving.
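The visitor example can be sketched as follows. The 184-day window comes from the text; the window start date, the function names and the sample visits are hypothetical.

```python
from datetime import date

WINDOW_DAYS = 184  # about 6 months, as in the example above


def visits_to_bits(visit_dates, window_start: date) -> bytes:
    """Pack one boolean flag per day of the window into 23 bytes."""
    bits = bytearray((WINDOW_DAYS + 7) // 8)
    for d in visit_dates:
        offset = (d - window_start).days
        if 0 <= offset < WINDOW_DAYS:  # visits outside the window are ignored
            bits[offset // 8] |= 1 << (offset % 8)
    return bytes(bits)


def visited(bits: bytes, window_start: date, d: date) -> bool:
    """True if the visitor came on day d."""
    offset = (d - window_start).days
    if not 0 <= offset < WINDOW_DAYS:
        return False
    return bool(bits[offset // 8] & (1 << (offset % 8)))


start = date(2013, 4, 1)  # hypothetical start of the 6-month window
visits = [date(2013, 4, 1), date(2013, 4, 9), date(2013, 9, 30), date(2013, 4, 9)]
flags = visits_to_bits(visits, start)
days_with_visit = sum(bin(byte).count("1") for byte in flags)
```

Repeated visits on the same day collapse into a single flag, which is fine for a return-frequency analysis but loses the visit count per day.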


Big Data: choosing the problem before choosing the solution

My company has started several important big data missions, and I am taking the opportunity here to publish some insights that are relevant to all those initiatives.

A major (and frequent) pitfall of Big Data projects consists of starting with a solution instead of starting with a problem. In particular, software vendors (Lokad included) are pushing their own Big Data recipes, which will randomly involve:

  • Hadoop
  • HBase
  • Amazon EC2
  • Cassandra
  • Windows Azure
  • Storm
  • Node.js
  • ...

However, the notion of "Big" data is very relative: cheap 1TB hard-drives are now available at your nearest supermarket, and very few problems faced by companies, even very large ones, require more than 100 GB of data to process.

Usually, even the largest data sources of the largest companies do fit on a smartphone when properly represented. 

Impedance mismatch of BIG frameworks

The performance achieved by well-known Big Data frameworks is mind-blowing: Facebook claims to process 100PB of data over Hadoop. That's massive, and massively impressive as well.

However, before jumping on Hadoop (or any similar Big Data framework), one has to really estimate the friction costs involved. While Hadoop is certainly simpler than, say, MPI, it's still a complicated distributed framework which requires a lot of skill to be properly and efficiently operated.

If the very same goal can be achieved on a single machine within a very acceptable timeframe, then, in my experience the dumb solution is going to be about 100x cheaper (*) and easier to run and to maintain compared to the "distributed" variant.

(*) I am not referring to hardware costs, but to wetware costs (aka people), which represent 99% of the cost anyway for virtually every company, minus a few social networks and search engines.

The untold story about Hadoop (and its peers) is that it works if, and only if, the data is very meticulously organized to be made suitable for processing through the framework. If the data is incorrectly partitioned, then Hadoop plus thousands of servers are no faster than a single machine.

Enterprise Big Data starts at 100MB

Facebook is facing petabytes of data, that's millions of gigabytes, but is your company really facing that much data? Do you need to plug in that much data to solve the problem at hand? Unless you work for a short list of about 100 companies on Earth, I seriously doubt it.

I observe that for most enterprises, "Big Data" starts at 100MB when:

  • Excel is no longer a solution.
  • SQL is no longer a solution (*).

(*) Yes, you can have a lot more than 100MB in a SQL database. However, reading the entire dataset through SQL needs to be done with care to avoid re-scanning the data thousands of times. In practice, in 90% of the data crunching situations, I observe that it's easier to remove the SQL database than to improve the performance of the queries over the relational database.

Facing the problems

Thus, whenever data is involved, the initiative should start by facing the problems that are the true roadblocks to delivering a "solution". Those problems are typically:

  • Collecting and servicing the data: About every single company I visit has problems collecting and servicing the data. The most obvious symptom is typically the lack of documentation concerning the data itself, and of all the nitty-gritty insights needed to make anything of it. No technology is going to solve that problem, only people and process.
  • Choosing the metrics to be optimized: There are so many parts of the business that could be improved through a smart exploitation of the data that it is extremely tempting to think that some (hype) technology might be THE answer to everything. This is not going to happen. Solving a problem through data is tough, and without metrics, you don't even know for sure that you're moving in the right direction. Frequently, defining the metric - that is, the problem to be solved - is harder than implementing the solution.

Thus, before jumping to the next cool vendor solution, I urge you to start by facing the very uncool aspects of the problem. Frequently, the "solution" consists of removing an ingredient of the previous solution.


A few tips for Big Data projects

At Lokad, we are routinely working on Big Data projects, primarily for retail, but with occasional missions in energy or biotech companies. Big Data is probably going to remain one of the big buzzwords of 2012, along with a big trail of failed projects. A while ago, I was offering tips for Web API design; today, let's cover some Big Data lessons (learned the hard way, as always).

1. Small Data trump Big Data

There is one area that captures most of the community interest: web data (pages, clicks, images). Yet, the web-scale, where you have to deal with petabytes of data, is completely unlike 99% of the real-world problems faced by about every other vertical besides consumer internet.

For example, at Lokad, we have found that the largest datasets found in retail could still be processed on a smartphone if the data is correctly represented. In short, for the overwhelming majority of problems, the relevant data, once properly partitioned, take less than 1GB.

With datasets smaller than 1GB, you can keep experimenting on your laptop. Map-reducing stuff on the cloud is cool, but compared to local experiments on your notebook, cloud productivity is abysmal.

2. Smarter problems trump smarter solutions

Good developers love finding good solutions. Yet, when facing a Big Data problem, it's just too tempting to improve stuff, as opposed to challenging the problem in the first place.

For example at Lokad, as far as inventory optimization was concerned, we have been putting years of effort into solving the wrong problem. Worse, our competitors have been spending hundreds of man-years of effort making the same mistake ...

Big Data means being capable of processing large quantities of data while keeping computing resource costs negligible. Yet, most problems faced in the real world have been defined more than 3 decades ago, at a time when any calculation (no matter how trivial) was a challenge to automate. Thus, those problems come with a strong bias toward solutions that were conceivable at the time.

Rethinking those problems is long overdue.

3. Being non-intrusive is scalability-critical

The scarcest resource of all is human time. Letting a CPU chew 1 million numbers is nothing. Having people read 1 million numbers takes an army of clerks.

I have already posted that manpower requirements of Big Data solutions were the most frequent scalability bottleneck. Now, I believe that if any human has to read numbers from a Big Data solution, then the solution won't scale. Period.

Like AntiSpam filters, Big Data solutions need to tackle problems from an angle that does not require any attention from anyone. In practice, it means that problems have to be engineered in a way so that they can be solved without user attention. 

4. Too big for Excel, treat as Big Data

While the community is frequently distracted by multi-terabyte datasets, anything that does not conveniently fit in Excel is Big Data as far as practicalities go:

  • Nobody is going to have a look at that many numbers.
  • Opportunities exist to solve a better problem.
  • Any non-quasi-linear algorithm will fail at processing data in a reasonable amount of time.
  • If data is poorly architected / formatted, even sequential reading becomes a pain.

Then comes the question: how should one handle Big Data? However, the answer is typically very domain-specific, so I will leave that to a later post.

5. SQL is not part of the solution

I won't enter (here) the debate SQL vs NoSQL, instead let's outline that whatever persistence approach is adopted, it won't help: 

  • figuring out if the problem is the proper one to be addressed,
  • assessing the usefulness of the analysis performed on the data,
  • blending Big Data outputs into user experience.

Most of the discussions around Big Data end up distracted by persistence strategies. Persistence is a very solvable problem, so engineers love to think about it. Yet, in Big Data, it's the wicked parts of the problem that need the most attention.
