Author

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.


Entries in bigdata (4)

Sunday, Dec 17, 2017

Terabyte blocks for Bitcoin Cash

Update: the cost estimates below have been significantly improved since the original publication of this entry - by a rough factor of 20. See the talk at Satoshi's Vision in March 2018.

Terabyte blocks are feasible both technically and economically; they would allow over 50 transactions per human on earth per day for a cost of less than 1/10th of a cent of USD. This analysis assumes no further decrease in hardware costs and no further software breakthrough, only assembling existing, proven technologies.

Introduction

As pointed out in the original Bitcoin whitepaper, achieving very large blocks requires taking advantage of Moore's Law rather than being stuck with fixed-capacity devices. A terabyte block is a block of 1e12 bytes, which can contain about 4 billion Bitcoin transactions. Assuming a worldwide population of 10 billion humans, terabyte blocks offer about 50 transactions per human per day (57 actually, but the extra numerical precision is not significant).
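
As a quick sanity check, here is the arithmetic behind those figures as a minimal Python sketch; the 250 bytes per transaction and the 10 billion humans are the assumptions used throughout this post.

```python
# Back-of-the-envelope capacity of terabyte blocks (rough assumptions).
BLOCK_BYTES = 1e12          # 1 TB block
TX_BYTES = 250              # assumed average transaction size
BLOCKS_PER_DAY = 24 * 6     # one block every 10 minutes
HUMANS = 1e10               # assumed world population

tx_per_block = BLOCK_BYTES / TX_BYTES            # ~4 billion transactions
tx_per_day = tx_per_block * BLOCKS_PER_DAY       # ~576 billion transactions
tx_per_human_per_day = tx_per_day / HUMANS       # ~57.6

print(f"{tx_per_block:.1e} transactions per block")
print(f"{tx_per_human_per_day:.1f} transactions per human per day")
```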

50 transactions per day per human appears sufficient to cover all human-driven activities; only a healthy machine-to-machine market would require an even greater number of transactions. Such a market remains hypothetical at the present time and goes beyond the scope of this post.

Bigger blocks are the go-to plan to make the most of the hashing power invested in the Bitcoin network. Indeed, the hashing power provides the same security whether 1MB blocks or 1TB blocks are used, yet in the latter case, each transaction is secured with a million times less energy.

The on-chain scalability challenge is irrelevant for Bitcoin Core, as blocks are capped at 1MB, which ensures no more than half a dozen transactions per second. However, terabyte blocks are relevant for Bitcoin Cash, which could face about 7 million transactions per second while producing terabyte blocks. In the following, for the sake of concision, the term Bitcoin always refers to Bitcoin Cash.

The mining rig detailed below, a combination of existing and proven hardware and software technologies, delivers the data processing capacity needed to process terabyte blocks. The cost associated with this mining rig is also sufficiently low to ensure a healthy decentralized market that includes hundreds of independent miners; arguably a more decentralized market than Bitcoin mining as of today.

For the sake of the scalability analysis, I am excluding the Bitcoin emission revenues, focusing only on the transaction fees and other alternative revenue streams which do not depend on Bitcoin inflation. Naturally, for the next decades, the bulk of the mining revenues is expected to be associated with the emission of Bitcoins rather than transaction fees.

A terabyte block mining rig

The mining rig includes 256 nodes, where each node includes:

  • 1 Intel Xeon Processor E7, 8 cores (USD 1250)
  • 2 Intel Xeon Phi 7210, 64 cores (USD 4000)
  • 1 Intel Optane 4800X 750GB (USD 3400)
  • 2 Samsung 64GB PC4-19200 DDR4 (USD 1400)
  • 2 WD Red 10TB HDD (USD 750)
  • Misc (rack, power, network) (USD 3000)

The prices have been obtained from public sources such as Amazon. Totaling those 256 nodes gives a price point of 3.5M USD. In addition to the nodes, the rig includes a storage layer of optical storage based on the Freeze-Ray technology of Panasonic. While the pricing of this technology is not publicly advertised, various sources quote 10 USD/TB as the price point for optical storage. This figure also matches the price point of the optical storage cartridges sold by Sony. Facebook, which has deployed Freeze-Ray, claims an 80% reduction in energy consumption compared to HDDs. As current 10TB HDDs have a typical consumption of 5W when active, the Freeze-Ray energy consumption should be about 0.1W per TB. I will be using those two estimates in the following.

In order to cover 20 years' worth of terabyte blocks, the storage layer would require 553 Freeze-Ray rackable units of 1.9PB each, which would represent a cost of 11M USD.

Then, the cost of energy should be accounted for. Each node consumes about 700W according to the nominal consumption of its parts, which gives about 180kW for the 256 nodes. At 0.1W per TB, the storage layer consumes an extra 100kW. Assuming a kWh at 0.1 USD, the yearly energy cost would be 250k USD, totaling 5M USD over 20 years.

Finally, a 50Gbps internet connection is added for a price of 25,000 USD per month, which totals 6M USD over 20 years.
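
The figures above can be tallied with a minimal sketch; the per-unit prices are the ones quoted in this post, and the rounding is mine. The total lands close to the 26M USD figure used below.

```python
# Tally of the mining rig costs quoted above (20-year horizon).
NODES = 256
node_parts_usd = 1250 + 4000 + 3400 + 1400 + 750 + 3000   # per node
nodes_usd = NODES * node_parts_usd                         # ~3.5M USD

blocks_per_year = 365 * 144
storage_tb_20y = blocks_per_year * 20                      # 1 TB per block
optical_usd = storage_tb_20y * 10                          # 10 USD/TB, ~11M USD

node_kw = NODES * 0.7                                      # ~180 kW
optical_kw = storage_tb_20y * 0.1 / 1000                   # 0.1 W/TB, ~105 kW
energy_usd = (node_kw + optical_kw) * 24 * 365 * 20 * 0.1  # 0.1 USD/kWh, ~5M USD

bandwidth_usd = 25_000 * 12 * 20                           # 50 Gbps link, ~6M USD

total_usd = nodes_usd + optical_usd + energy_usd + bandwidth_usd
print(f"total: {total_usd/1e6:.1f}M USD, or {total_usd/20/1e6:.2f}M USD/year amortized")
```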

The cost of the mining equipment per se, i.e. the hardware computing terahashes, is deliberately ignored, because this hardware can be considered as independently funded through the Bitcoin inflation.

Thus, I am considering here a 26M USD investment, to be amortized over 20 years; that is, 1.3M USD/year of funding. At this point, it still needs to be proven that (A) the Bitcoin fee market can sustain such an expensive mining rig and that (B) this mining rig is capable of processing terabyte blocks.

As it does not make sense to build such a rig if the market cannot reasonably fund it, let's start with the financing part.

Financing terablocks

Assuming 250 bytes per transaction, terabyte blocks would deliver about 55 transactions per human per day, assuming a rough 10 billion humans on earth. The exact count of humans is not important, as the cost of the mining rig is essentially linear in the number of transactions, which is itself essentially linear in the number of humans transacting on the blockchain. If there are fewer humans using the blockchain, then the mining rig is linearly cheaper.

If we assume that the same 10 billion humans contribute 1/10th of a cent per day to fund the miners through their transaction fees, then the yearly transaction fees would be 3.65 billion USD. Thus, those yearly transaction fees would cover the amortized cost of over 3650 / 1.3 = 2800 mining rigs. Assuming that miners want to profit beyond the marginal cost of operating a rig, a gross operating margin of 60% would still leave room for over 1000 profitable miners.
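
The same arithmetic as a minimal Python sketch, using the assumptions above:

```python
# Fee market back-of-the-envelope (assumptions from the text).
HUMANS = 1e10
FEE_PER_HUMAN_PER_DAY = 0.001        # 1/10th of a cent, USD
RIG_COST_PER_YEAR = 1.3e6            # amortized cost of one rig, USD

yearly_fees = HUMANS * FEE_PER_HUMAN_PER_DAY * 365   # ~3.65 billion USD
rigs_at_cost = yearly_fees / RIG_COST_PER_YEAR       # ~2800 rigs
rigs_at_60pct_margin = rigs_at_cost * (1 - 0.6)      # ~1100 profitable miners

print(f"{yearly_fees/1e9:.2f}B USD/year in fees")
print(f"{rigs_at_cost:.0f} rigs at cost, {rigs_at_60pct_margin:.0f} at 60% gross margin")
```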

Funding a large number of copies of the blockchain is important to ensure a high degree of decentralization. The analysis carried out so far shows that minimal transaction fees would be sufficient to fund over a hundred competing yet very profitable miners. However, this analysis ignores all the economic value that can be generated by holding a copy of the blockchain data for purposes other than validating transactions.

Let's assume that a wallet app, which displays an ad along with the Bitcoin balance, could earn about 1 USD per 100,000 views for a simple non-intrusive banner. Assuming that the same humans would check their balance once a week on such a service, we are considering a 5 billion USD market through advertising revenues alone; this single use case would by itself fund hundreds of additional copies of the terabyte blockchain.

Then, assuming that the upper bound for the monetization of a terabyte blockchain is 0.5 USD per user per year is conservative. In 2016, Google extracted about 7 USD per user per year, while Facebook extracted about 16 USD per user per year. If Bitcoin reaches 1TB per block, a large portion of the world economy will be running on top of this blockchain, offering numerous monetization opportunities.

I fail to see why, collectively, the market would not manage to extract at least 5 USD per user per year on average through blockchain related services. At this point, we are entering the realm of profitably funding a thousand more copies of the terabyte blockchain.

Scaling the terabyte blockchain

Some data processing problems are intrinsically difficult to spread over multiple computers (like machine learning), or are even designed to be prohibitively difficult (like breaking encryption). Bitcoin is neither: it is embarrassingly parallel, the easiest and most straightforward kind of problem to address with distributed systems.

The scalability challenges faced by Bitcoin are:

  1. Propagating transactions
  2. Validating transactions
  3. Building and broadcasting blocks

Let's review each one of those challenges.

Scaling the transaction propagation

Propagating transactions is the easiest part: it merely requires bandwidth. As Bloom filters, or even better filters, can be used, the P2P propagation of 1TB worth of transactions needs less than 3TB of bandwidth per miner every 10 min (assuming that each miner resends every transaction twice for fast propagation through the network). Indeed, miners transmit the filters first, which are vastly more compact, and transfer the actual transactions only when they are actually requested.

A direct calculation gives a minimal requirement of 45 Gbps to operate. The mining rig has 50 Gbps, which is sufficient to reach a sustained throughput of 1TB blocks while aggressively relaying transactions.
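
A rough sketch of that calculation, under the 3TB per 10 minutes assumption stated above; the extra margin up to 45 Gbps accounts for filter and protocol overhead.

```python
# Bandwidth requirement for relaying 1 TB of transactions every 10 minutes,
# assuming each transaction is resent about twice (3 TB of traffic total).
BYTES_PER_10MIN = 3e12
seconds = 10 * 60

gbps_needed = BYTES_PER_10MIN * 8 / seconds / 1e9   # ~40 Gbps of raw payload
print(f"~{gbps_needed:.0f} Gbps sustained, before protocol and filter overhead")
```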

Scaling the cryptographic validation

Validating the correctness of a Bitcoin transaction is a two-fold process. First, the cryptographic correctness of the transaction must be validated: the miner must verify that the transaction has been properly signed by the sender of the funds. Second, the economic correctness of the transaction must be validated: the miner must verify that the originating address contains enough funds to cover the transaction. In this section, I am focusing on the first part of this challenge, the cryptographic correctness.

Based on [1], I assume a 2ms CPU cost per transaction on a regular 2GHz x86 CPU. At 250 bytes per transaction, a 1TB block every 10 minutes represents 6.7 million transactions per second. With 2ms of CPU per transaction, we need 13,400 cores to perform the concurrent validation. The mining rig contains 256 * 2 * 64 = 32,768 cores through the Intel Xeon Phi boards. The mining rig is largely sufficient to keep up with the transaction validation. The rig even has spare capacity to catch up with a delayed validation, which could, for example, occur in case of a local network outage. As transactions can be trivially partitioned against a fast hash, achieving a linear scaling of the cryptographic validation is straightforward.
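
The same budget as a minimal sketch, using the 2ms-per-transaction assumption:

```python
# Cryptographic validation budget (assumptions from the text).
TX_PER_BLOCK = 1e12 / 250            # ~4 billion transactions per 1 TB block
TX_PER_SECOND = TX_PER_BLOCK / 600   # ~6.7 million tx/s
CPU_SECONDS_PER_TX = 0.002           # 2 ms of CPU per signature check

cores_needed = TX_PER_SECOND * CPU_SECONDS_PER_TX    # ~13,300, rounded up to 13,400 above
cores_available = 256 * 2 * 64                       # Xeon Phi cores in the rig

print(f"{cores_needed:.0f} cores needed, {cores_available} cores available")
```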

Scaling the economic validation

As pointed out above, in order to validate a transaction, the miner must also check the balance of the Bitcoin addresses in order to ensure that a transaction does not end up creating Bitcoins out of thin air. In the present implementation of Bitcoin, this validation is performed through a software component known as the UTXO database, the database of unspent transaction outputs.

Terabyte blocks represent 7 million transactions per second. An optimized implementation only requires 2 reads and 2 writes per transaction to the persistent UTXO storage:

  • First read: check whether the transaction is even legit.
  • First write: if the transaction is legit, the address is marked as dirty with the funds removed.
  • Second read: if the transaction makes its way into the next block (produced by another miner), another check is performed to recheck correctness.
  • Second write: if the foreign block is correct, update the balance of the address.

Thus, the miner needs a sustained throughput of 4 × 7 = 28 million IOPS. As every Intel Optane card offers 550,000 IOPS, the mining rig delivers a collective 140 million IOPS, largely sufficient to sustain the throughput associated with 1TB blocks. Moreover, the rig also has spare capacity to catch up after an outage.
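
Again, the arithmetic as a minimal sketch:

```python
# UTXO storage throughput (assumptions from the text).
TX_PER_SECOND = 7e6        # ~7 million tx/s for 1 TB blocks
IOS_PER_TX = 4             # 2 reads + 2 writes per transaction

iops_needed = TX_PER_SECOND * IOS_PER_TX      # 28 million IOPS
iops_available = 256 * 550_000                # Intel Optane cards, ~140M IOPS

print(f"{iops_needed/1e6:.0f}M IOPS needed, {iops_available/1e6:.0f}M IOPS available")
```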

Once again, sharding transactions against a fast hash is trivial; thus, implementing a Cassandra-like UTXO database is straightforward. Using Cassandra, Netflix had already benchmarked 1 million writes per second back in 2011, while Intel Optane delivers more than 50x the IOPS available from SSDs back in 2011. Thus, there is no doubt that a specialized database could scale to 28M IOPS and more.

Then, beyond the IOPS, the miner also needs enough storage to hold the UTXO database. A compact binary encoding of the UTXO database requires:

  • 1 byte for flags (up to 8)
  • 3 bytes for the block height
  • 20 bytes for the Bitcoin address
  • 4 bytes for the "clean" amount in Satoshis (*)
  • 4 bytes for the "dirty" amount in Satoshis (*)

(*) There are only 21 million Bitcoins, and each Bitcoin contains only 100 million Satoshis. Thus, the number of Bitcoin addresses that can contain over 4 billion (2^32) Satoshis (40 bitcoins) is limited to 550,000 addresses or so. This number of "super-rich" addresses is very small, and thus would be special-cased in order to let the rest of the UTXO database benefit from a more compact encoding. In total, 32 bytes are needed per entry in the UTXO database.

With 256 nodes equipped with 750GB Intel Optane, there is enough storage for 6e12 hot addresses, that is, 600 addresses per user considering 1e10 humans. Then, the HDDs of the nodes, which provide over 20TB of additional storage per node, could be used to increase the number of hot addresses to 6000 per human, while keeping more than half of the original storage capacity to spare for other needs.
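
A minimal sketch of this 32-byte entry layout and the resulting capacity; the field order and helper name are mine, for illustration only, not an actual Bitcoin Cash implementation.

```python
import struct

# Pack one UTXO entry into the 32-byte layout described above:
# 1 byte of flags, 3 bytes of block height, 20 bytes of address,
# 4 bytes of "clean" satoshis, 4 bytes of "dirty" satoshis.
def pack_utxo_entry(flags: int, height: int, address: bytes,
                    clean_sat: int, dirty_sat: int) -> bytes:
    assert len(address) == 20 and height < 2**24
    return (struct.pack("<B", flags)
            + height.to_bytes(3, "little")
            + address
            + struct.pack("<II", clean_sat, dirty_sat))

entry = pack_utxo_entry(0b1, 500_000, bytes(20), 1_000_000, 0)
assert len(entry) == 32

# Capacity check: 256 nodes * 750 GB of Optane, 32 bytes per entry.
entries = 256 * 750e9 / 32
print(f"{entries:.1e} hot entries, ~{entries/1e10:.0f} per human")
```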

In practice, both modern HDDs and the Intel Optane perform 4KB block reads and 4KB block writes at the hardware level (beware, block reads and block writes should not be confused with blockchain blocks). Thus, the most efficient strategy when writing would be to read a storage block, which contains 4096/32 = 128 entries, and to evict the oldest entry according to the blockchain block height.
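
As an illustration, the write path could look like the following sketch; the helper name and the exact eviction policy are hypothetical.

```python
# Illustrative write path: read a 4 KB storage block (128 entries of 32 bytes),
# insert the new entry, and evict the entry with the lowest block height.
ENTRY_SIZE = 32
BLOCK_SIZE = 4096

def insert_with_eviction(storage_block: bytearray, new_entry: bytes) -> bytes:
    entries = [bytes(storage_block[i:i + ENTRY_SIZE])
               for i in range(0, BLOCK_SIZE, ENTRY_SIZE)]
    # The block height lives in bytes 1..3 of each entry (see layout above).
    oldest = min(range(len(entries)),
                 key=lambda i: int.from_bytes(entries[i][1:4], "little"))
    evicted = entries[oldest]
    storage_block[oldest * ENTRY_SIZE:(oldest + 1) * ENTRY_SIZE] = new_entry
    return evicted  # pushed down to the HDD / optical tier
```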

Beyond those hot addresses, the miner leverages its slower optical storage layer, which contains checkpointed copies of the full UTXO database. As it would take more than 100 days for all users to collectively touch more than their 6000 "hot" addresses, the full snapshots of the UTXO database can be taken rather infrequently, probably about once per month, the final tuning depending on the precise hardware specification.

Updating the UTXO database once a new block is found is also a non-issue. The mining rig has 32TB of RAM available, and this RAM can be used to keep the latest blocks in memory while those blocks are being gradually written to the UTXO database. In particular, the amount of RAM is sufficient to cover even the rarest situations where a dozen blocks end up being orphaned.

Scaling the block propagation

Once a miner has found a target hash, there is a strong incentive to broadcast the corresponding block quickly; otherwise, another miner might win the mining race by broadcasting its own alternative block faster in the meantime. However, by the time a block is found, the bulk of its content, the transactions, is already known to the other miners. Thus, the only information that needs to be transferred is a compact filter which points out the exact set of transactions included in the block.

This mechanism is leveraged by Graphene, which reduces the amount of data that needs to be broadcast when a new block is found to a fraction of the original block size. Graphene demonstrates a compression factor of 186, which would bring down a 1TB block to 5.5GB. As the mining rig has a 50Gbps network connection, it would take less than 1 second to transfer the full payload to a second miner, triggering an exponential cascade of broadcasts. However, it would be inefficient for the receiving miner to wait for the full payload to be received; the cascade of broadcasts would usually start from the first "chunk" received. The Graphene payload would be split into smaller chunks of, say, 100MB.
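
The corresponding arithmetic, as a minimal sketch:

```python
# Block propagation with Graphene-style compression (figures from the text).
BLOCK_BYTES = 1e12
COMPRESSION = 186                     # compression factor reported for Graphene
LINK_GBPS = 50
CHUNK_BYTES = 100e6

payload = BLOCK_BYTES / COMPRESSION                  # ~5.4 GB to broadcast
transfer_s = payload * 8 / (LINK_GBPS * 1e9)         # < 1 second peer to peer
chunks = payload / CHUNK_BYTES                       # ~54 chunks of 100 MB

print(f"{payload/1e9:.1f} GB payload, {transfer_s:.2f} s per hop, ~{chunks:.0f} chunks")
```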

Indeed, the economic interest of the miners is to always work on the latest block. Thus, if a miner claims to have found a new valid block and the latest, say, 100 claims made by the same miner all proved to be correct, then it is a profitable assumption to put limited trust in this miner and immediately start the cascade of broadcasts. Breaching this trust would not earn the miner anything, as its peers would still reject the faulty block within a minute. Worse, the misbehaving miner would immediately lose its hard-earned reputation, slowing down the propagation of its own future blocks for tens of blocks, as the other miners would opt for full prior validation. In practice, such a miner would most likely have to re-earn the trust of its peers by mining dozens of reduced blocks (faster to transmit), forfeiting most of the transaction fees to the benefit of its peers.

Through an early broadcast, and assuming that the Bitcoin network comprises miners with similar or superior internet bandwidth, the full broadcast of the 5.5GB to 10,000 miners is straightforward to achieve in 10 seconds or so, assuming that each miner starts propagating the fresh data upon reception of the first chunk, which would happen in less than 200ms no matter the distance between two miners on earth.

Conclusions

At this point, we have seen that a rig costing 1.3M USD a year in amortized costs is sufficient to support terabyte blocks. However, my hardware and bandwidth cost assumptions are wildly pessimistic. It will take at least 5 years from now for the Bitcoin ecosystem to reach the point where terabyte blocks are needed (onboarding mankind just takes time). Within 5 years from now, the hardware costs will have diminished - a lot.

Since the publication of the original Bitcoin paper 8 years ago, practically every cost quoted in this document has been reduced by a factor greater than 10. The cost of long-term data storage is already anticipated to be divided by 3 by 2020. The bandwidth cost is also expected to decrease by 30% per year for the coming years.

Then, I am not accounting for any additional software improvements. Flexible Transactions and Schnorr signatures could reduce the transaction size by more than 20%. Pruning the blockchain itself could probably halve the amount of storage actually needed.

Thus, within 5 years, it is conservative to assume that the amortized cost will only be 1/3 of my present estimate, with a conservative mix of cheaper hardware and more efficient software. At this point, we would be reaching 400k USD/year for a rig capable of processing all the transactions that mankind will ever need (maybe not all the transactions that machines will ever need, but that's a different scenario altogether).

For the average individual, 400k USD/year may feel like a huge amount of money, yet from a business perspective, this is a modest amount. In Paris, many well-placed boutiques pay more than that for rent alone. A small consultancy firm of 50 consultants, still in Paris, also pays over 400k USD/year for its offices. Opening an IKEA store is considered a typical 50M USD investment, roughly twice as much as the mining rig presently considered. The investment cost associated with a small 10-turbine wind farm would also exceed the cost of such a mining rig.

While it is true that this cost represents an entry barrier, mining has been a highly specialized business with high entry barriers for years already. Impotent miners, nodes which do not mine blocks, do not add security to the network. The only option to further decentralize Bitcoin is not to wish for a downsizing of miners, but to organize a massive expansion of the mining pie, which will comparatively shrink every miner.

Tuesday, Jun 04, 2013

8 tips to turn your Big Data into Small Data

Hectic times. Looking at the last entry, I realize it has been half a year already since my last post.

The more Big Data projects I do, the more I realize how irrelevant scalability aspects usually are for business projects, to the point that the quasi-totality of valuable data crunching processes could actually be run on a smartphone if the proper approaches were taken. Obviously, there is no point in actually doing the analysis on a smartphone; this merely illustrates that it really does not take much computational power.

While all vendors boast about being able to crunch terabytes of data, it turns out that it's very rare to even face a dataset bigger than 100MB when properly represented in memory. The catch is that between a fine-tuned data representation and a verbose representation - say XML or SQL - there is typically a factor of 100x to 1000x involved as far as the data footprint is concerned.

The simplest way to deal with Big Data is to turn it into Small Data. Let's review a couple of handy tricks frequently used at Lokad to compress data.

1. Get rid of everything that is not required

While this might seem obvious, whenever we tackle a Big Data project, we typically start by ditching about 90% of the data that is not even required for the task at hand. Frequently, this covers unused fields and segments of the data that can be safely excluded from the analysis.

2. Turn dates into 16-bit integers when the time is not needed

A date-time is represented as an 8-byte data structure in most languages. Yet, a single unsigned 16-bit integer gives you 65536 combinations, that is, enough to cover 179 years of daily increments, which usually proves sufficient. That's a 4x memory saving.
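
A minimal Python sketch of the idea; the epoch choice below is arbitrary and the helper names are mine.

```python
from datetime import date, timedelta
import numpy as np

EPOCH = date(2000, 1, 1)   # arbitrary epoch; 65536 days cover ~179 years

def date_to_u16(d: date) -> np.uint16:
    return np.uint16((d - EPOCH).days)       # 2 bytes instead of 8+

def u16_to_date(v: int) -> date:
    return EPOCH + timedelta(days=int(v))

# A million dates stored as uint16 take ~2 MB instead of ~8 MB.
days = np.full(1_000_000, date_to_u16(date(2013, 6, 4)), dtype=np.uint16)
print(days.nbytes)   # 2,000,000 bytes
```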

3. Turn 8-byte floating point values into 4-byte or even 2-byte values

Whenever money is involved, businesses rely on 8-byte or even 16-byte floating point values. However, from a statistical viewpoint, such precision typically makes little sense; it's like computing everything in grams only to round the final result up to the nearest ton. The 2-byte precision, aka the half-precision floating point format, is sufficient to accurately represent the price of most consumer goods, for example. That's a 4x memory saving.
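
For example, with NumPy (a sketch; the sample prices are made up):

```python
import numpy as np

prices = np.array([19.99, 4.49, 1249.00], dtype=np.float64)   # 8 bytes each
half = prices.astype(np.float16)                               # 2 bytes each

print(half)                          # roughly [19.98, 4.49, 1249.], fine for consumer prices
print(prices.nbytes, half.nbytes)    # 24 vs 6 bytes: a 4x saving
```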

4. Replace strings by keys with lookup tables

Lookup tables are extremely simple and fast data structures. Depending on the situation, you can typically use lookups to replace fields that contain strings with many repeated occurrences. Your mileage may vary (YMMV), but lookups, when applicable, frequently bring a 10x memory saving.
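
A minimal sketch with toy data:

```python
# Replace a repetitive string column by small integer keys plus one lookup table.
categories = ["shoes", "shoes", "dresses", "shoes", "hats", "dresses"]  # toy data

lookup = {}                       # string -> key
keys = []                         # compact column of integer keys
for c in categories:
    keys.append(lookup.setdefault(c, len(lookup)))

reverse = {v: k for k, v in lookup.items()}   # key -> string, for decoding
print(keys)                                   # [0, 0, 1, 0, 2, 1]
print(reverse[keys[2]])                       # 'dresses'
```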

5. Get rid of objects, use value types instead

Objects (as in C# objects or Java objects) are very handy, but unfortunately, they come with a significant memory overhead, typically of 16 bytes per object when working in a 64-bit environment, which is the default situation nowadays. To avoid this overhead, you need to use value types (aka struct, unfortunately not available in Java). Value types usually bring a 2x memory saving.
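
Python does not have C#-style structs either, but the same idea can be illustrated with a NumPy structured dtype, which packs records as raw values without per-object headers (a sketch; the field layout is made up):

```python
import numpy as np

# A structured dtype stores records as packed values, with no per-object header,
# much like a C# struct; a list of Python objects would box every record.
order_dtype = np.dtype([("order_id", np.uint32),
                        ("quantity", np.uint16),
                        ("unit_price", np.float16)])   # 8 bytes per record

orders = np.zeros(1_000_000, dtype=order_dtype)
print(orders.nbytes)   # 8,000,000 bytes, versus tens of MB for boxed objects
```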

6. Use plain arrays not "smart" collections

Most modern languages emphasize collections such as dynamic arrays; however, those collections are far from being as memory-efficient as plain old arrays. YMMV, but arrays over collections frequently bring a 2x memory saving.
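
In Python, the contrast is between a list of boxed integers and the `array` module (a sketch; exact sizes vary by interpreter):

```python
import sys
from array import array

values = list(range(1_000_000))            # boxed Python ints behind pointers
packed = array("i", range(1_000_000))      # plain C array of 4-byte ints (on CPython)

print(sys.getsizeof(values))                       # ~8 MB of pointers alone, plus the ints
print(packed.buffer_info()[1] * packed.itemsize)   # ~4 MB of raw values
```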

7. Use variable length encoding

Variable length encoding is a simple compression pattern that favors small values over large values. This technique is especially useful when the original dataset is preprocessed to reassign identifiers based on their usage frequency, i.e. allocating integers by decreasing frequency. YMMV depending on the actual distribution of identifiers in the dataset, but this typically grants a 4x memory saving.
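
A minimal LEB128-style sketch:

```python
# Variable length encoding: small integers take 1 byte, larger ones take more;
# identifiers remapped to small values by decreasing frequency compress well.
def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # continuation bit
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    result, shift = 0, 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7

assert decode_varint(encode_varint(300)) == 300
print(len(encode_varint(42)), len(encode_varint(300_000)))   # 1 byte vs 3 bytes
```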

8. Vectorize listings when possible

Many data represented as listings in their original relational representation can be vectorized somehow. For example, if I am interested in analyzing the return frequency of a web visitor over the last 6 months on a given website, a bit array of 184 bits (aka 23 bytes) already provides boolean flags of visits for any given day over the last 6 months. When applicable, this typically grants a 10x memory saving.
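
A minimal sketch of such a bit array (the class name is mine):

```python
# 184 days of visit flags packed into a 23-byte bit array (one bit per day).
DAYS = 184

class VisitFlags:
    def __init__(self):
        self.bits = bytearray(DAYS // 8)          # 23 bytes

    def mark(self, day_index: int):               # day_index: 0 = oldest day
        self.bits[day_index // 8] |= 1 << (day_index % 8)

    def visited(self, day_index: int) -> bool:
        return bool(self.bits[day_index // 8] & (1 << (day_index % 8)))

v = VisitFlags()
v.mark(0)
v.mark(180)
print(v.visited(180), v.visited(7), len(v.bits))   # True False 23
```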

Wednesday, Oct 03, 2012

Big Data: choosing the problem before choosing the solution

My company has started several important big data missions, and I am taking the opportunity here to publish some insights that are relevant to all those initiatives.

A major (and frequent) pitfall of Big Data projects consists of starting with a solution instead of starting with a problem. In particular, software vendors (Lokad included) are pushing their own Big Data recipes, which will randomly involve:

  • Hadoop
  • SAP HANA
  • HBase
  • Amazon EC2
  • Cassandra
  • Windows Azure
  • Storm
  • Node.js
  • ...

However, the notion of "Big" data is very relative: cheap 1TB hard drives are now available at your nearest supermarket, and very, very few problems faced by companies, even very large ones, require more than 100 GB of data to process.

Usually, even the largest data sources of the largest companies do fit on a smartphone when properly represented. 

Impedance mismatch of BIG frameworks

The performance achieved by well-known Big Data frameworks is mind-blowing: Facebook claims to process 100PB of data over Hadoop. That's massive, and massively impressive as well.

However, before jumping on Hadoop (or any similar Big Data framework), one has to really estimate the friction costs involved. While Hadoop is certainly simpler than, say, MPI, it's still a complicated distributed framework which requires a lot of skill to be properly and efficiently operated.

If the very same goal can be achieved on a single machine within a very acceptable timeframe, then, in my experience, the dumb solution is going to be about 100x cheaper (*) and easier to run and maintain compared to the "distributed" variant.

(*) I am not referring to hardware costs, but to wetware costs (aka people), which represent 99% of the cost anyway for virtually every company, minus a few social networks and search engines.

The untold story about Hadoop (and its peers) is that it works if, and only if, the data is very meticulously organized to be made suitable for processing through the framework. If the data is incorrectly partitioned, then Hadoop plus thousands of servers is no faster than a single machine.

Enterprise Big Data starts at 100MB

Facebook is facing petabytes of data, that's millions of gigabytes, but is your company really facing that much data? Do you need to plug in that much data to solve the problem at hand? Unless you work for one of a short list of about 100 companies on Earth, I seriously doubt it.

I observe that for most enterprises, "Big Data" starts at 100MB, when:

  • Excel is no longer a solution.
  • SQL is no longer a solution (*)

(*) Yes, you can have a lot more than 100MB in a SQL database. However, reading the entire dataset through SQL needs to be done with care to avoid re-scanning the data thousands of times. In practice, in 90% of data crunching situations, I observe that it's easier to remove the SQL database than to improve the performance of the queries over the relational database.

Facing the problems

Thus, whenever data is involved, the initiative should start by facing the problems that are the true roadblock to deliver a "solution". Those problems are typically:

  • Collecting and servicing the data: About every single company I visit has problems collecting and servicing its data. The most obvious symptom is typically the lack of documentation concerning the data itself, and all the nitty-gritty insights needed to make anything of it. No technology is going to solve that problem, only people and process.
  • Choosing the metrics to be optimized: There are so many parts of the business that could be improved through a smart exploitation of the data that it is extremely tempting to think that some (hyped) technology might be THE answer to everything. This is not going to happen. Solving a problem through data is tough, and without metrics, you don't even know for sure whether you're moving in the right direction. Frequently, defining the metric - that is, the problem to be solved - is harder than implementing the solution.

Thus, before jumping to the next cool vendor solution, I urge you to start by facing the very uncool aspects of the problem. Frequently, the "solution" consists of removing an ingredient of the previous solution.

Monday, Jun 25, 2012

A few tips for Big Data projects

At Lokad, we are routinely working on Big Data projects, primarily for retail, but with occasional missions in energy or biotech companies. Big Data is probably going to remain one of the big buzzwords of 2012, along with a big trail of failed projects. A while ago, I offered tips for Web API design; today, let's cover some Big Data lessons (learned the hard way, as always).

1. Small Data trumps Big Data

There is one area that captures most of the community's interest: web data (pages, clicks, images). Yet, the web scale, where you have to deal with petabytes of data, is completely unlike 99% of the real-world problems faced by about every other vertical besides consumer internet.

For example, at Lokad, we have found that the largest datasets found in retail could still be processed on a smartphone if the data is correctly represented. In short, for the overwhelming majority of problems, the relevant data, once properly partitioned, takes less than 1GB.

With datasets smaller than 1GB, you can keep experimenting on your laptop. Map-reducing stuff on the cloud is cool, but compared to local experiments on your notebook, cloud productivity is abysmal.

2. Smarter problems trump smarter solutions

Good developers love finding good solutions. Yet, when facing a Big Data problem, it is just too tempting to improve the solution, as opposed to challenging the problem in the first place.

For example, at Lokad, as far as inventory optimization was concerned, we spent years of effort solving the wrong problem. Worse, our competitors have been spending hundreds of man-years of effort making the same mistake...

Big Data means being capable of processing large quantities of data while keeping computing resource costs negligible. Yet, most problems faced in the real world were defined more than 3 decades ago, at a time when any calculation (no matter how trivial) was a challenge to automate. Thus, those problems come with a strong bias toward solutions that were conceivable at the time.

Rethinking those problems is long overdue.

3. Being non-intrusive is scalability-critical

The scarcest resource of all is human time. Letting a CPU chew through 1 million numbers is nothing. Having people read 1 million numbers takes an army of clerks.

I have already posted that the manpower requirements of Big Data solutions were the most frequent scalability bottleneck. Now, I believe that if any human has to read numbers from a Big Data solution, then the solution won't scale. Period.

Like anti-spam filters, Big Data solutions need to tackle problems from an angle that does not require any attention from anyone. In practice, it means that problems have to be engineered in a way so that they can be solved without user attention.

4. Too big for Excel? Treat it as Big Data

While the community is frequently distracted by multi-terabyte datasets, anything that does not conveniently fit in Excel is Big Data as far as practicalities go:

  • Nobody is going to have a look at that many numbers.
  • Opportunities exist to solve a better problem.
  • Any non-quasi-linear algorithm will fail at processing the data in a reasonable amount of time.
  • If the data is poorly architected / formatted, even sequential reading becomes a pain.

Then comes the question: how should one handle Big Data? The answer is typically very domain-specific, so I will leave that to a later post.

5. SQL is not part of the solution

I won't enter (here) the SQL vs NoSQL debate; instead, let's outline that whatever persistence approach is adopted, it won't help with:

  • figuring out if the problem is the proper one to be addressed,
  • assessing the usefulness of the analysis performed on the data,
  • blending Big Data outputs into user experience.

Most of the discussions around Big Data end up distracted by persistence strategies. Persistence is a very solvable problem, so engineers love to think about it. Yet, in Big Data, it's the wicked parts of the problem that need the most attention.