How to do ‘Large’ data testing

‘Big Data’ is a rising topic. Why that is so is out of scope here; I would like to jump straight to how to test big data. What does it mean? Why is it such a big deal? What methods are there to use? Which tools are there to use?

But before all of that, what do I mean by ‘Large Data Testing’?

When you google ‘data testing’, the results you get do not help much. So suppose you are responsible for testing a system that has a database. Whatever purpose this database serves, developers will be writing code that manipulates it. In this context, I would like to divide data testing into two subcategories:

* Finding data disparities, faults and missing entries/fields
* Testing the code that does data manipulation

Data Faults Testing

Simply put: this is the aftermath of what has already happened to your test data or database. For whatever reason, it has been affected by faulty software running on it, and the outcome the business wanted has been altered.

Most of the time, when you get hold of this data, it is unknown what has happened or how the data has been affected. Some lousy programmer (no offence) has run an untested script on the data; fields are missing, whole records are missing, or worse. I guarantee that your supervisor will expect you to be the one who digs through the data and works miracles in a short time.

Needle in a haystack (you have no idea what to expect).

In such a case, you will be running queries and scripts on the data. You might even write your own code to check for data issues.

But I have to stop here for a second and let you know that this is not in scope either. I will walk through one exercise on this topic at a later stage, where I write checker scripts for MongoDB in JavaScript. I will explain my mindset and the tools I used for it.

But for now, let’s move to the main issue:

Data Manipulation Code Testing

Let me paraphrase: Testing the code that does data manipulation.

Our lousy developer (still no offence) is again at his/her best, writing code as if he/she will not be the one held responsible. Our task is to make sure the written manipulator does what it is supposed to do and, in addition, does not corrupt data. The code might insert, update, delete, merge duplicates or clear redundant data. Since I have outlined the story-card at hand well enough, let’s move to the solution:

Requirement Gathering

Agile methodology practice: story-cards describe the desired behaviour of the system under development. Here I will assume basic knowledge of agile.

The story-card you will be testing might explain what the code should do only in very brief terms. Story-cards are usually just a sentence that expresses who wants the feature, what the feature is and why they want it. The Product Owner (assuming it is Scrum) explains the desired behaviour. Here is an example story:


The above example does not have acceptance criteria specified. Your Product Owner might be kind enough to write acceptance criteria as well. Even in that case, you are supposed to do your own research.

Ask questions such as:

* Is it certain that newer timestamps have a newer Address associated with the db entry?
* What are the rules that govern the merge action?
* What are the error codes for db entries wrongly selected for the merge action?
* What is the worst that can happen?

Be careful with the last question. It is possibly the most important one, and the one most frequently forgotten. Case in point: on the iPhone 4, reception failed while everything else worked. In a phone!

Criticise the requirement first; always remember that testing does not start when the code reaches the testing phase. Testing starts as soon as the idea of a new story-card (feature, requirement) is considered for development. Hmm, here you should ask: when does it end? (This is a very tricky question.) Well, drop it for now.

Outline all the rules you gathered from any source: product owner, customer, developer, ticket discussions, verbal communication (never omit the last one). You should not be afraid of asking questions about the end-points that are not entirely clear to you, because if something is not documented anywhere, it is likely that the Product Owner or customer did not think about it either. An agile developer has to make some decisions on the fly while developing the code. The only way to grasp these decisions is to communicate well, since developers will not be eager to write documentation for them.

Create your own decision tree / flow chart / end-points. These will be your scope for test case design.

Test Design Pattern

So, the most important suggestion of this article:

Create Your Own Data

Common practice is to run the code on QA data and observe the results. This is not going to cut it. There are thousands of entries in the data. Which cases would you check to guarantee that the code is doing what it is supposed to do, while not doing what it is not supposed to do?

Start with the basic flow; make sure that you have all the prerequisites known and documented.

I would suggest you document ‘Input’, ‘Output’ and ‘Desired Behaviour’ on plain paper. Visualise the user story and the requirement. Type the acceptance conditions yourself. With data testing there may be many corner cases, so visualising them and increasing your own awareness is critical.


There is no ‘usability’ testing here, which makes input and output your core subject. When I say input, I do not mean a single value but an input record that triggers a corner case in your data.

The acceptance criteria in the ticket (if there are any) are not going to include these data-specific corner cases. For example: the database has entries without a ‘title’ (still valid), or the database has ancient entries without some fields that the code under test is going to use. A Product Owner does not have to mention these corner cases in the ticket. These cases are the actual reasons things get complicated.
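One way to produce such corner-case entries is to start from a single well-formed entry and derive variants with fields removed. The following is only a minimal sketch: the helper `without_fields`, the baseline entry and the `createdAt` field are hypothetical and stand in for whatever your real schema contains.

```python
import copy


def without_fields(entry, *fields):
    """Return a deep copy of entry with the given top-level fields removed."""
    variant = copy.deepcopy(entry)
    for field_name in fields:
        variant.pop(field_name, None)  # ignore fields that are already absent
    return variant


# Hypothetical well-formed entry; replace with a record from your own schema.
baseline = {
    "id": 1,
    "title": "Sample product",
    "createdAt": "2014-01-01",  # assumed field, for illustration only
    "authors": [{"id": 10, "displayName": "Bolsu, Serhat"}],
}

# Each variant models one data-specific corner case.
corner_cases = {
    "no_title": without_fields(baseline, "title"),
    "ancient_entry": without_fields(baseline, "createdAt"),
}
```

Because every variant is a deep copy, changing one corner case never leaks into the baseline or into another case.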

Case Study;

Use case;


Let’s interpret this on paper as controlled inputs and desired outputs.

– There is a database of products, and those have authors in them.

– These authors have at least two fields: ‘id’ and ‘displayName’.

– Some of the displayNames are empty. ‘displayName’ is probably a combination of ‘firstName’ and ‘lastName’.

– If no author is left in a product after deleting the author(s) with an empty displayName, delete the containing product as well.


- displayName:  '' ''

Since I know that the display name is a combination of ‘firstName’ and ‘lastName’, and this is an English database, my data will have cases like ‘Bolsu, Serhat’. So if the displayName is empty, it is likely that the firstName and lastName fields are empty too, which will create cases like the following in the database:

- displayName: '',''
- displayName: '' ,''
- displayName: '', ''

The above is the part that the Product Owner will not mention, but it is the type of thing a tester who knows the database well should consider.

List the conditions depending on the input:
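To make the idea concrete, here is a small sketch of a check that treats all of these variants as "empty". It assumes the variants listed correspond to stored strings such as `","`, `" ,"` and `", "` (two empty name parts joined by a comma); the function name `is_empty_display_name` is made up for this example and is not part of any code under test.

```python
def is_empty_display_name(display_name):
    """Return True when display_name holds nothing but commas and whitespace.

    Assumption: displayName is built as "lastName, firstName", so two empty
    name parts can leave a stray comma and spaces behind.
    """
    if display_name is None:
        return True
    return display_name.replace(",", "").strip() == ""


# Variants a tester who knows the data might try.
samples = ["", " ", ",", " ,", ", ", "Bolsu, Serhat"]
flags = [is_empty_display_name(s) for s in samples]
```

Whether the real code under test should treat these strings as empty is exactly the kind of question to take back to the Product Owner.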

- 1 author with an empty displayName (always start with the basic case)
- 1 author with various cases of displayName (some listed above)
- More than 1 author with an empty displayName
- More than 1 author, of which 1 has an empty displayName while the others are okay


The next step will be preparing the data for these conditions.
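As a rough illustration, the four conditions could be turned into controlled data like this. The document shape (a product with an `authors` list), the product ids and the exact empty variants are assumptions made for the sketch; adapt them to your real schema.

```python
# Assumed representations of an "empty" displayName.
EMPTY_VARIANTS = ["", " ", ",", " ,", ", "]


def author(author_id, display_name):
    """Build a minimal author record with the two known fields."""
    return {"id": author_id, "displayName": display_name}


def build_test_products():
    """Return one product per condition (several for the variant condition)."""
    products = []
    # Condition 1: one author with an empty displayName (the basic case).
    products.append({"id": "p1", "authors": [author(1, "")]})
    # Condition 2: one author per remaining empty variant.
    for number, variant in enumerate(EMPTY_VARIANTS[1:], start=2):
        products.append({"id": f"p{number}", "authors": [author(number, variant)]})
    # Condition 3: more than one author, all with an empty displayName.
    products.append(
        {"id": "p_all_empty", "authors": [author(100, ""), author(101, ", ")]}
    )
    # Condition 4: more than one author, only one of them empty.
    products.append(
        {"id": "p_mixed", "authors": [author(200, ""), author(201, "Bolsu, Serhat")]}
    )
    return products


test_products = build_test_products()
```

Giving each product a readable id makes it easy to see which condition failed when you compare the output later.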

Test Execution

The rest is actually trivial: load the controlled input, execute the code under test, evaluate the output. The only point where you need to be careful is that the code might be giving the desired output while corrupting other data that should not be touched. Make sure you add control cases to catch this.

You might choose to gather more than one case into the same data sample to speed up testing. But be careful that your test cases do not affect each other, since the data manipulation script will be running on all of the data.
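The load, execute, evaluate loop with a control case might look like the sketch below. Note that `remove_empty_authors` is only a stand-in for the real code under test, written here so the example runs on in-memory data; `run_case`, the ids and the sample names are likewise made up for illustration.

```python
import copy


def is_empty(name):
    """Treat None, whitespace and stray commas as an empty displayName."""
    return name is None or name.replace(",", "").strip() == ""


def remove_empty_authors(products):
    """Stand-in for the code under test: drop empty authors, then empty products."""
    result = []
    for product in products:
        kept = [a for a in product["authors"] if not is_empty(a.get("displayName"))]
        if kept:
            result.append({**product, "authors": kept})
    return result


def run_case(input_products, expected_products, control_products):
    """Load input plus controls, run the code, and compare the whole outcome."""
    loaded = copy.deepcopy(input_products + control_products)
    output = {p["id"]: p for p in remove_empty_authors(loaded)}
    # Controls must come out exactly as they went in.
    expected = {p["id"]: p for p in expected_products + control_products}
    return output == expected


control = [{"id": "c1", "authors": [{"id": 9, "displayName": "Bolsu, Serhat"}]}]
inputs = [
    {"id": "p1", "authors": [{"id": 1, "displayName": ""}]},
    {"id": "p2", "authors": [{"id": 2, "displayName": ""},
                             {"id": 3, "displayName": "Doe, Jane"}]},
]
expected = [{"id": "p2", "authors": [{"id": 3, "displayName": "Doe, Jane"}]}]

passed = run_case(inputs, expected, control)
```

Comparing the full result set, rather than only the entries you expected to change, is what lets the control case reveal data that was touched when it should not have been.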