
Dr Anoop Shah
Information from health records can be extremely useful in medical research, but at the moment not all the data can be extracted automatically. At University College London, Dr Anoop Shah is developing software to pull out more detail without compromising patient anonymity. By Michael Regnier.
How are health records used?
The General Practice Research Database (GPRD) has been collecting data from GPs’ electronic health record systems for 25 years. The information is anonymised and can be used for medical research, such as studies on drug safety. About 5 per cent of the UK population is covered by the database, so it has millions of patient records, many more than you could access by setting up a new research project.
Why isn’t all the data available?
GPs record major diagnoses using a system called Read Codes, which is standardised across the NHS. But doctors also enter information as free text: this could include specific symptoms, a suspected diagnosis or even a negative diagnosis. If a researcher wants to use the information in the free text, someone has to manually look at each record, anonymise it and pull out the relevant data. It’s not very practical.
How does your program work?
I have developed software that identifies pieces of free text that could potentially be coded in Read, even if they have been written using non-standard terms. It also detects the context of the diagnosis. For example, the coded term in a patient’s record might be “chest pain” but in the free text the GP may have written “not myocardial infarction”. Both bits of information are relevant and the program would pick them up while recognising that myocardial infarction was a negative diagnosis. On the other hand, the program deliberately omits any information that could identify the patient, such as names and locations.
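The idea described above can be illustrated with a small Python sketch. It is not the actual program: the lookup table, the cue words and the function name `extract_diagnoses` are all assumptions made for illustration, and the standard terms are plain labels rather than real Read Codes. It maps non-standard wording to a standard term, labels the context as affirmed, negated or suspected, and returns only the matched terms, so surrounding text such as names never reaches the output.

```python
import re

# Hypothetical table mapping free-text variants to a standard term.
# These are illustrative labels, not real Read Codes.
TERM_VARIANTS = {
    "myocardial infarction": "myocardial infarction",
    "mi": "myocardial infarction",
    "heart attack": "myocardial infarction",
    "chest pain": "chest pain",
}

# Assumed cue words; a real system would need a far richer set.
NEGATION_CUES = {"not", "no", "without", "excluded"}
SUSPICION_CUES = {"possible", "suspected", "query"}


def extract_diagnoses(text):
    """Return (standard_term, context) pairs in order of appearance."""
    lowered = text.lower()
    found = []
    for variant, standard in TERM_VARIANTS.items():
        pattern = r"\b" + re.escape(variant) + r"\b"
        for match in re.finditer(pattern, lowered):
            # Inspect up to three words immediately before the match.
            preceding = re.findall(r"[a-z]+", lowered[:match.start()])[-3:]
            if NEGATION_CUES.intersection(preceding):
                context = "negated"
            elif SUSPICION_CUES.intersection(preceding):
                context = "suspected"
            else:
                context = "affirmed"
            found.append((match.start(), standard, context))
    # Only the standard term and its context are returned,
    # never the surrounding free text.
    return [(term, context) for _, term, context in sorted(found)]


print(extract_diagnoses("Chest pain, not myocardial infarction"))
# [('chest pain', 'affirmed'), ('myocardial infarction', 'negated')]
```

A fixed three-word window is a crude way to detect context and would misfire on longer sentences; it is used here only to keep the sketch short.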
Initially, we tested whether the program could detect causes of death recorded in free text. It was very successful: of the diagnoses detected by the program, 98 per cent were correct. Free text in general can be harder to analyse, because it has become common for records to include correspondence between GPs, hospitals and patients. These letters can have complex language structures that confuse the program, but we are continually developing and improving it.
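The 98 per cent figure is the share of the program's detections that were correct, a measure usually called precision. It does not say how many diagnoses in the text were missed. A minimal sketch of the calculation follows; the function name and the counts are hypothetical and are not taken from the study.

```python
def precision(correct_detections, total_detections):
    """Fraction of detected diagnoses that were correct."""
    if total_detections == 0:
        raise ValueError("no detections to evaluate")
    return correct_detections / total_detections


# Hypothetical counts chosen only to reproduce a 98 per cent rate.
print(precision(98, 100))  # 0.98
```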
What could it be used for?
If you are studying patients with a particular disease, you probably want to know whether they eventually died of that disease or some complication linked to their treatment. Much of that information will be in the free text. It may also contain suspected diagnoses, which cannot be recorded in the coded data.
With further development, this type of software could also help doctors to fill in records: analysing their free text in real time, the software could suggest standardised terms. If it wasn’t right, the doctor could rephrase the information.
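As a rough illustration of suggesting standardised terms while a doctor types, the sketch below uses Python's standard `difflib` module to offer the closest entries from a vocabulary. The vocabulary, the function name `suggest_terms` and the similarity cutoff are assumptions for illustration only; the real software may work quite differently.

```python
import difflib

# Hypothetical vocabulary of standard terms (not real Read Codes).
STANDARD_TERMS = [
    "myocardial infarction",
    "chest pain",
    "coronary heart disease",
    "angina",
]


def suggest_terms(typed, limit=3):
    """Return up to `limit` standard terms similar to what was typed."""
    cleaned = typed.lower().strip()
    # get_close_matches ranks candidates by similarity, best first,
    # and drops anything scoring below the cutoff.
    return difflib.get_close_matches(cleaned, STANDARD_TERMS, n=limit, cutoff=0.6)


print(suggest_terms("Myocardial infraction"))
```

If nothing in the vocabulary is similar enough, the function returns an empty list, which corresponds to the case where the doctor would rephrase the entry.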
What drew you to this problem?
I started developing the program when I worked at the Medicines and Healthcare products Regulatory Agency, which holds the GPRD. I’ve had an interest in computing for years but it had always been more of a hobby. Now I’ve combined it with my research.
My medical training was in clinical pharmacology and general medicine, and now I have a Wellcome Trust Research Training Fellowship to do a PhD looking at biomarkers and prognosis of coronary heart disease, mostly using electronic health records. This program is intended to be a resource for anyone but it would certainly help me as well.
This feature also appears in issue 72 of ‘Wellcome News’.

Further reading
Shah AD et al. The Freetext Matching Algorithm: a computer program to extract diagnoses and causes of death from unstructured text in electronic health records. BMC Med Inform Decis Mak 2012;12:88.
