Need for transparent and robust response when research misconduct is found (another example)

On the 22nd of February, Dorothy Bishop published an open letter to CNRS, “Need for transparent and robust response when research misconduct is found”, signed by leaders and activists in the field of scientific integrity, which has now received a response from the CEO Antoine Petit. That letter was prompted by a case in which I am the whistleblower, and I hope to be able to say more about certain aspects of this story in the future.

It is however important to note, as the letter indeed clearly does, that the need for a more transparent and robust response is hardly limited to CNRS. The experience of Clément Fontana unfortunately illustrates that other research organisations in France also have work to do. Clément Fontana discovered his name on an article which had been published without his agreement a couple of years after he had left a post-doctoral position. He also found serious errors in a second article from the same group, an article which turned out to be the basis of a successful grant application. His experience challenging those two problems is less than encouraging, as you can see by reading his public letter [French]. It may be that in his case the complexity of the organisation of research, with multiple institutions sharing the control and management of a single laboratory (a characteristic of the French research landscape), has made it easier for each of those institutions to evade its responsibilities.

What do scientific prizes celebrate?

Another scientific prize has been awarded to Chad Mirkin for his work on oligonucleotide-modified gold nanoparticles (or Spherical Nucleic Acids, aka SNAs, as Mirkin has called them since 2012)… whilst SNA company Exicure continues on its “death spiral”. The prize is the 2023 King Faisal Prize (KFP) in Medicine and Science.

The KFP and Northwestern press releases show a certain disconnect from reality. They specify that:

SNAs can naturally enter human cells and tissues and cross biological barriers that conventional structures cannot, making it possible to detect or treat disease on the genetic level while leaving healthy cells untouched. They are the basis for more than 1,800 commercial products used in medical diagnostics, therapeutics and life science research.

It is debatable whether there is anything special about the way oligonucleotide-modified gold nanoparticles enter cells, penetrate tissues and cross biological barriers, but what is sure is that this property is not currently used to detect or treat any disease, nor are there any commercial products in therapeutics and life science research based on this supposed crossing of biological barriers.

The 1,800 commercial products is a reference to the SmartFlares, which were discontinued 5 years ago because they did not detect mRNAs in live cells. The company that develops the biomedical applications is on the verge of bankruptcy. In 2018, after a clinical trial, Purdue Pharma notified Exicure that “it declined to exercise its option to develop AST-005” [Anti-TNF compound against mild to moderate psoriasis]. In 2021, a research improprieties case in the Friedreich’s ataxia program led to the winding down of that program and also of the immuno-oncology program for cavrotolimod (AST-008; a clinical trial was ongoing and had reported some interim results). At the end of 2022, partnerships with pharmaceutical companies Ipsen and Abbvie were unravelling.

In Northwestern’s press release, Mirkin is quoted as saying that “The prize is extraordinary validation for the strategic bet Northwestern made on nanoscience two decades ago, …”. Indeed, Northwestern University even bought shares in Exicure. Maybe the prize money can be used to compensate for that financial loss?

As readers of this blog know, I have been raising questions about SNAs here and elsewhere since 2015. At the risk of being called a scientific terrorist again, I shall continue.

Is it somebody else’s problem to correct the scientific literature?

Last week was rather eventful, starting with 3 days at the French Society for Nanomedicine conference in Strasbourg with several members of the NanoBubbles project (including Nathanne Rost and Maha Said, who presented posters on post-publication peer review and replications in bionano) and many interesting discussions. I skipped the Tuesday morning sessions to give a seminar at the Institut Charles Sadron, where, 20 years ago, I defended my PhD. The event was recorded and you can now watch it on YouTube (below). The conference also coincided with the publication (Monday afternoon) in the French newspaper Le Monde of an investigation of a major integrity case in the laboratory that I had joined two years ago. I am the whistleblower. I am quoted in the article noting that 20 months after reporting this case, most of the articles have been neither corrected nor retracted. Indeed, of the 23 articles for which I reported concerns, nine have been “corrected”, one has been retracted and thirteen remain untouched. Remarkably, the retracted article has been republished in a predatory journal. The corrections are problematic both for technical reasons (e.g. this one or this one) and because of the lack of transparency (the editors receiving the correction requests and the readers reading those corrections would have been unaware of the real reason why the correction was necessary in the first place). The question of how and when corrections are appropriate in cases of breaches of research integrity needs to be further explored by journals and research institutions. PLoS One’s conditions for publishing a correction include the requirement that there are no concerns about the integrity or reliability of the reported work. Other publishers have much more ambiguous policies.
In any case, given the current lack of appropriate action 20 months after my initial reports to institutions, I have now posted on PubPeer my concerns in full so that everyone can make up their own mind and the authors can respond if they wish to do so.

Spherical Nucleic Acids company Exicure “in survival mode”

In 2011, Mirkin co-founded (with Shad Thaxton) the biotechnology company Exicure to develop the biomedical applications of spherical nucleic acids (SNAs). After an internal research fraud (misreporting of preclinical data) and at least two clinical trials failing to deliver encouraging results, the company seems unlikely to survive 2023. In 2018, Chad Mirkin called me a scientific terrorist for raising questions about the SNA technology.

Is the Lancet complicit in research fraud?

Devastating account of the Lancet’s complicity in keeping on the record articles that they have known for years are fraudulent. And which have caused deaths.

Dr Peter Wilmshurst

This blog was written jointly by Patricia Murray, Professor of Stem Cell Biology and Regenerative Medicine, University of Liverpool, UK and Peter Wilmshurst.

The editor of a medical journal that charges readers for access to articles whilst knowingly keeping fraudulent articles on its website is as guilty of financial fraud as an art dealer who knowingly sells forged artworks, but there is no moral equivalence. The complicity in fraud by the editor of the medical journal may also cause death and harm to patients.

In 2008, the Lancet published “Clinical transplantation of a tissue-engineered airway” in a patient with post tuberculous stenosis of her left main bronchus (Macchiarini P, Jungebluth P, Go T et al.)1. The Lancet also published the five year follow up results of the same patient (Gonfiotti A, Jaus MO, Barale D et al. The first tissue-engineered airway transplantation: 5-year follow-up results.)2.



Letter from Peter Wilmshurst to UCL President & Provost Prof Spence

Following the publication of our letter in the BMJ “Time to retract Lancet paper on tissue engineered trachea transplants” (doi:, published 02 March 2022), Peter Wilmshurst has written to UCL President & Provost Prof Spence. I reproduce his letter with his authorization. It is also (or will shortly be) cross-posted on Leonid Schneider’s blog.

4 April 2022,

Dear Professor Spence

On 18 May 2021 I wrote to you “to enquire what University College London (UCL) has done and plans to do about the fact that Professor Martin Birchall is a co-author of an article (Macchiarini P, Jungebluth P, Go T, et al. Clinical transplantation of a tissue-engineered airway. Lancet 2008;372:2023-30.) that has not been retracted despite being fraudulent.” At that time I presumed that you had enough integrity to realise that a fraudulent medical research article that was resulting in harm to patients should be retracted and enough common sense to realise that a cover up would harm the reputation of UCL. Since then the BMJ has published a letter from colleagues and I calling for retraction of the paper. The BMJ editors and lawyers checked the supporting evidence before publication. I am attaching a copy of the letter.

On 9 July 2021 UCL informed me that Professor Pillay had been appointed to investigate my allegations under UCL’s Research Misconduct Procedure. On 2 November 2021, UCL informed me that Pillay had decided: “The allegation is to be dismissed on the grounds of the substance of the concerns having been considered previously for the following reasons: As you know UCL has conducted a Special Inquiry into Regenerative Medicine at UCL and the Inquiry report was published in September 2017 which made a number of recommendations. The paper in question has been scrutinised by the Inquiry as well as two internal reviews at UCL, a House of Commons Select Committee as well as reviews by the Lancet itself. After careful consideration, I do not consider that you have submitted any new substantial evidence that alters the substance of the allegations that have already been addressed by all these various reviews.”

The new substantial evidence is the correspondence from Professor Castells in early 2018, in which he said the airway collapsed three weeks after the transplantation and it needed to be stented. That means that the main claim of the paper that the graft “had a normal appearance and mechanical properties at 4 months” was false. In addition, Castells said that the claimed improvement in lung function was also untrue. Therefore Pillay’s claim that I had not submitted any new substantial evidence is spurious, at least as far as UCL is concerned. Before 2018, the integrity of the 2008 Lancet paper had been doubted and the ethical basis had been questioned by distinguished surgeons, who pointed out that it was unethical to subject a patient to high risk experimental surgery without prior research demonstrating efficacy in an animal model, particularly when alternative conventional surgery was eminently feasible. The 2018 correspondence from Castells alters the substance of the allegations because it provides incontrovertible proof that the main claims in the paper are false and, because there is no possibility that this could be the result of inadvertent error, the correspondence is conclusive proof of fraud. UCL confirmed that the House of Commons Select Committee that Pillay referred to is the Science & Technology Committee.

There are a number of things that show that the Science & Technology Committee was very concerned that the 2008 Lancet paper has not been retracted. For example:

1. Sir Norman Lamb, who was then Chair of the Science & Technology Committee, added signposts in the previous Committee Report on Regenerative Medicine (2017) to alert readers to incorrect information about the 2008 Lancet paper. The subsequent Report on Research Integrity (2018) refers to the 2008 paper and some of the subsequent follow up research by Professor Birchall at UCL (i.e. RegenVOX). It says “misconduct processes have revealed that the research on using stem cells to support artificial trachea transplants is not reliable, and is based on exaggerated patient outcomes (see Box 2). The ‘RegenVOX’ clinical trial of stem cell-based tissue-engineered laryngeal implants referred to above is now listed as ‘withdrawn’ on the website. Having explored the issue of correcting the research record with our witnesses, we resolved to find a way of flagging the now contested evidence that the Committee received to readers of its report. We have arranged for a note to be attached at the relevant places in the online report with a forward reference to this inquiry. Our intention is to help readers of that earlier report to find further relevant information, not to alter the formal record of our predecessor’s work.”

2. After Sir Norman became aware of the email correspondence from Castells in 2018, he sent letters to the Lancet asking the journal to consider retraction of the 2008 paper. Our BMJ letter quotes from Sir Norman’s letter dated 7 March 2019 and UCL has previously been sent a copy of the letter. Sir Norman clearly believed that the emails from Castells were compelling substantial new evidence.

3. Sir Norman also appeared on BBC Television’s Newsnight programme questioning the Lancet’s failure to retract the paper. The link is The first part of the clip deals with Shauna Davison who died the day after her discharge from Great Ormond Street Hospital when her trachea collapsed and she suffered asphyxia. The second part shows Sir Norman being interviewed on the Newsnight programme.

4. The Medical Research Council has a timeline of “Leading research for better healthcare”. It originally had two major advances for the year 2008. One was “2008 First stem cell-based windpipe transplant conducted”. In early 2019, Sir Norman criticised the MRC for refusing to remove that entry when the Lancet 2008 paper was known to be false. In May 2019, the MRC removed the 2008 paper from the timeline. Below are links to the current timeline and the one that was on the MRC website in early 2019

5. Sir Norman is no longer a Member of Parliament, but he has seen our BMJ letter and amongst his other comments, he said that the failure to retract the paper “beggars belief”.

A question remains whether correspondence from Castells was new substantial evidence as far as UCL was concerned. Pillay maintains that the correspondence from Castells does not add to the scrutiny by the Special Inquiry into Regenerative Medicine at UCL and two internal reviews. The Report of the Special Inquiry was published in September 2017. That was before the date on which the fraud was confirmed by the correspondence in May 2018. So obviously the Report does not mention the correspondence between Castells and the Lancet. However, it does raise other concerns about Birchall. For example, it points out that the cell preparation in Bristol took place in a building not licensed under the Human Tissue (Quality and Safety for Human Application) Regulations 2007. The Regulator made a decision not to prosecute the Bristol team for the breaches of the regulations. In addition there is evidence that the four day incubation of the donor trachea with so-called “stem cells” started in Bristol, because the trachea was transported to Barcelona on 10th June, only two days before the operations. Accordingly the trachea should have been classed as an Investigative Medicinal Product in the UK, requiring regulatory approval from the MHRA, but no approval was obtained. Birchall’s attitude to regulations designed to protect patients is illustrated by his statement about the 2008 Lancet paper in an interview to Vogel (Science, 19 April 2013, volume 340, pages 266-8) “We ran rough-shod over regulations – with permission.” In fact there was no permission. He also said “It wasn’t done to the highest possible standards.”

UCL has refused my Freedom of Information requests for the reports of the two internal reviews that Pillay referred to. The reason given by UCL is “Whilst recognising that there is a strong public interest in this area of research, there is also a need for a safe space away from external influence in which allegations of research misconduct can be reviewed and decisions taken. If there was an expectation that these discussions would be disclosed to the public this would inhibit free and frank discussion and would lead to poorer decision-making. For this particular process, the need to ensure robust decision making is considered significant to maintain the integrity and effectiveness of the process itself.” While the Information Commissioners Office is considering my appeal against UCL’s decision, I made an FOI request for UCL to answer the following questions:

1. Was one of the internal reviews that Professor Pillay referred to titled “Allegation of research misconduct against Professor Martin Birchall, Professor Paolo Macchiarini and Professor Alexander Seifalian. Report of the Screening Panel”, which resulted from allegations made by Professor Pierre Delaere in January 2015?
2. Was one of the internal reviews that Professor Pillay referred to titled “Allegations of research misconduct against Professor Martin Birchall from Professor Patricia Murray. Report of the Screening Panel” with the report dated December 2018?
3. If one or both of the two reports mentioned in questions 1 and 2 were not the reports of the internal reviews that Professor Pillay was referring to, what were the titles of the reports, when were they completed, who made the complaints that resulted in the internal reviews and when did UCL receive those complaints?

UCL has refused to answer those questions for essentially the same reason that it refused to provide the reports of the internal reviews. UCL said “Whilst recognising that there is a strong public interest in this area of research, there is also a need for a safe space away from external influence in which allegations of research misconduct can be reviewed and decisions taken. UCL relies on individuals coming forward with complaints of academic misconduct, which they may be less likely to do if they thought the fact they had made a complaint might be made public.”

The reason given by UCL for being unwilling to provide a “yes / no” response to questions 1 and 2 is incomprehensible, because I already know the names of those internal reviews, I have copies of both reports and one of the reports is available on the internet. If these two internal reviews are the ones that Pillay was relying on, it calls into question his judgement and his integrity, because neither considered the evidence from Castells. In addition, subsequent events show that the two internal reviews provided false reassurance about the integrity of UCL employees, which raises additional concerns about the rigor of UCL’s internal review processes. Therefore it is worth considering the findings of the two internal reviews that I believe Pillay was referring to.

In his complaint, Professor Delaere alleged misconduct by Birchall and Professor Seifalian, who were at the time employed by UCL, and by Macchiarini, who had left his honorary professorship at UCL by the time the report was produced in late 2015. That was before the proof of fraud from Castells became available in 2018. So the report did not consider that evidence. The three UCL professors had been co-authors of a 2011 Lancet paper (Macchiarini P, et al. Tracheobronchial transplantation with a stem-cell-seeded bioartificial nanocomposite: a proof-of-concept study. Lancet 2011;378(9808):1997-2004). In addition, Birchall was named as senior author and Seifalian was a co-author of a 2012 Lancet paper (Elliot MJ, et al. Stem-cell based, tissue engineered tracheal replacement in a child: a 2-year follow-up study. Lancet 2012;380(9846):994-1000). In paragraph 17 of the report of the 2015 UCL internal review (screening) panel it was “noted that the published report on the 2011 synthetic tracheal transplant case, which had included Professor Seifalian as a co-author, was one of six published articles that had been reviewed by four surgeons at the Karolinska University Hospital and cited by them in their allegation of scientific misconduct against Professor Macchiarini on the grounds that the results published by him as the lead author did not appear to correlate with the patients’ actual clinical outcomes. However, the Panel noted that no reference had been made to Professor Seifalian in this allegation, and it determined that there was no prima facie evidence to suggest that Professor Seifalian could be held to account for any of the major inconsistencies or inconsistent and omitted clinical information that had been highlighted by the Karolinska surgeons in their report.”

The 2011 Lancet paper has now been retracted because it was fraudulent. Professor Seifalian manufactured at Royal Free Hospital / UCL some of the plastic trachea that were supposedly “seeded with the recipients’ stem cells” before they were implanted by Macchiarini when he was working at the Karolinska Institute – the plastic tracheas were not made to GMP (Good Manufacturing Practice) standards. Professor Seifalian was dismissed from UCL on 15 July 2016 for misconduct during his collaboration with Macchiarini. The 2015 screening panel “determined that there was no prima facie evidence that any research misconduct . . . . had taken place, but that there was nevertheless some substance to (Delaere’s) claim that there was a misleading element within the 2012 Lancet published report which had included Professor Birchall and Professor Seifalian as co-authors – namely with regard to the two figures within the report . . . . These figures had in the Panel’s view not given sufficient emphasis to the presence and possible contribution of the stent and omentum tissue wrap in the recovery of the child patient. Furthermore, the Panel felt that none of the evidence presented by Professor Birchall in this published report in fact serve to demonstrate that the addition of stem cells to the transplanted tracheal scaffold used in the patient case concerned played any therapeutic role in the functioning of the trachea and that none of the effects that were demonstrated in these published reports could be directly linked to the beneficial effects of stem cells.”

In addition, the 2015 screening panel “felt that Professor Birchall should be urged to give greater consideration to the need for clearer and more representative presentation of information and evidence in his published reports in order to support his assertions, to allow transparent and complete judgement by the scientific community, and to avoid exposure to further allegations of research misconduct, for example the presentation of misleading information, that might jeopardise his future research efforts and subject both himself and UCL to reputational risk. To this end, the Panel felt that Professor Birchall would be well advised to seek to check some of his assertions and the way that these were presented in his published reports with other senior colleagues and collaborators outside the co-authorship of his publications.”

If Macchiarini was solely responsible for the false claims about airway transplantation in the 2008 paper and Birchall, Macchiarini’s co-principal investigator, was blameless, how is it that the 2012 paper made misleading claims about tracheal transplantation when Birchall was its senior author and Macchiarini was not even a co-author? The complaint from Professor Murray raised further concerns about publications by Birchall, but they were unrelated to the 2008 Lancet paper. Murray’s complaint was also made before the information from Castells was known. Although the internal review screening panel’s report was produced after Castells’ correspondence with the Lancet, the screening panel’s report does not mention either the 2008 Lancet paper or Castells’ correspondence.

Murray alleged use of the same images in two separate publications, which had different methods, and deliberate misuse of research findings to support an application for ethics approval. Birchall admitted six images in one paper should not have been used because they related to animal experiments in a different paper. Birchall blamed this on a mistake by a former UCL PhD student and “the scientist overseeing the publication”. Birchall also admitted inaccuracies in a PhD thesis and errors in a research ethics committee application.

In addition, I understand that UCL refused to investigate more serious allegations and said that University College Hospital and Great Ormond Street Hospital should investigate those. One of the more serious allegations was that Birchall and Professor Lowdell knew from work undertaken by their PhD student that freeze-thawing the trachea significantly weakened the structural integrity, making it more likely to collapse. But this information was omitted from all papers and applications for ethics approval from UCL. Failure to take this into account was the reason that Shauna Davison’s trachea collapsed on the day after she was discharged from Great Ormond Street Hospital and, as a result, this 15 year old child died from asphyxia.

From these documents, I do not gain the impression of an aberrant medical researcher. Rather I see a departmental culture of dishonesty and poor practice that UCL is trying hard to conceal. Therefore it is difficult to escape the conclusion that the reason UCL will not provide answers to my FOI questions is that those internal reviews did not consider the 2018 correspondence between Castells and the Lancet. If I am correct, disclosure of the information will confirm that Pillay has fabricated a spurious reason to avoid investigating the research fraud involving Birchall. I believe that if all the facts came to light, UCL would have to explain:

1. Why it employed Birchall and gave an honorary contract to Macchiarini on the basis of their fraudulent Lancet paper.

2. How enthusiasm for bogus science was used to justify lethal experimental surgery on young patients at hospitals associated with UCL.

3. How large amounts of publicly funded grants were taken by UCL for research predicated on

I would like to know what UCL is going to do about this scandal and about the apparent attempt at cover-up by Pillay.

Yours sincerely

Peter Wilmshurst

University Responsibility for the Adjudication of Research Misconduct, by Stefan Franzen

Stefan Franzen is a Professor of Chemistry at North Carolina State University. He is also a whistle-blower in a case of research misconduct that, eventually, after 10 years, led to the retraction of a 2004 Science article entitled “RNA-Mediated Metal-Metal Bond Formation in the Synthesis of Hexagonal Palladium Nanoparticles”.

What he learnt about research misconduct, he learnt it the hard way.

Yet, whilst his personal experience of this specific controversy informs and nourishes the narrative, University Responsibility for the Adjudication of Research Misconduct is an academic book that has a much broader scope and ambition, as illustrated by the table of contents:

  1. Evolution in a Test Tube
  2. The Clash Between Scientific Skepticism and Ethics Regulations
  3. Scientific Discoveries: Real and Imagined
  4. The Corporate University
  5. The Institutional Pressure to Become a Professor-Entrepreneur
  6. The Short Path from Wishful Thinking to Scientific Fraud
  7. University Administration of Scientific Ethics
  8. Behind the Façade of Self-Correcting Science
  9. The Origin of the Modern Research Misconduct System
  10. Sunshine Laws and the Smokescreen of Confidentiality
  11. The Legal Repercussions of Institutional Conflict of Interest
  12. Bursting the Science Bubble

I encourage you to read the book. Here I want to discuss a specific point, which Stefan Franzen considers in particular in Chapter 10: the issue of the confidentiality of integrity investigations. In short, Franzen argues that confidentiality is bad for the whistle-blower, bad for the person(s) whose work is questioned, but convenient for institutions that may want to use it as a smokescreen to limit damages to their reputation.

Let’s start with a quote of the first sentence of Chapter 10:

The contradiction between the confidentiality practiced by adjudicating institutions and the public nature of academic science causes disruption of every aspect of research misconduct investigations. Prior to adjudication, allegations would best be kept from public view, but this can rarely be achieved in a collaborative research setting. The difficult problem of reigning in rumors or protecting informants and respondents from repercussions to their careers is often ignored by university administrators, even though the purpose of the NSF OIG [National Science Foundation Office of Inspector General] confidentiality regulation is to protect the individuals involved.

Later Franzen notes that in most cases the number of people who are in a position to file an allegation is small (e.g. collaborators or competitors, who may have expressed concerns in the past) and that they are therefore easy to identify. He shows how, in his case, confidentiality was used to prevent him (or others with relevant expertise) from accessing relevant elements of the investigation (e.g. lab books or data that could have settled the case very quickly), but did not protect him as a whistle-blower: “In the hexagon case, everyone in the academic departments of both universities involved and many in the university administration knew who was involved in the case from the beginning”. (In France, Rémy Mosseri, who is the integrity lead for the CNRS, says that ~50% of the integrity cases that are reported to him are “collaborations that ended sourly”.)

Franzen also considers the case of accusation against more junior scientists (e.g. PhD students or post-doctoral researchers) where confidentiality indeed could serve a purpose of protection of a vulnerable researcher, but where it also often serves to protect the supervisors from scrutiny in cases where mentoring problems may have contributed to the situation.

One further problem (alluded to in the first sentence of the chapter cited above) is the tension between correction of the scientific record and the determination of eventual sanctions. What is the priority and focus of integrity investigations? Is it to clarify and eventually correct the science, or is it to determine the seriousness of wrongdoings and propose appropriate punishments? Do these two goals go hand in hand, or, to the contrary, would prioritising one or the other lead to rather different procedures, in particular when it comes to openness versus confidentiality? It is my personal impression (from my reading and the cases I am or have been involved in) that correction of science is not the priority in such investigations and that confidentiality indeed hinders correction of science too.

What is your experience (anonymous replies allowed)?

Down the rabbit hole of the Limit Of Detection (LOD)

This is a guest post by Gaëlle Charron, Maîtresse de conférences at Université de Paris.

In a post about SERS sensing hosted on this blog, I complained about LODs often being reported below the concentration range in which the sensor displays a linear signal vs. concentration response. Wolfgang Parak reacted here: he thinks this is not an analytical error. This is a good discussion to have, one that I have meant to formalise for years to introduce the concept to my students. Wolfgang gave me the decisive incentive, and for that I thank him. In the following, I will go into full tutorial mode for that reason. Feel free to skip some parts if you feel offended, or to reuse the material if you find it useful.

Wolfgang stated the following definition for the LOD:

I thought the typical definition of the LOD is the concentration in which the detection signal is at least three times bigger than the noise in the signal. This is an “all or nothing” response. At the LOD you can tell that “there is something”, but you can’t necessarily tell how much. The range of the linear response is much harder to achieve. 

I agree. This is actually the recommended IUPAC definition of the LOD.

The limit of detection (LOD), expressed as the concentration cLOD or the quantity qLOD, is derived from the smallest measure yLOD that can be detected with reasonable certainty for a given analytical procedure. The value of yLOD is given by the equation

yLOD = yB + k sB

Where yB is the mean of the blank measures, sB the standard deviation of the blank measures and k is a numerical factor chosen according to the desired confidence level (note that the original notations have been modified to be in line with the ones used below).

Generally, a value of 3 is chosen for k; it corresponds to a 93% confidence level. But more on that later.
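As a rough illustration of the definition above, here is a minimal Python sketch that computes yLOD = yB + k sB from replicate blank readings. The function name `lod_signal` and the absorbance values are hypothetical, chosen only to show the arithmetic.

```python
from statistics import mean, stdev

def lod_signal(blank_readings, k=3.0):
    """Return yLOD = yB + k * sB computed from replicate blank readings."""
    y_b = mean(blank_readings)    # mean of the blank measures
    s_b = stdev(blank_readings)   # sample standard deviation (n - 1 denominator)
    return y_b + k * s_b

# Hypothetical replicate absorbance readings of a blank cuvette
blank = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.008, 0.011, 0.010, 0.012]
print(f"yLOD = {lod_signal(blank):.4f}")
```

Note that this only gives the signal yLOD; turning it into a concentration cLOD still requires knowing how the signal depends on concentration near the blank.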

Let’s dive into the statistics that underpin this definition, or skip it if you want. Say you acquire several measurements of a blank sample and of analyte samples, for instance by recording replicate absorption readings of a spectrophotometric cuvette filled either with pure water or solutions of the analyte. Let’s focus on the blank sample. Because of the natural dispersion of measurements, you will not get the same readings each time. Actually, if you have acquired a large number of measurements, the frequency distribution of the readings will have a bell shape, that of a Gaussian distribution characterised by the mean signal of the blank, yB, and its standard deviation sB (if you have acquired fewer measurements, say 10, it will look a lot more like ASCII art).

If you perform an extra measurement, there is a 15.9 % chance that it will give a reading above yB + sB because of the properties of the Gaussian distribution:

Figure: Gaussian distribution with its standard deviation bands, adapted from g/wiki/File:Standard_deviation_diagram.svg

There is only a 2.2% chance it will give a reading above yB + 2sB (point P). Therefore if you blindfold yourself, pick a cuvette on the cuvette rack, somehow manage to perform a measurement without ruining your shoes and obtain a reading of yB + 2sB, the odds that it is the blank sample are only 2.2%. In other words there is a 97.8% chance that what you did measure while blindfolded was not the blank sample but an analyte sample. yB + 2sB would be a nice cut-off value to avoid claiming the presence of an analyte when in fact it is absent, namely to avoid reporting a false positive. When the signal is above this value, you have a 97.8% probability of being right when claiming it is an analyte sample.
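The tail probabilities quoted above can be checked with a few lines of Python. This is a minimal sketch that assumes perfectly Gaussian readings; `upper_tail` is a hypothetical helper name. The exact value beyond two standard deviations comes out near 2.3 %, close to the 2.2 % read off the diagram.

```python
from statistics import NormalDist

def upper_tail(n_sd):
    """Probability that a Gaussian reading exceeds its mean by more than n_sd standard deviations."""
    return 1.0 - NormalDist().cdf(n_sd)

for n_sd in (1, 2, 3):
    print(f"P(reading > yB + {n_sd} sB) = {100 * upper_tail(n_sd):.1f} %")
```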

Let’s temporarily pick yB+2sB as a cut-off value to discriminate between the blank sample and the analyte samples. A reading below that value is assigned to the blank cuvette, a reading above is assigned to an analyte cuvette. This will efficiently avoid false positives (with 97.8% confidence) but will lead to plenty of false negatives.

Indeed, say one of the analyte samples has a true mean signal of exactly yB + 2sB. Upon acquiring lots of replicate measurements of that sample, you will also get a Gaussian frequency distribution of the readings. For the sake of simplicity for the moment, let’s assume that it will have the same width as that of the blank sample, ie. the same standard deviation. Let’s pick one of those measurements at random. There is a 50% chance that it is greater than yB + 2sB. Upon applying the yB+2sB criterion, that measurement would have been correctly assigned to an analyte sample. Let’s draw again. This time the value is lower than yB+2sB, the chances of it were also 50%. Applying the yB+2sB criterion, one would incorrectly assign that measurement to the blank sample. One would therefore be wrong and report a false negative with 50% probability. That yB+2sB criterion is not so good after all.

Ideally, one would like to avoid both false positives and false negatives efficiently. It is then better to pick a cut-off signal value further away from the mean of the blank sample. Let’s put that new cut-off twice as far as previously, at yB + 4sB. Let’s also put an analyte cuvette with a true mean reading of exactly yB + 4sB on the cuvette rack, along with the blank cuvette, and let’s put the blindfold back on. You pick one of the 2 cuvettes, press measure and you get a signal of yB + 2sB, below the cut-off of yB + 4sB. In claiming that the mystery cuvette is not that of the analyte sample when in fact it is (false negative), you have a probability of being wrong of only 2.2% because that reading of yB + 2sB is 2 standard deviations away from the true mean of yB + 4sB of the analyte sample. In claiming that it is not the blank when in fact it is (false positive), you have a 2.2% chance of being wrong because that reading of yB + 2sB is 2 standard deviations away from the true mean of yB of the blank sample. At point P exactly, the signal is as likely to arise from the blank as from the analyte. But as soon as you diverge from P, one cuvette assignment becomes markedly more likely than the other. P is therefore called the decision point.

How can you put it to use? Let’s take a huge cuvette rack with 200 grooves in it. And let’s put 100 cuvettes of the blank sample and 100 cuvettes of an analyte sample with a true mean signal of yB+4sB in it, in a random order. With the blindfold on, let’s measure these cuvettes and sort them out according to the readings: less than yB+2sB, the cuvette goes to the “blank” rack, more than yB+2sB, the cuvette goes onto the analyte rack. Once the blindfold is off, we will find ourselves with a blank rack with 2 or 3 analyte cuvettes misplaced (false negatives) and an analyte rack with 2 or 3 blank cuvettes misplaced (false positives). Not bad, is it?
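The blindfolded sorting experiment can be simulated. The sketch below assumes Gaussian readings with the same standard deviation for both samples and works in arbitrary units (yB = 0, sB = 1); `sort_cuvettes` and its parameters are hypothetical names. With only 100 cuvettes of each kind, the number of misplaced cuvettes fluctuates from run to run around the expected 2 or 3.

```python
import random

def sort_cuvettes(n_each=100, y_b=0.0, s_b=1.0, seed=0):
    """Blind-sort n_each blank and n_each analyte cuvettes with a yB + 2sB cut-off.

    Returns (false_positives, false_negatives).
    """
    rng = random.Random(seed)
    cutoff = y_b + 2 * s_b           # decision point P
    analyte_mean = y_b + 4 * s_b     # true mean signal of the analyte sample
    # Blank readings that land above the cut-off end up on the analyte rack.
    false_positives = sum(rng.gauss(y_b, s_b) > cutoff for _ in range(n_each))
    # Analyte readings that land below the cut-off end up on the blank rack.
    false_negatives = sum(rng.gauss(analyte_mean, s_b) < cutoff for _ in range(n_each))
    return false_positives, false_negatives

fp, fn = sort_cuvettes()
print(f"misplaced blanks: {fp}, misplaced analyte cuvettes: {fn}")
```

Raising `n_each` to a large number makes both error fractions settle near the Gaussian tail beyond two standard deviations.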

In general, a cut-off of yB + 3sB is chosen instead of my personal pick of yB+4sB. The decision point P is therefore 1.5 sB away from the blank and from the analyte sample having the smallest signal that can be confidently distinguished from that of the blank. At that point P, the probability of reporting the absence of the analyte when it is in fact present is about 7% (have a look at this standard normal distribution quantile table). The probability of reporting the presence of the analyte when it is in fact absent is also 7%. Overall, in choosing a mean signal of yB+3sB as the smallest measure that can be distinguished from the blank, one takes a 7% risk of being wrong when linking the signal to the presence or absence of the analyte. The presence or the absence of the analyte, that is the all or nothing that Wolfgang was referring to.

How should that limit of detection be practically determined? The definition we have explored above relies on the knowledge of the true mean value of the measurement of the blank yB, of its true standard deviation sB and likewise of the true mean of the measurement of the LOD sample yLOD. And when I say true mean, I mean enough measurements for the frequency distribution to actually look like a bell and not like a quick and dirty lego tower. The above definition also relies on the assumption that the measurements of both blank and LOD samples will display the same standard deviation. If the standard deviation of the LOD sample is smaller than sB, then at point P, the rate of reporting a false negative will be smaller than 7% so it would not impair the efficiency of the discrimination. However if sLOD > sB, then at point P, the rate of false negatives would be higher than 7%. Higher to the point of being unacceptable? That depends on how much greater sLOD is and also on the application. But as a safety margin, it would be better not to assume anything about sLOD and just to measure it by recording plenty of readings. Back to the practicals, on a spectrophotometry case example, one would have to perform repeated measurements of the blank cuvette. One would then have to prepare cuvettes of analyte solutions at different concentrations and make repeated readings of them to try and pinpoint by trial and error the concentration giving rise to a mean signal of exactly yB + 3sB (for a confidence level of 93%). One would have to accurately measure its standard deviation to ascertain the actual confidence level. That concentration would then be the LOD. There, one should reward all these efforts with a nice cup of tea.
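To see how much the equal-standard-deviation assumption matters, here is a minimal sketch of the false negative rate at the decision point when sLOD differs from sB. It assumes Gaussian readings and a decision point halfway between yB and yB + k sB; `false_negative_rate` and `sd_ratio` (meaning sLOD / sB) are hypothetical names.

```python
from statistics import NormalDist

def false_negative_rate(k=3.0, sd_ratio=1.0):
    """Chance that a sample with true mean yB + k*sB reads below the decision point yB + (k/2)*sB.

    sd_ratio is sLOD / sB, the spread of the LOD sample relative to the blank.
    """
    # Distance from the sample mean to the decision point, in units of sLOD
    z = -(k / 2.0) / sd_ratio
    return NormalDist().cdf(z)

for ratio in (0.5, 1.0, 2.0):
    rate = false_negative_rate(3.0, ratio)
    print(f"sLOD = {ratio} sB -> false negative rate {100 * rate:.1f} %")
```

With equal standard deviations the rate is the roughly 7 % discussed above; doubling sLOD pushes it well above that, which is the reason for measuring sLOD rather than assuming it.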

So how is the LOD usually determined? In many, many instances, in most instances actually, it is not done in this way. I took a good look at how it was done in 5 of the papers that were highlighted in the review which was the object of my initial post, 5 random picks out of the 9 mercury detection papers that reported LOD below the lower limit of the concentration range in which a linear response was observed. I don’t think it useful to name names here. I did find problematic data treatment in 4 of them, and one minor problem in the last one. The mere fact that I could so easily find 4 papers with major analytical problems in them in a table that contains 15 references speaks volumes about our collective issue with good analytical practices.

  • Paper 1: The shown data do not display any error bars or any signature of replicate measurements, and no measurement of the blank. Replicate measurements were acquired only for the upper value of the investigated concentration range (100 ppb). Extrapolating the relative standard deviation of the upper concentration value to the lower concentration value of the investigated range (1 ppb), I estimated the signal that would significantly differ from that of the 1 ppb sample. That signal falls within the range in which the signal vs. concentration plot is linear (1 ppb to 100 ppb). I used that linear relationship to graphically determine a LOD: my estimate was 100 times higher than that stated (10 ppb vs. 0.1 ppb). This is of course an upper estimate of the LOD since it is derived from the lowest end of the investigated concentration range and not the blank. But the first concentration to give a signal significantly higher than that of the 1 ppb sample is 10 ppb (with 93% confidence) and the signal varies linearly with concentration on the 1-100 ppb range. Yet it is claimed that a blank sample can be distinguished with the same level of confidence from a 0.1 ppb sample. This is odd and would deserve thorough double checking by simply performing measurements of the blank.
  • Paper 2: A logarithmic concentration range was investigated, with replicate measurements (and error bars) for each of the concentrations. On the upper half of the investigated concentration range, the signal vs. logarithm of the concentration plot was linear. On the lower half, the signal dependence made a plateau. However, there were significant differences between the signals of contiguous concentrations, even in that non-linear section. The authors used these significant differences to reach an upper estimate of the LOD. They looked for the smallest pair of contiguous concentrations that would give signals being apart by more than three standard deviations (the standard deviations of the signals were similar for these low concentrations). This is exactly in the spirit of the IUPAC definition and it proves Wolfgang right: yes you can detect the presence of an analyte even out of the range where the signal dependence on the concentration is linear. You cannot say how much analyte there is but you can say with 93% confidence that it is not the blank sample. However, in this pair of contiguous concentrations that gave signals being apart by more than three standard deviations (10⁻⁷ and 10⁻⁸ M), the authors picked the lower concentration of the two as the LOD estimate. In my opinion they should have picked the greater concentration (10⁻⁷ M) of that pair since the lower one (10⁻⁸ M) is not statistically different from the next lowest contiguous concentration (10⁻⁹ M). But that is a detail, the methodology looks sound to me.
  • Paper 3: Again, a logarithmic concentration range is investigated here. The signal is plotted against the logarithm of the concentration (from 2 ppt to 1 ppb), with error bars on the data points, and fitted to a linear model. Obviously, the blank cannot be represented as a data point on that plot since log(0) is undefined. However, the text states that the LOD has been inferred from the standard deviation of the blank. A blank was therefore measured but the details of those measurements, its mean and SD values, are not reported (a horizontal line with a shaded envelope could have been added onto the graph to display yB and sB respectively). From the claimed LOD (0.8 ppt) and drawn error bars of the non-blank samples, I graphically searched for the signal value that was used as an estimate of the mean signal of the blank. It coincides with the intercept of the calibration curve with the drawn y-axis. Therefore I suspect that the mean blank signal yB was inferred by extrapolation of the calibration model. I also suspect that the standard error of that calibration model (inferred from the residuals) was used as an estimate of the SD of the blank.
  • Paper 4: Another logarithmic concentration range here. The text states that the signal of the “background” was measured but there is no hint of those measurements (as a line with envelope for instance) on the signal vs. logarithm of the concentration plot. The stated method for estimating the LOD seems to diverge from the IUPAC definition in as much as it does not refer to the standard deviation of the blank: “The LOD was calculated without Hg2+ giving SERS signal at least three times higher than background.” The claimed LOD is 0.45 ppt. Yet on the data, the signals recorded for the standards at 0.5 and 1 ppt do not appear significantly distinct and the signal vs. concentration plot over the range 0.5 ppt – 5 ppb has a sigmoidal shape. So I would doubt that the blank signal is significantly different from that of a 0.45 ppt sample.
  • Paper 5: A linear concentration range was explored here (1.1-61.1 nM). There are error bars on the data points. A blank sample does not seem to have been measured: no mention of it in the text and no display on the graph. Using the error bar and the mean of the lowest investigated concentration, I determined the value of a signal that would be three standard deviations higher. It fell into the signals obtained for the investigated concentration range. I then graphically read an estimate of the corresponding concentration: it was about 10 times higher than the claimed LOD (10 nM vs. 0.8 nM). I could not figure out how the claimed LOD was derived.
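The graphical estimate used for Papers 1 and 5 can be written down in a few lines. This is only a sketch under stated assumptions: a linear calibration signal = intercept + slope × concentration, and the lowest standard (mean y1, standard deviation s1) standing in for the missing blank, so the result is an upper estimate of the LOD. The function name and the numbers are hypothetical, not taken from the papers.

```python
def lod_upper_estimate(y1, s1, slope, intercept, k=3.0):
    """Concentration whose calibration signal equals y1 + k * s1.

    y1, s1: mean and standard deviation of the lowest measured standard.
    slope, intercept: linear calibration model (signal = intercept + slope * conc).
    """
    threshold = y1 + k * s1               # smallest signal distinguishable from y1
    return (threshold - intercept) / slope

# Hypothetical calibration: signal = 5 + 2 * conc (arbitrary units, conc in nM)
# Lowest standard at 1 nM reads 7 on average with a standard deviation of 4.
print(lod_upper_estimate(y1=7.0, s1=4.0, slope=2.0, intercept=5.0))
```

If the concentration returned lies well above a claimed LOD, as in Papers 1 and 5, the claim deserves a closer look with actual blank measurements.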

To summarize:

  • Yes, Wolfgang has a point. One can have the capacity to say that an analyte is present with fair confidence (to detect it) outside of the range where you can measure it (the Paper 2 example).
  • But I also have a point: for this to happen, you need to acquire data outside of the range in which you have established a calibration model. You cannot say anything about a concentration range in which you have not actually performed measurements. And this is unfortunately often the case (Papers 1, 3, 4 and 5).
  • The quality of these blank measurements matters a lot as well. If you do not acquire enough readings, your estimate of the mean (yB) is likely to diverge from the true mean because your frequency distribution won’t be smooth enough to see the true mean with precision. The more measurements you make, the smoother the frequency distribution, the closer your estimate gets to the true mean. For instance, if you estimate the mean from 3 measurements, you know with 95% confidence that the true mean lies within ± 4.30 ⨯ sB/√3 ≈ ± 2.48 ⨯ sB about your estimate of the mean. If you record 10 measurements, that interval shrinks to ± 2.26 ⨯sB/√10 ≈ ± 0.71 ⨯ sB.
  • Often, people infer the properties of the blank sample through extrapolating the linear portion of the signal vs. concentration (or logarithm of concentration). This can go very wrong. I will spend some time on it in the last section of this really, really long post.
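The confidence intervals quoted in the summary above follow from the Student t distribution. A minimal sketch, with the two-sided 95 % critical values hard-coded from a standard t table for a handful of sample sizes (the dictionary `T_95` and the function name are hypothetical):

```python
from math import sqrt

# Two-sided 95 % Student t critical values for n - 1 degrees of freedom,
# copied from a standard t table (only the sample sizes used here).
T_95 = {3: 4.303, 5: 2.776, 10: 2.262, 30: 2.045}

def ci_half_width(n, s_b=1.0):
    """Half-width of the 95 % confidence interval on the mean of n readings."""
    return T_95[n] * s_b / sqrt(n)

for n in sorted(T_95):
    print(f"n = {n:2d}: true mean within +/- {ci_half_width(n):.2f} sB of the estimate")
```

The interval narrows quickly between 3 and 10 readings and more slowly afterwards, which is why a blank measured in triplicate is a weak basis for a LOD.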

Why should we not infer the properties of a blank sample by extrapolation of a linear signal vs. concentration plot?

Let’s say that we have acquired measurement data for several analyte solutions spanning a linear concentration range. And we plot it. On this graph, I have extended the linear regression line towards the y-axis. If the regression model is valid down to the zero concentration, the mean signal of the blank sample will be y0. If the data show homoscedasticity, namely a homogeneous variance throughout the calibration range, we can give as an estimate of the standard deviation of the blank the same standard deviation as that observed for the data, s0.
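For readers who want to reproduce the extrapolation, here is a minimal least-squares sketch. It returns the intercept y0 and, as one possible estimate of s0, the residual standard deviation of the fit; taking the residual standard deviation for s0 is an assumption of this sketch, and the data are hypothetical.

```python
from math import sqrt

def extrapolate_blank(conc, signal):
    """Fit signal = y0 + slope * conc by least squares.

    Returns (slope, y0, s0) where s0 is the residual standard deviation.
    """
    n = len(conc)
    x_bar = sum(conc) / n
    y_bar = sum(signal) / n
    sxx = sum((x - x_bar) ** 2 for x in conc)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(conc, signal))
    slope = sxy / sxx
    y0 = y_bar - slope * x_bar            # extrapolated signal at zero concentration
    residuals = [y - (y0 + slope * x) for x, y in zip(conc, signal)]
    s0 = sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return slope, y0, s0

# Hypothetical calibration data (concentration in nM, signal in arbitrary units)
conc = [10.0, 20.0, 40.0, 60.0, 80.0, 100.0]
signal = [25.1, 44.8, 85.3, 124.6, 165.2, 204.9]
slope, y0, s0 = extrapolate_blank(conc, signal)
print(f"slope = {slope:.3f}, y0 = {y0:.2f}, s0 = {s0:.2f}")
```

Whatever values come out, they describe the model, not the blank: only actual blank readings can tell whether y0 and s0 are fair stand-ins for yB and sB.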

Then we acquire repeated measurements of a blank sample, enough to have a smooth frequency distribution (let’s say 30). We calculate the mean yB and standard deviation sB. There are three possible mathematical options: yB is either equal to, smaller than or greater than y0. Let’s look at those three cases.

Case n°1: the two means yB and y0 do not differ significantly (as ascertained for instance by a t-test with yB, sB, y0 and s0 as arguments). This is a trivial case: the blank should be included in the calibration model. (The LOD would then fall within the linear calibration range. Yep, I can be irritating.)

Case n°2: the mean blank signal inferred from the measurements, yB, is significantly smaller than y0 (as ascertained by a one-tailed t-test). In this case, the true LOD is smaller than the one estimated from a y0+3s0 signal. It could also be smaller than the lower end of the linear calibration range (data point (c1,y1)). This would occur if yB+3sB <y1. I could not recall any example of such situation but that hardly counts as a solid proof that it is impossible. At any rate, this proves the necessity of exploring concentrations outside the linear calibration range because the sensor could then have a use, a detection one and not a quantification one, outside the initially explored range.

Case n°3: the mean blank signal inferred from the measurements, yB, is significantly greater than y0 (as ascertained by a one-tailed t-test). An estimation of the LOD based on a y0+3s0 signal would then be wrong. What one can do is to check whether y1 and yB are significantly different. If not, then the true LOD needs to be looked for within the linear calibration range (there, I am still trying to push my point). If y1 and yB do differ significantly, the true LOD needs to be looked for between 0 and c1, by acquiring more data. This type of non-linearity is what you could observe in a protein assay for instance, when the fraction of proteins consumed by adsorption on the sidewalls of the cuvettes becomes non-negligible compared to the total amount in solution.
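The three cases can be told apart with a simple comparison of yB and y0. The sketch below is deliberately crude and rests on loud assumptions: it treats y0 as if it were the mean of n0 readings with standard deviation s0, uses a Welch-type t statistic, and applies one user-supplied critical value `t_crit` to both tails instead of looking up proper one- and two-tailed values. All names and the default threshold are hypothetical.

```python
from math import sqrt

def compare_blank_to_intercept(y_b, s_b, n_b, y0, s0, n0, t_crit=2.0):
    """Classify measured blank (y_b, s_b, n_b) against extrapolated blank (y0, s0, n0).

    Simplification: y0 is treated like a mean of n0 readings with SD s0.
    t_crit is a placeholder; take the proper value from a t table.
    """
    t = (y_b - y0) / sqrt(s_b ** 2 / n_b + s0 ** 2 / n0)
    if abs(t) <= t_crit:
        return "case 1: yB and y0 do not differ significantly"
    if t < 0:
        return "case 2: yB is significantly smaller than y0"
    return "case 3: yB is significantly greater than y0"

# Hypothetical numbers: 30 blank readings against a 6-point calibration
print(compare_blank_to_intercept(y_b=5.6, s_b=0.4, n_b=30, y0=5.0, s0=0.5, n0=6))
```

A more careful treatment would use the standard error of the regression intercept rather than s0/√n0.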

I wrote most of this stuff with the help of Statistics and chemometrics for analytical chemistry, by Miller & Miller. This handy little book has been used and copied so much in my group that its pages are coming apart, the definitive experimental demonstration that it is worth reading.

And with this, I sincerely hope I have not bored you to death. Thank you for your attention.

Editor’s note. Did you know that you can cite blog posts in the scientific literature? For example, you could cite this one as follows: Charron, Gaëlle. “Down the rabbit hole of the Limit Of Detection (LOD)”. Rapha-z-lab, 14/12/2021.

What’s a limit of detection anyway? Wolfgang Parak responds to Gaëlle Charron’s blog post

In her guest post (Sensing by Surface Enhanced Raman Scattering (SERS) : to the Moon and back down to Earth again) published last week, Gaëlle criticised SERS articles that reported a limit of detection (LOD) below the limit of the linear range:

The range onto which the sensor responds linearly, onto which the signal vs. concentration calibration model will be built, is 0.5-1000 nM. Yet a LOD of 0.18 nM is claimed. What happens to the signal dependence on the concentration below 0.5 nM is either too noisy or too flat to enter the calibration model or it has simply not been tested. Yet, the sensor is claimed to be operational at a concentration within this uncharted territory. Out of the 13 entries in the table dealing with mercury sensing, 9 display a LOD below the lower limit of the sensitivity range. Error is not incidental here, it is the norm.

Wolfgang does not agree that this is an error and he wrote to us. Here is his letter.

Dear Gaëlle and Raphaël

I am not sure if I understand one of your arguments. In my point of view the LOD, the limit of detection, can be lower than the lowest value of the linear range. I thought the typical definition of the LOD is the concentration in which the detection signal is at least three times bigger than the noise in the signal. This is an “all or nothing” response. At the LOD you can tell that “there is something”, but you can’t necessarily tell how much. The range of the linear response is much harder to achieve. Here the signal needs to go linearly with the concentration. I think you can have the case where for example at 1 nM you see a signal (3 times higher than noise), but for example at 2 nM the signal would not be twice. You see it, but due to noise and sensor response properties you could not really quantify how high the concentration is. You could in this case for example have from 10 nM to 500 nM a linear response, where the response really follows in a linear behavior to the concentration. Thus, for my understanding in general the LOD can be lower than the lower limit of the linear range.

I am not 100% sure about this, but this is how I understood the definition.

Best wishes, Wolfgang

What do you think? Can a LOD be below the linear sensitivity range of a sensor?

Guest Post: Sensing by Surface Enhanced Raman Scattering (SERS) : to the Moon and back down to earth again

This is a guest post by Gaëlle Charron, Maîtresse de conférences at Université de Paris.

I was about to submit a paper about the detection of atomic ions by SERS the other day. The paper had been in the pipeline for months. I went through a last survey of the recent literature to check for fresh references that it would have been unfair to leave out. That is when I bumped into a 22-page review about just that: Examples in the detection of heavy metal ions based on surface-enhanced Raman scattering spectroscopy.

It sucks, I thought, as cold sweat was pouring down my neck. I have been working on this “novel” idea for 9 years now. Revisiting the dyes developed as colorimetric indicators for the detection of metal ions through a SERS angle. SERS sensors exploiting not the absorption properties of the indicators, but their vibrational signatures. When I first thought about it, sometime in 2012, I was excited as a blinking Christmas tree. The literature about colorimetric quantification of metal ions, mainly from the 40’s, 50’s and 60’s, was rich, reliable and pretty informative. Lots of options for commercial indicators, lots of experimental details, many of them put to use in classic lab courses. And above all, lots of thermodynamic constants to use to emulate the chemical system. It felt like I could do nano properly, to the quantitative standards of textbook chemistry. Obviously, many people would see the same opportunity, as every undergraduate chemistry student will have played with these complexometric indicators at one point or another in her curriculum. About a month after my initial epiphany, I discovered that Luis Liz-Marzán had already killed the game. An ACS Nano paper about ultrasensitive chloride detection, in the pM range. And now a full review was out, with its 80 examples of metal ion detection by SERS. I was late.

Feeling moody, I dived into the review. A general introduction about how deleterious and ubiquitous metal contamination is, about how heavy metals are usually quantified and about the limitations of those methods. The classic primer about SERS and its accepted mechanisms. And then, metal target by metal target, examples of dedicated SERS sensors.

In the general introduction, a sentence caught my attention. More and more researchers have used SERS technology to detect and quantitatively or semiquantitatively analyse heavy metal ions in various environments. That sentence cites a paper of mine about the setting-up of a SERS sensor of Zn2+ in pure water, ie. in the simplest of matrices, in a lab environment. Like in the other cited references associated with that sentence, my team did not use SERS technology to quantify a metal contaminant in the environment. We just examined the possibility of quantifying that contaminant with SERS. Much like examining the effectiveness of a drug to treat a disease does not mean that the drug is used to cure the disease.

Why does it make a big difference? For one, for the sake of accuracy. Then because there are many shortcomings to developing any new chemical analysis method. Is the sensitivity appropriate for the concentration range in which the target analyte will likely be encountered? Are the readouts true enough, precise enough? How much time, effort and money does it take to produce a readout? How likely is the method to work every day of the week when we switch the spectrometer on or open the fridge to reach for a nanoparticle batch? All of the above compared to the standard analytical methods? All of the above when analysing a typical specimen of the targeted samples and matrices?

The authors of the review acknowledged, at least partially, those potential pitfalls. Summary tables of examples of detection were given for each metal ion with ranges of linear response to analyte concentration, LODs and comments. The latter listed the following adjectives: sensitive, accurate, anti-interference, reliable, complicated, selective, low sensitivity, simple, rapid, low reproducibility. Yet, at no point in the review is the practicality of the reviewed SERS sensors discussed in comparison with the standard methods used by people actually performing chemical analysis of contaminants. It seems like those people were never consulted. Like the mad nano-scientists and the end-users were never put in the same room with a tea trolley.

The simplest illustration of this appears in the reported LODs and sensitivity ranges, for instance in the case of mercury quantification. The maximum concentration in drinking water set by the US Environmental Protection Agency is 10 nM. The concentration range for mercury in drinking water is the same as in rain, with an average of about 125 pM. Naturally occurring mercury concentration in groundwater is less than 2.5 nM. Yet Table 2 of the review lists a sensitivity range of 10 fM to 100 pM, fully irrelevant to flagging abnormal concentrations in drinking water, rain water or groundwater, or several sensitivity ranges unsuitable to assess the safety of drinking water (9.97 pM-4.99 nM, 4.99 pM-2.49 nM).
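The odd-looking molar values above (9.97 pM, 4.99 nM) look like round mass concentrations converted with the molar mass of mercury. A minimal sketch of that conversion, assuming dilute aqueous solutions in which 1 ppb is taken as 1 µg/L and 1 ppt as 1 ng/L; that the 10 nM limit corresponds to 2 µg/L is an assumption of this sketch, not something stated in the text.

```python
HG_MOLAR_MASS = 200.59  # g/mol, molar mass of mercury

def ug_per_l_to_nM(c_ug_per_l, molar_mass=HG_MOLAR_MASS):
    """Convert a mass concentration in µg/L into a molar concentration in nM."""
    # µg/L -> g/L -> mol/L -> nmol/L
    return c_ug_per_l * 1e-6 / molar_mass * 1e9

for label, c in (("2 µg/L", 2.0), ("1 µg/L (1 ppb)", 1.0), ("0.002 µg/L (2 ppt)", 0.002)):
    print(f"{label}: {ug_per_l_to_nM(c):.5f} nM")
```

Expressing both the regulatory limits and the sensor ranges in the same unit is what makes the mismatch between them visible at a glance.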

The focus of the description of those sensors is on the chemical schemes put to use, many of which sound rather overhyped.

Wang et al. created a dual signal amplification strategy based on antigen-antibody reaction to recognize copper ions (Figure 4c) [78]. Specifically, they started with decorating the multiple antibiotic resistance regulator (MarR) that worked as bridging molecules and 4-MBA served as a Raman reporter on the surface of AuNPs, and then Cu2+ ions generated disulfide bonds between the two MarR dimers by oxidizing cysteine residues, which induced the formation of the MarR tetramers, leading to the aggregation of AuNPs and the reinforcement of the SERS signal of 4-MBA. In the meantime, another substrate, AgNPs capped with anti-Histag antibodies combined with MarR (C-terminal His tag) to constitute dual hot spots and the reticulation of AuNP-AgNP heterodimers. The dramatic signal enhancement allowed the detection limit to reach 0.18 nM with a linear response in the range of 0.5-1000 nM.

Take a deep breath. And a drink. The smart mouth contest goes on and on.

Without quite a full chemical legitimacy I would say. In the previous example, you might have felt an itch. Let’s rewind and replay.

The dramatic signal enhancement allowed the detection limit to reach 0.18 nM with a linear response in the range of 0.5-1000 nM.

The range onto which the sensor responds linearly, onto which the signal vs. concentration calibration model will be built, is 0.5-1000 nM. Yet a LOD of 0.18 nM is claimed. What happens to the signal dependence on the concentration below 0.5 nM is either too noisy or too flat to enter the calibration model or it has simply not been tested. Yet, the sensor is claimed to be operational at a concentration within this uncharted territory. Out of the 13 entries in the table dealing with mercury sensing, 9 display a LOD below the lower limit of the sensitivity range. Error is not incidental here, it is the norm.

Also, as an undergraduate, I had learnt that an indicator abruptly changes speciation at an analyte concentration on the order of the dissociation constant of the indicator-analyte complex. A sensitivity in the pM range would call for a 10⁻¹² dissociation constant, a magnitude that is only encountered with chelating ligands with many binding atoms and/or at high pH, conditions that were not discussed, and very seldom met in the reviewed examples. But that may be the object of a full discussion in itself: does anyone understand how the sensing actually occurs, I mean beside our fantasized chemical sketches?
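The link between dissociation constant and working range can be illustrated with the simplest binding model. This sketch assumes a 1:1 indicator-analyte complex and an indicator present in trace amounts, so that the free analyte concentration is close to the total one; `fraction_bound` and the Kd value are hypothetical.

```python
def fraction_bound(analyte_conc, k_d):
    """Fraction of indicator complexed at a given free analyte concentration (1:1 binding).

    analyte_conc and k_d must be in the same unit (mol/L here).
    """
    return analyte_conc / (analyte_conc + k_d)

k_d = 1e-6  # hypothetical dissociation constant, mol/L
for conc in (1e-12, 1e-9, 1e-6, 1e-3):
    print(f"[analyte] = {conc:.0e} M -> fraction bound = {fraction_bound(conc, k_d):.2e}")
```

Half of the indicator is complexed when the analyte concentration equals Kd; six orders of magnitude below Kd, about one indicator molecule in a million has changed speciation, which leaves very little to detect.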

I still went through the full review. The conclusion nearly had a point, too bad they did not think it was worth an actual discussion.

At present, SERS technology basically stays as a laboratory test, which still has a big challenge for the quantitative testing on-site and actual complex samples, so it cannot be regarded as one of the conventional detection assays.

(Damn right it isn’t. I have been chasing my own tail for 9 years.)

Closing the paper print, an image came to my mind. That of a father of three kids going to a car dealership, looking for a vehicle that could accommodate three child safety seats. To which the dealer presents a half-assembled Lamborghini.

–          Preliminary tests indicate that it can go to 200 km/hr in 10 s. It will have a DVD screen on the passenger side.

–          I doubt the 3 car seats will fit at the back.

–          Leather seats are included.

–          It is half-finished. Plus it has breadcrumbs all over and a large crack in the windshield.

–          Yeah. You might want to fix that before you drive with the kids.

Lamborghini Model SERS – Credit to Nathanaël Lévy (13 years old)