
How to remove all HTML tags from a string?

When counting text characters in HTML strings, you usually do not want to include the characters in the HTML markup code. While parsing HTML with RegExp does have its limits, in this case, RegExp works very well.

To strip all HTML tags from a string, you can do a RegExp replace with /<(.|\n)*?>/g:

const html = '<p>Lorem <a href="/page">ipsum</a> <img src="/image.png"></p>';
const text = html.replace(/<(.|\n)*?>/g, '');

console.log(text);
// Output: Lorem ipsum
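
To turn the replace above into an actual character counter, a minimal sketch could look like this. The helper names stripTags and countTextCharacters are made up for this illustration, and the sketch assumes that HTML entities such as &amp; may be counted as written.

```javascript
// Hypothetical helpers for illustration; not part of any library.

// Remove anything that looks like an HTML tag: "<", then the shortest
// possible run of characters (including newlines), then ">".
function stripTags(html) {
  return html.replace(/<(.|\n)*?>/g, '');
}

// Count the characters left once the tags are gone.
// trim() drops leading/trailing whitespace left behind by removed tags.
function countTextCharacters(html) {
  return stripTags(html).trim().length;
}

const sample = '<p>Lorem <a href="/page">ipsum</a> <img src="/image.png"></p>';

console.log(countTextCharacters(sample)); // 11
```

Note that the pattern stops at the first > it finds, so an input like <img alt="a > b"> would leave  b"> behind. That is one of the limits of the RegExp approach.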

If the text is already rendered in a browser

If the content is already rendered in a browser context, you can simply find the element and use .textContent:

const element = document.querySelector('#my-element');
const text = element.textContent;

console.log(text);
// Output: Only text
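
Because .textContent returns the text of every descendant node, including line breaks and indentation from the source markup, a character count may benefit from collapsing whitespace first. A minimal sketch, where countRenderedCharacters is a hypothetical helper and #my-element is the selector from the example above:

```javascript
// Hypothetical helper: works with any object exposing a textContent string,
// such as a DOM element returned by document.querySelector().
function countRenderedCharacters(element) {
  // Collapse runs of whitespace to single spaces, then trim both ends.
  const text = element.textContent.replace(/\s+/g, ' ').trim();
  return text.length;
}

// Only runs where a DOM is available, for example in a browser.
if (typeof document !== 'undefined') {
  const target = document.querySelector('#my-element');
  if (target) {
    console.log(countRenderedCharacters(target));
  }
}
```

Whether collapsed whitespace should count is a design choice; drop the replace() call if every character matters.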

XSS (cross-site scripting) risks of filtering with innerHTML/textContent

It is also technically possible to use a less secure alternative: assign the HTML string to .innerHTML and read the text back with .textContent:

element.innerHTML = '<p>Lorem ipsum</p>';
const text = element.textContent;

DOMParser.parseFromString() can be used as well, but a word of caution is in order here. Assigning HTML code directly to .innerHTML risks leaving your system vulnerable to cross-site scripting (XSS) attacks, because the code might come from user input.

This means that JavaScript embedded in that HTML code, for example in inline event handlers such as onerror, can run, and references to assets such as images or fonts can trigger HTTP requests to external domains, where the URLs might expose information about the user.

If you still want to use this solution, I would recommend first filtering the string with a security package like DOMPurify. That makes the solution considerably more complex than the RegExp suggested first, so the RegExp is probably the preferred solution here.
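
For reference, the sanitize-first variant could be sketched roughly as follows. This assumes DOMPurify has been loaded separately and uses its sanitize() method with default settings; the function name textFromUntrustedHtml and its purify and doc parameters are hypothetical and only there to keep the sketch free of globals.

```javascript
// Sketch only: `purify` is assumed to be a loaded DOMPurify instance and
// `doc` a browser document. Both are passed in as parameters (assumption).
function textFromUntrustedHtml(html, purify, doc) {
  // Let the sanitizer remove scripts, event handlers and similar markup.
  const cleanHtml = purify.sanitize(html);

  // Parse the sanitized markup in a detached element that is never
  // added to the page, then read back only its text.
  const container = doc.createElement('div');
  container.innerHTML = cleanHtml;
  return container.textContent;
}

// Browser usage, assuming DOMPurify is available as a global:
if (typeof DOMPurify !== 'undefined' && typeof document !== 'undefined') {
  console.log(textFromUntrustedHtml('<p>Lorem ipsum</p>', DOMPurify, document));
}
```

Compared with the one-line replace, this needs an extra dependency and a DOM, which is why the RegExp remains the simpler choice for counting characters.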

Count and stay safe!
