Natural language processing (NLP) and text queries can run more effectively when you identify stop words. But what are they, and what can you do about them? Learn more about stop words and why it’s useful to avoid them.
![[Featured Image] Two SEO specialists sit at a table in a conference room and discuss the importance of stop words.](https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/https://images.ctfassets.net/wp1lcwdav1p1/1RGiiSwbv2kIsxBs3d3btd/6439c1054e44bc3d88ff1a1a98b4fc06/GettyImages-169416416.jpg?w=1500&h=680&q=60&fit=fill&f=faces&fm=jpg&fl=progressive&auto=format%2Ccompress&dpr=1&w=1000)
Stop words, such as “the” and “with,” can make it more challenging to extract meaningful information from your data when programming an NLP model.
Certain topic modeling techniques, such as latent Dirichlet allocation (LDA), typically rely on removing stop words before identifying topics in document collections.
Job roles that may require an understanding of stop words include NLP engineers, NLP data scientists, artificial intelligence engineers, or software engineers.
You can learn more about stop words to enhance your skill set when working as a machine learning engineer.
Discover what stop words are, where they are used, and why to avoid them, and learn how to create a stop words list.
Stop words are words that carry little information on their own, from function words like “the” and “with” to fillers like “um,” “like,” and “you know.” Stop words can make it more challenging to extract meaningful information from your data. A common approach when building machine learning models is to create a stop words list. Identifying stop words in advance typically makes it easier for the model to reduce the “noise.”
In NLP, stop words are words that add little value when a system answers queries. When you program natural language processing (NLP) models or data retrieval systems, you need to tell the computer not to include these words. Typically, someone defines a filter for words that would not help select relevant content. The model can then ignore these uninformative words and, as a result, move more quickly through larger, more diverse amounts of data to surface insights. The specific stop words can vary based on context.
In English, examples of stop words include:
Articles: A, an, the
Conjunctions: And, but, or
Prepositions: In, on, at, with
Pronouns: He, she, it, they
Common verbs: Is, am, are, was, were, be, being, been
Yet different languages have different stop words (“and” in English, “und” in German). So do different databases based on their subject matter. What ProQuest considers a frequently used and, therefore, uninformative word can vary from what Clarivate Web of Science considers a stop word, for example.
Knowing the stop words for databases you use regularly can also help you hone your search statements. You’ll know which words to exclude, so you can describe your topic with the most significant words.
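To make the idea concrete, here is a minimal sketch of stop word filtering with a separate list per language. The `STOP_WORDS` dictionary and the `remove_stop_words` helper are illustrative assumptions, not part of any particular library or database, and the lists are deliberately tiny.

```python
import re

# Tiny illustrative stop lists (assumed for this example).
# Real lists are much longer and differ from tool to tool.
STOP_WORDS = {
    "en": {"a", "an", "the", "and", "but", "or", "in", "on", "at", "with",
           "he", "she", "it", "they", "is", "am", "are", "was", "were",
           "be", "being", "been"},
    "de": {"der", "die", "das", "und", "oder", "in", "mit", "ist"},
}


def remove_stop_words(text, language="en"):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return [t for t in tokens if t not in STOP_WORDS[language]]


english = remove_stop_words("The cat and the dog were in the garden")
german = remove_stop_words("Die Katze und der Hund", language="de")
print(english)
print(german)
```

Because the filter is just a set lookup, swapping in a database-specific or domain-specific list only means changing the contents of the set.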
Stop words matter in search engine optimization because search engines use natural language processing to understand your request and deliver relevant search results. If the NLP algorithm powering search ignores stop words in queries, it also ignores them within web content, which could affect how content ranks in search results. Modern search engines, however, are better at understanding what users are searching for, which often depends on reading and interpreting words typically considered stop words.
For example, a search for “denim jeans” might deliver a list of potential jeans to buy, but a search for “denim in jeans” might return an informative article about the common fabrics used to create jeans. It may be more beneficial to think about using simple terms to describe your content exactly, rather than removing stop words as a strategy for SEO.
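The following sketch illustrates why the two queries can become indistinguishable once stop words are stripped. It is a toy normalization step, not how any real search engine works; `STOP_WORDS` and `normalize_query` are hypothetical names chosen for this example.

```python
# Hypothetical stop list for the example.
STOP_WORDS = {"in", "the", "a", "of"}


def normalize_query(query, drop_stop_words):
    """Return the query as a tuple of lowercase tokens, optionally filtered."""
    tokens = query.lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tuple(tokens)


queries = ("denim jeans", "denim in jeans")

# With stop word removal, both queries collapse to the same token sequence.
with_removal = {normalize_query(q, drop_stop_words=True) for q in queries}

# Without it, the preposition keeps the two intents apart.
without_removal = {normalize_query(q, drop_stop_words=False) for q in queries}

print(len(with_removal), len(without_removal))
```

A system that keeps the preposition has at least a chance of telling the shopping intent from the informational one.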
Information retrieval systems typically work with stop lists that collect uninformative words to discard during indexing. Filtering out stop words can help the system weigh the relevancy of the content to the topic searched. After all, deciding what data to store or retrieve often relies on determining the ratio of topic-related words in the text to the total number of words in the text. By cutting the stop list words, you reduce the total number of words considered, which can yield more accurate results.
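The ratio described above can be sketched in a few lines. This is a simplified illustration under assumed inputs: the `topic_ratio` helper, the sample sentence, and both word sets are made up for the example, and production systems use more elaborate weighting schemes.

```python
def topic_ratio(tokens, topic_terms, stop_words=frozenset()):
    """Share of the kept tokens that belong to the topic vocabulary."""
    kept = [t for t in tokens if t not in stop_words]
    if not kept:
        return 0.0
    hits = sum(1 for t in kept if t in topic_terms)
    return hits / len(kept)


text = "the jeans in the store are made of denim and the denim is blue"
tokens = text.split()

topic_terms = {"jeans", "denim"}                       # assumed topic vocabulary
stop_words = {"the", "in", "are", "of", "and", "is"}   # assumed stop list

raw = topic_ratio(tokens, topic_terms)                 # 3 hits out of 14 tokens
filtered = topic_ratio(tokens, topic_terms, stop_words)  # 3 hits out of 6 tokens

print(f"raw={raw:.2f} filtered={filtered:.2f}")
```

The number of topic words stays the same, but the denominator shrinks, so the topic signal stands out more clearly against the rest of the text.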
You might need to know about stop words if you want a career as an NLP engineer, NLP data scientist, machine learning engineer, artificial intelligence engineer, or software engineer. Understanding stop words can also help you in fields that rely on text mining. You might not design and develop the algorithms, but knowing how to search more effectively could aid your text analysis in a customer service, risk management, maintenance, health care research, or cybersecurity role.
Stop word lists can improve accuracy and efficiency for information retrieval and search engines. They come in handy when classifying text (e.g., for sentiment analysis) and when mining and analyzing large volumes of text. Eliminating stop words simplifies the identification of themes and patterns to surface important information.
Some topic modeling techniques, such as latent Dirichlet allocation (LDA), typically rely on removing stop words before identifying topics in document collections.
You could also encounter stop words in machine translation, as filtering out the unimportant words reduces noise in the output.
You can find much discussion about the value of culling stop words. Researchers in the field continue to weigh the benefits and drawbacks of the stop words approach. In this section, we summarize some of the main points to consider.
Generally, the stop words approach, when done well, can benefit model quality. Databases programmed to ignore common words can provide more accurate results more efficiently. The model’s search improves, and you will simultaneously get fewer (yet more focused) results returned.
You can’t find a single, standardized list of stop words. That’s because the list of words needs to evolve continually. Plus, it should reflect domain knowledge and be language-specific. For example, the Natural Language Toolkit (NLTK), a Python library, ships its own generic stop word lists. Still, even when using Python, users within the field of finance and accounting might develop their own stop words around auditing or currencies.
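As a rough sketch of how a team might layer domain terms on top of a generic list, consider the example below. The two word sets and the `build_stop_list` helper are hypothetical; in practice the generic list would usually come from whatever toolkit you already use.

```python
# Stand-in for a generic list from a toolkit (assumed, heavily shortened).
GENERIC_STOP_WORDS = frozenset(
    {"the", "a", "an", "and", "of", "in", "for", "is", "was"}
)

# Hypothetical finance-specific additions.
FINANCE_STOP_WORDS = frozenset({"usd", "eur", "audit", "fiscal"})


def build_stop_list(generic, domain_additions=(), keep=()):
    """Combine a generic list with domain terms, minus words to always keep."""
    return (set(generic) | set(domain_additions)) - set(keep)


# One project might decide that "audit" is meaningful after all.
stop_list = build_stop_list(GENERIC_STOP_WORDS, FINANCE_STOP_WORDS, keep={"audit"})

tokens = "the audit of fiscal revenue in usd".split()
filtered = [t for t in tokens if t not in stop_list]
print(filtered)
```

The `keep` argument is one way to record exceptions explicitly, which makes it easier to revisit the list as it evolves.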
The time it takes to curate a stop words list is another limitation. After all, someone has to construct that word list. Plus, having a human compile the list can enshrine bias, as whether a word qualifies or not is a subjective decision. For example, if someone aggressively prunes words from a model, the results could skew in the direction of whatever that analyst thought important from the outset (before the model even runs).
Generating a stop words list is a common solution to avoid the distraction of all those uninformative words. The source of your stop list depends on the context. You’ll need to consider the specific programming language, its own generic stop list (if it has one), and the scope and context of your searches. For example, a company with proprietary products might want to use even more specific terms.
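One simple way to generate candidates for a stop list is to look for words that appear in most documents of your own corpus. The sketch below uses document frequency with an arbitrary threshold; the `stop_list_from_corpus` function, the 0.75 cutoff, and the sample documents are all assumptions for illustration, and a person would still need to review the candidates.

```python
from collections import Counter


def stop_list_from_corpus(documents, min_doc_fraction=0.75):
    """Return words that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    threshold = min_doc_fraction * len(documents)
    return {word for word, count in doc_freq.items() if count >= threshold}


docs = [
    "the printer is out of ink",
    "the router is offline",
    "the invoice is overdue",
    "reset the password of the account",
]

candidates = stop_list_from_corpus(docs)
print(sorted(candidates))
```

Lowering the threshold produces a longer, more aggressive list, which is where the bias risk described earlier comes in.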
Beyond these basics, you can find rigorous research studies into different approaches to developing stop word lists. Data science researchers weigh the computational effort a method requires against the benefit it delivers.
Editorial Team
Coursera’s editorial team is comprised of highly experienced professional editors, writers, and fact...