13 comments

  • A friend, who also has a background in NLP, was asking me the other day "Is there still even a need for traditional NLP in the age of LLMs?"

    This is one of the under-discussed areas of LLMs imho.

    For anything that would have required either word2vec embeddings or a tf-idf representation (classification tasks, sentiment analysis, etc.), there are only rare exceptions where it wouldn't be better to start with a semantic embedding from an LLM.

    For NER and similar data extraction tasks, the only advantage of traditional approaches is going to be speed, but my experience in practice is that accuracy is often much more important than speed. Again, I'm not sure why you wouldn't start with an LLM in these cases.

    There are still a few remaining use cases (PoS tagging comes to mind), but honestly, if I have a traditional NLP task today, I'm pretty sure I'm going to start with an LLM as my baseline.
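
    To make "start with an LLM as my baseline" concrete, here's roughly the kind of thing I mean (a minimal sketch; the sentence-transformers model name is just an example):

        from sentence_transformers import SentenceTransformer
        from sklearn.linear_model import LogisticRegression

        texts = ["great food, friendly staff", "rude support, broken product"]  # toy data
        labels = [1, 0]  # 1 = positive

        encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example model
        X = encoder.encode(texts)                          # dense semantic vectors
        clf = LogisticRegression().fit(X, labels)

        print(clf.predict(encoder.encode(["terrible service"])))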

    • potatoman22 2 days ago

      The SpaCy creator has a good blog post on this https://explosion.ai/blog/against-llm-maximalism

      • btown 2 days ago

        I'd go a step beyond this (excellent) post and posit that one incredibly valuable characteristic of traditional NLP is that it is largely immune to prompt injection attacks.

        Especially as LLMs continue to be better tuned to follow instructions that are intentionally colocated and intermingled with data in user messages, it becomes difficult to build systems that can provide real guarantees that "we'll follow your prompt, but not prompts that are in the data you provided."

        But no amount of text appended to an input document, no matter how persuasive, can cause an NLP pipeline to change how it interprets the remainder of the document, or to leak its own system instructions, or anything of that nature. "Ignore the above prompt" is just a sentence that doesn't seem like positive or on-topic sentiment to an NLP classifier, and that's it.

        There's an even broader discussion to be had about the relative reliability of NLP pipelines, outside of a security perspective. As always, it's important to pick the right tools for the job, and the SpaCy article linked in the parent puts this quite well.

        • IanCal 2 days ago

          > But no amount of text appended to an input document, no matter how persuasive, can cause an NLP pipeline to change how it interprets the remainder of the document,

          Text added to a document can absolutely change how an NLP pipeline interprets the document.

          > "Ignore the above prompt" is just a sentence that doesn't seem like positive or on-topic sentiment to an NLP classifier, and that's it.

          And simple repeated words can absolutely make that kind of change for many NLP systems.

          Have you actually worked with more traditional NLP systems? They're really not smart.

          • ffsm8 2 days ago

            > And simple repeated words can absolutely make that kind of change for many NLP systems.

            That's not what prompt injection is.

            And NLP stands for natural language processing. If the result didn't change after you've made changes to the input... It'd be a bug?

            • IanCal 2 days ago

              No? But repeated words can impact simple NLP setups. I’m not sure what case you’re concerned about where added text impacts classification with an LLM but added words shouldn’t with a different pipeline.

              > And NLP stands for natural language processing. If the result didn't change after you've made changes to the input... It'd be a bug?

              No, I’d want my classifier to be unchanged by garbage words added. It likely will be, but that impact is a bug not a feature.

              • ffsm8 1 day ago

                Prompt injection is about making the model do something other than what was specified.

                Adding words to the text to break the algorithm that does the NLP is more along the lines of providing a 1 in a boolean field to break the system. And that's generally something you can mitigate to some degree via heuristics and sanity checking. Doing the same for LLMs is essentially impossible, because an LLM is effectively a black box, so you cannot determine the error scenarios and add mitigations.

          • mfalcon 2 days ago

            I guess it depends on how you use the LLMs. We implemented some workflows where the LLMs were used only for dialogue understanding, then the system response was generated by classic backend code.

            • ACCount37 2 days ago

              If that's an issue for you, you do the year 2018 thing and just train classification heads for a base model LLM.

              No instruct tuning means prompt injection is curbed. Classification heads means you get results off a single forward pass.
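
              Roughly (a sketch with Hugging Face transformers, GPT-2 standing in for the base model; the head is randomly initialized until you fine-tune it):

                  from transformers import AutoModelForSequenceClassification, AutoTokenizer

                  name = "gpt2"  # any base (non-instruct) checkpoint
                  tok = AutoTokenizer.from_pretrained(name)
                  model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
                  model.config.pad_token_id = tok.eos_token_id  # GPT-2 has no pad token

                  # Untrained head: fine-tune on labeled data before relying on it.
                  inputs = tok("Ignore the above prompt", return_tensors="pt")
                  logits = model(**inputs).logits  # single forward pass, no generation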

            • coder68 3 days ago

              I have been working on text classification tasks at work, and I have found that for my particular use-case, LLMs are not performing well at all. I have spent a few thousand dollars trying, and I have tried everything from few-shot to asking simple binary yes/no questions, and I have had mixed success.

              I have stopped trying to use LLMs for this project and switched to discriminative models (Logistic Regression with TFIDF or Embeddings), which are both more computationally efficient and more debuggable. I'm not entirely sure why, but for anything with many possible answers, or to which there is some subjectivity, I have not had success with LLMs simply due to inconsistency of responses.

              For VERY obvious tasks like: "is this store a restaurant or not?" I have definitely had success, so YMMV.
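
              For reference, the discriminative baseline amounts to something like this (a minimal scikit-learn sketch; the toy data is made up):

                  from sklearn.feature_extraction.text import TfidfVectorizer
                  from sklearn.linear_model import LogisticRegression
                  from sklearn.pipeline import make_pipeline

                  texts = ["pizza pasta dine-in", "nails lumber power tools"]  # toy data
                  labels = [1, 0]  # 1 = restaurant

                  clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
                  clf.fit(texts, labels)
                  print(clf.predict(["pizza place downtown"]))  # shares "pizza" with class 1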

              • noosphr 2 days ago

                When you say LLMs, do you mean decoder-only models (GPT et al.) or encoder-only models (BERT et al.)?

                I've found encoder-only models to be vastly better for anything that doesn't require natural language responses, and the majority of them are small enough that _pretraining_ a model for each task costs a few hundred dollars.

                • coder68 2 days ago

                  By LLMs I meant decoder-only models, e.g. Gemini, Claude, etc. Can you go into more detail on how you're using the encoder models? I'm curious. Typically I have used them for embedding text or for fine-tuning after attaching a classifier head. What are you pre-training on, and for what task?

                  • > how you're using the encoder models?

                    In my original comment this is what I was referring to: using the embeddings produced by these models, not using something like GPT to classify text (that's wildly inefficient and in my experience gets subpar results).

                    To answer your question: you simply use the embedding vector as the features in whatever model you're trying to train. I've found this to get significantly superior results with significantly fewer examples than any traditional NLP approach to vector representation.

                    > What are you pre-training on, and for what task?

                    My experience has been that you don't need to pretrain at all. The embeddings are more information-rich than anything you could achieve with other vector representations you might come up with using the set of data you have. This might not be true at extreme scales, but for nearly all traditional NLP classification tasks I've found this to be so much easier to implement and so much better performing that there's really not a good reason to start with a "simpler" approach.

                    • coder68 2 days ago

                      Ah yes, this does make sense. We are definitely in agreement on the point of "wildly inefficient and subpar". I'll try out decoder model embeddings soon, e.g. Qwen/Qwen3-Embedding-8B. I'm working with largish amounts of data (200M records), so I tried to pick a good balance between size:perf:cost, using BAAI/bge-base-en-v1.5 to start (768 dim).

                • leobg 1 day ago

                  If I have 1,000 labeled examples for a classification task, I’ll expand that into a training dataset using augmentation, and then finetune a small model like RoBERTa. It’s fast, cheap, accurate — and predictable.

                  Others have had success with SetFit as the training framework and Ettin as the base model.
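
                  The SetFit route looks roughly like this (a sketch; the data and base model here are placeholders, not Ettin):

                      from datasets import Dataset
                      from setfit import SetFitModel, Trainer

                      # The ~1,000 labeled examples, expanded by augmentation, go here.
                      train_ds = Dataset.from_dict({
                          "text": ["great product", "does not work"],
                          "label": [1, 0],
                      })

                      model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")
                      trainer = Trainer(model=model, train_dataset=train_ds)
                      trainer.train()

                      print(model.predict(["works perfectly"]))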

                  • coder68 1 day ago

                    Oh, this seems like an interesting idea. What tactics do you use for augmentation? For my own use case, I think I could reorder semantic chunks, or maybe randomly delete pieces, but I'm curious what tactics you use!

                    I have also considered training a small language model for synthetic data generation.
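
                    For the random-deletion idea, even naive word dropout might be a starting point (just a sketch):

                        import random

                        def word_dropout(text: str, p_drop: float = 0.1, n_copies: int = 5) -> list[str]:
                            """Generate noisy copies of a labeled example by dropping random words."""
                            words = text.split()
                            copies = []
                            for _ in range(n_copies):
                                kept = [w for w in words if random.random() > p_drop]
                                copies.append(" ".join(kept) if kept else text)
                            return copies

                        print(word_dropout("the delivery arrived two days late and the box was damaged"))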

                  • littlekey 1 day ago

                    Doesn't that mean having to go back to manually labeling examples? That can be a big hurdle compared to just zero/few-shotting some stuff into the LLM prompt. Unless there's something I'm misunderstanding about your approach. Or maybe it's possible to do an unsupervised clustering step on the vectors to get the labeled categories that you can then pass to the supervised classification model. Though I guess that would depend on how strictly defined the target categories are for the use case in question.

                    • coder68 23 hours ago

                      To some degree manual labeling has to be done anyway, just to validate that any approach works at all; you'll always need ground truth from somewhere. What I suggested is that zero/few-shotting might not be good enough, depending on the problem. Labeling ~1000 samples isn't too bad, I've done it by hand a few times now. If you can source a high-quality positive signal from somewhere (e.g. user-behavioral data), even better.

                    • IanCal 2 days ago

                      It depends on a lot of things but to add to your possible setups you can potentially improve results by using simpler systems for first answers and falling back afterwards.

                      For example (a rough code sketch follows the list):

                      If contains cafe and not internet/cyber/etc -> restaurant

                      No -> (tfidf) -> yes, no, unsure

                      unsure -> embeddings -> yes, no, unsure

                      unsure -> llm -> yes, no, unsure

                      unsure -> human queue ->...
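
                      In code, the cascade looks something like this (a sketch with stub stages standing in for the real models):

                          from typing import Callable, List, Tuple

                          # Stubs standing in for real models; each returns (label, confidence).
                          def tfidf_stage(text: str) -> Tuple[str, float]: return ("unsure", 0.5)
                          def embedding_stage(text: str) -> Tuple[str, float]: return ("unsure", 0.6)
                          def llm_stage(text: str) -> Tuple[str, float]: return ("yes", 0.95)

                          STAGES: List[Callable[[str], Tuple[str, float]]] = [tfidf_stage, embedding_stage, llm_stage]

                          def classify(text: str, threshold: float = 0.9) -> str:
                              t = text.lower()
                              # Cheap rule handles the obvious cases outright.
                              if "cafe" in t and not any(w in t for w in ("internet", "cyber")):
                                  return "yes"
                              # Escalate through increasingly expensive stages.
                              for stage in STAGES:
                                  label, conf = stage(text)
                                  if label != "unsure" and conf >= threshold:
                                      return label
                              return "human-queue"  # nothing was confident enough

                          print(classify("joe's cafe and grill"))  # -> "yes" via the cheap rule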

                      • coder68 2 days ago

                        I think the idea of backoff by ratcheting up complexity here is a very good idea, thanks for your suggestions.

                        • IanCal 2 days ago

                          Happy to help - this is a thing I’ve employed multiple times for real cases.

                          One big benefit is that it uses the cheapest and most understandable approaches for the majority of cases, and scales up quite nicely. It has a neat place for very custom issues to be fixed too.

                          There will always be some things that simple approaches think are clear but aren’t, which is awkward but then all pipelines end up with that somewhere.

                          Edit - you can also deploy things earlier if you start from the beginning of the chain. Moving from one big deploy to iterating on the remaining issues is often a win purely in terms of deployment risk.

                          • coder68 23 hours ago

                            To chime in about where I'm at -- one problem was solved with a statistical classifier, but to bootstrap another, I ended up using keywords. It took a few hours to get a reasonable solution, and it leans more towards precision than recall, but it worked quickly!

                      • siddheshgunjal 2 days ago

                        At my work, we still prefer to use DistilBERT for text classification. It almost always does well with a little bit of fine-tuning. In very rare cases, we use LLMs/agentic setups when the task involves referring to both images and text at the same time.

                        • coder68 2 days ago

                          I can confirm that DistilBERT has worked well when I have used it for classification, especially on shortish sequences. I'm really interested in trying out ModernBERT, or a smaller variant, due to the larger context window (8192 tokens).

                          • siddheshgunjal 2 days ago

                            I was thinking of trying ModernBERT for one of my projects. But I can only conclude after seeing the performance for my usecase. Do you think ModernBERT will be capable of expanding abbreviated sentences?

                        • meander_water 2 days ago

                          Are your categories fixed? If so you could constrain the output using enums in structured outputs.

                          re: inconsistencies in output, OpenAI provides seed and system_fingerprint options to (mostly) produce deterministic output.
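
                          Something along these lines with the OpenAI Python SDK (a sketch; the model name and categories are placeholders):

                              from enum import Enum
                              from openai import OpenAI
                              from pydantic import BaseModel

                              class Category(str, Enum):
                                  restaurant = "restaurant"
                                  retail = "retail"
                                  other = "other"

                              class Classification(BaseModel):
                                  category: Category  # enum constrains the output

                              client = OpenAI()
                              resp = client.beta.chat.completions.parse(
                                  model="gpt-4o-mini",
                                  messages=[{"role": "user", "content": "Classify this store: 'Joe's Pizza'"}],
                                  response_format=Classification,
                              )
                              print(resp.choices[0].message.parsed.category)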

                          • coder68 2 days ago

                            The outputs are working correctly in terms of formatting, but the answers themselves may be inconsistent. I have experimented with varying the prompt, and the answers can change dramatically. I could experiment with lowering temperature, but I just don't think generative models were a good fit for the problem. The appeal is the speed of prototyping and no need for training data, but it honestly didn't take much for my problem: one afternoon and ~1000 labeled samples got me to a good baseline.

                        • nine_k 2 days ago

                          How about expense? LLMs do dramatically more computation for simple tasks, and only run on relatively exotic, expensive hardware. You have to trust an LLM provider, and keep paying them.

                          If a traditional NLP solution can run under your control, and tackle the task at hand, it can be plainly much cheaper at scale.

                          • lyu07282 2 days ago

                            That's absurd; there are thousands of open-source LLMs you can run yourself, of all shapes and sizes.

                            • nine_k 2 days ago

                              Are many of them comparable to Claude Sonnet or GPT-5? What kind of hardware do they require?

                              • lyu07282 2 days ago

                                None of them, of course. But the point is that even smaller open-source "LLMs" (more specifically, transformer architectures) that you can run anywhere yourself outperform these "traditional" pipelines with less compute. I'd also say it's not well defined what "traditional" even means here, since I wouldn't describe CNNs/BiLSTMs as "traditional"; in my mind that would be spaCy <2.0 and NLTK (linear models, SVMs/TF-IDF, Word2Vec/GloVe/fastText, etc.). LLMs are at least two generations ahead of those, since the whole "deep learning" craze came in between.

                        • binarymax 3 days ago

                          I’ve been a user of SpaCy since 2016. I haven’t touched it in years and I just picked it up again to develop a new metric for RAG using part of speech coverage.

                          The API is one of the best ever, and really set the bar high for language tooling.

                          I’m glad it’s still around and getting updates. I had a bit of trouble integrating it with uv, but nothing too bad.

                          Thanks to the explosion team for making such an amazing project and keeping it going all these years.

                            To the new “AI” people in the room: check out spaCy, and see how well it works and how fast it chews through text. You might find yourself in a situation where you don’t need to send your data to OpenAI for some small things.

                            Edit: I almost forgot to add this little nugget of history: one of Hugging Face’s first projects was a spaCy extension for coreference resolution, built before their breakthrough with transformers: https://github.com/huggingface/neuralcoref

                            • ok_dad 2 days ago

                              What’s great about the API that you enjoy and do you have anything you hate about it?

                              I’m writing a small library at work for some NLP tasks and I haven’t got a whole lot of experience in writing libraries for NLP, so I’m interested in what would make my library the best for the user.

                              • binarymax 2 days ago

                                The thing about spaCy’s API is that it perfectly aligned how NLP worked at the time with actual programming paradigms, allowing you to be very Pythonic. For example, you can use a list comprehension to get all the nouns from a document in a one-liner.
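
                                For instance:

                                    import spacy

                                    nlp = spacy.load("en_core_web_sm")
                                    doc = nlp("The quick brown fox jumps over the lazy dog.")
                                    nouns = [tok.text for tok in doc if tok.pos_ == "NOUN"]  # the one-liner
                                    print(nouns)  # ['fox', 'dog']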

                                These days NLP is quite different, because we look for outcomes rather than iterating over tokens.

                                What does your NLP library need to do? The way I design APIs is I write the calling code that I want to exist, and then I write the API to make it work. Here’s an example I’ve worked on for LLM integration. I just wanted to be able to get simple answers from an LLM and cast the answer to a type: https://www.npmjs.com/package/llm-primitives

                            • robotswantdata 2 days ago

                                SpaCy is the OG; nothing but praise for the devs. Built a lot of very powerful legal apps with it pre-GPT, and it's still very useful today for NER where you want something “small”, fast, and reliable.

                              Used it again recently and the dev experience is 1000x that of wrangling LLMs.

                              • jftuga 2 days ago

                                I recently wrote an open source Python module to deidentify people's names and gender specific pronouns. It uses spaCy's Named Entity Recognition (NER) capabilities combined with custom pronoun handling. See the screenshot in the README.md file.

                                * https://github.com/jftuga/deidentification

                                * https://pypi.org/project/text-deidentification/
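
                                  The core idea, setting aside my library's own API, is roughly this (a simplified sketch using plain spaCy):

                                      import spacy

                                      nlp = spacy.load("en_core_web_sm")

                                      def deidentify(text: str, placeholder: str = "[REDACTED]") -> str:
                                          doc = nlp(text)
                                          out = text
                                          # Replace PERSON spans right-to-left so character offsets stay valid.
                                          for ent in reversed(doc.ents):
                                              if ent.label_ == "PERSON":
                                                  out = out[:ent.start_char] + placeholder + out[ent.end_char:]
                                          return out

                                      print(deidentify("Alice met Bob in Paris."))  # names go, "Paris" stays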

                                • bratao 3 days ago

                                  I'm really curious about the history of spaCy. From my PoV: it grew a lot during the pandemic era, hiring a lot of employees. I remember something about raising money for the first time. It was very competitive in NLP tasks. Now it seems that it has scaled back considerably, with a dramatic reduction in employees and a total slowdown of the project. The v4 version looks postponed. It isn't competitive in many tasks anymore (for tasks such as NER, I get better results by fine-tuning a BERT model), and the transformer integration is confusing.

                                  • cantdutchthis 2 days ago

                                    former employee here, Matt wrote a blogpost with pretty much all of the details here: https://honnibal.dev/blog/back-to-our-roots

                                  • binarymax 3 days ago

                                      I’ve had success with fine-tuning their transformer model. The issue was that there was only one per language, compared to Hugging Face, where you have a choice of many quality variants that best align with your domain and data.

                                    The SpaCy API is just so nice. I love the ease of iterating over sentences, spans, and tokens and having the enrichment right there. Pipelines are super easy, and patterns are fantastic. It’s just a different use case than BERT.

                                  • patrickhogan1 3 days ago

                                        SpaCy was my go-to library for NER before GPT-3+. It was 10x better than regex (though you could also include regex within your pipelines).

                                    Its annotation tooling was so far ahead. It is still crazy to me that so much of the value in the data annotation space went to Scale AI vs tools like SpaCy that enabled annotation at scale in the enterprise.

                                    • skeptrune 3 days ago

                                      SpaCy is criminally underrated. I expect to see it experience a new wave of growth as folks new to AI start to realize all of the language tooling they need to build more reliable "traditional" ML pipelines.

                                      API surface is designed well and it's still actively maintained almost 10 years after it initially went public.

                                      • chpatrick 3 days ago

                                        Is there any use case for "traditional" NLP in the age of LLMs?

                                        • skeptrune 3 days ago

                                              Most definitely! LLMs are amazing tools for generating synthetic datasets that can be used alongside traditional NLP to train models like gradient-boosted decision trees with libraries like CatBoost or XGBoost.

                                              I have a search background, so learning to rank is always top of mind for me, but there are other places, like sentiment analysis, intent detection, and topic classification, where it's great too.

                                          • coder68 3 days ago

                                            Do you have any sources/links that talk about this? I'm very interested in synthetic data generation, so curious what you've tried or what works / doesn't work, especially with regards to LTR.

                                            • chpatrick 2 days ago

                                              But for the analysis use cases you mentioned, can't you just ask an LLM to read the text and output the answer as JSON, and you're done? Is it just because running LLMs is expensive?

                                              • skeptrune 2 days ago

                                                  No, it's just slow and less accurate. It's the wrong tool for the job when you care a lot about understanding the reasoning and internals of what the model cares most about.

                                            • binarymax 3 days ago

                                              Some low hanging fruit: SpaCy makes an amazing chunking tool for preprocessing text for LLMs.
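
                                                E.g., sentence-aware chunking is only a few lines (a sketch; the size cap is arbitrary):

                                                    import spacy

                                                    nlp = spacy.load("en_core_web_sm")

                                                    def chunk(text: str, max_chars: int = 500) -> list[str]:
                                                        """Greedily pack whole sentences into chunks for LLM preprocessing."""
                                                        chunks, current = [], ""
                                                        for sent in nlp(text).sents:
                                                            if current and len(current) + len(sent.text) > max_chars:
                                                                chunks.append(current.strip())
                                                                current = ""
                                                            current += sent.text + " "
                                                        if current.strip():
                                                            chunks.append(current.strip())
                                                        return chunks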

                                              • lyu07282 2 days ago

                                                  I used to work a lot with those pipelines, and I think the truth is that LLMs (and LLM embeddings) have surpassed pretty much all traditional NLP. I guess if speed is more important than accuracy? But even then, even small embedded LLMs probably outperform "traditional NLP" on pretty much every task, so it doesn't make a lot of sense not to use them nowadays.

                                            • joshdavham 2 days ago

                                              I’ve been using SpaCy for many of my projects for 5 years now. The library has incredible ergonomics and allows you to reuse the same API across languages as different as French and Japanese! I also appreciate that they allow you to install different model sizes (I usually go with small).

                                              • erikqu 2 days ago

                                                  I figured this project died post-ChatGPT. I <3 spaCy; I learned a ton on this platform back in the day.

                                                • bobosha 2 days ago

                                                  SpaCy is awesome - we have used it in a number of enterprise-grade applications and found it to hold up well.

                                                  • renegat0x0 2 days ago

                                                      I use spaCy in my Raspberry Pi project. I am not sure I want to use an LLM for analyzing words in it.

                                                    • giantg2 3 days ago

                                                      What are the key differences from other NLP Python libraries?

                                                      • jihadjihad 3 days ago

                                                        Speed (the C in spaCy). A decade ago it was hard to find anything actually production grade for NLP, most packages had an academic bent or were useful for prototyping. SpaCy really changed the game by being able to run performant NLP on standard hardware.

                                                        • esafak 3 days ago

                                                          nltk was slow.

                                                          • EagnaIonat 2 days ago

                                                              nltk was never intended for production; it was built for teaching.

                                                        • ur-whale 2 days ago

                                                          At the risk of asking a naive question ... why would anyone still do traditional NLP today?

                                                          • nutjob2 2 days ago

                                                            There is a need for good and easy NLP "structured" interfaces to traditional structured data software. This is a gaping hole in the tech right now. Most other NLP tasks can be handled by ML approaches but this one is a poor fit for those. I'm sure LLM true believers will disagree.

                                                            • apprentice7 2 days ago

                                                              Could you give a couple specific examples? I'm trying to get into traditional NLP but everything I find is AI related and I don't know if it's worth going the traditional route long-term.