56 changes: 56 additions & 0 deletions examples/semantic_search/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# Semantic Search Example with sqlite-vector

This Python example demonstrates how to build a semantic search engine using the [sqlite-vector](https://github.com/sqliteai/sqlite-vector) extension and a Sentence Transformer model. It lets you index and search documents by vector similarity, with embeddings generated by a model that runs locally.

### How it works

- **Embeddings**: Uses [sentence-transformers](https://huggingface.co/sentence-transformers) to generate dense vector representations (embeddings) for text. The default model is [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), a fast, lightweight model (384 dimensions) suitable for semantic search and retrieval tasks.
- **Vector Store and Search**: Embeddings are stored in SQLite using the [`sqlite-vector`](https://github.com/sqliteai/sqlite-vector) extension, enabling fast similarity search (cosine distance) directly in the database.
- **Sample Data**: The `samples/` directory contains example documents you can index and search immediately.
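
The store-and-search idea above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the code of this example: the `documents` table, the `pack`/`unpack` helpers, and the toy 3-dimensional vectors are all assumptions made for illustration, and the brute-force scan in Python stands in for the similarity search that `sqlite-vector` performs inside the database.

```python
import math
import sqlite3
import struct

def pack(vec):
    """Serialize a list of floats as a little-endian float32 BLOB."""
    return struct.pack(f"<{len(vec)}f", *vec)

def unpack(blob):
    """Inverse of pack(): BLOB -> tuple of floats."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def search(conn, query_vec, k=3):
    """Brute-force scan: rank every stored row by distance to the query."""
    rows = conn.execute("SELECT path, embedding FROM documents").fetchall()
    scored = [(cosine_distance(query_vec, unpack(blob)), path) for path, blob in rows]
    return sorted(scored)[:k]

conn = sqlite3.connect(":memory:")
# Hypothetical schema, chosen only for this sketch.
conn.execute("CREATE TABLE documents (path TEXT, embedding BLOB)")
# Toy 3-dimensional vectors stand in for real 384-dimensional embeddings.
toy = {"a.md": [1.0, 0.0, 0.0], "b.md": [0.0, 1.0, 0.0], "c.md": [0.9, 0.1, 0.0]}
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [(path, pack(vec)) for path, vec in toy.items()],
)

results = search(conn, [1.0, 0.0, 0.0], k=2)
for distance, path in results:
    print(f"{path}: {distance:.4f}")
```

In a real run, the query vector would come from the same embedding model that produced the stored vectors, so that distances are comparable.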

### Installation

1. Download the `sqlite-vector` extension for your platform [here](https://github.com/sqliteai/sqlite-vector/releases).

2. Extract the `vector.so` file into the main directory of the project (on macOS or Windows the file extension may differ, for example `.dylib` or `.dll`).

3. Install the dependencies:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

4. On first use, the required model will be downloaded automatically.
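
To check that the extracted extension is usable, something like the following sketch may help. It relies only on the standard `sqlite3` module; the helper name `try_load_extension` and the path `./vector` are assumptions (SQLite normally appends the platform-specific suffix itself), and some Python builds ship without extension-loading support, which the sketch treats as a failed load.

```python
import sqlite3

def try_load_extension(conn, path="./vector"):
    """Return True if the SQLite extension at `path` loaded, else False."""
    if not hasattr(conn, "enable_load_extension"):
        # Some Python builds are compiled without extension-loading support.
        return False
    conn.enable_load_extension(True)
    try:
        conn.load_extension(path)
        return True
    except sqlite3.OperationalError:
        # Raised when the file is missing or cannot be loaded.
        return False
    finally:
        # Turn extension loading back off once we are done.
        conn.enable_load_extension(False)

conn = sqlite3.connect(":memory:")
loaded = try_load_extension(conn, "./vector")
print("sqlite-vector loaded" if loaded else "could not load ./vector")
```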

### Usage

Use the interactive mode to keep the model in memory and run multiple queries efficiently:

```bash
python semsearch.py --repl

# Index a directory of documents
semsearch> index ./samples

# Search for similar documents
semsearch> search "neural network architectures for image recognition"
```
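
As a rough illustration of how REPL lines like the ones above could be split into a command and its argument, here is a small sketch built on `shlex`. The function name `parse_command` is hypothetical and this is not necessarily how `semsearch.py` parses its input.

```python
import shlex

def parse_command(line):
    """Split a REPL line into (command, argument); (None, None) if empty."""
    # shlex honours quotes, so a quoted query stays one argument.
    # Note: shlex.split raises ValueError on unbalanced quotes.
    parts = shlex.split(line)
    if not parts:
        return None, None
    command, args = parts[0].lower(), parts[1:]
    return command, " ".join(args)

print(parse_command('search "neural network architectures for image recognition"'))
print(parse_command("index ./samples"))
```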

### Example Queries

Try these queries to test semantic similarity:

- "neural network architectures for image recognition"
- "reinforcement learning in autonomous systems"
- "explainable artificial intelligence methods"
- "AI governance and regulatory compliance"
- "network intrusion detection systems"

**Note:**
- Supported extensions are `.md`, `.txt`, `.py`, `.js`, `.html`, `.css`, `.sql`, `.json`, `.xml`.
- For more details, see the code in `semsearch.py` and `semantic_search.py`.
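
The extension filter described in the note could be sketched as follows. The names `SUPPORTED_EXTENSIONS` and `find_indexable_files` are hypothetical, and the recursive walk and case-insensitive match are assumptions; the actual behavior is defined in `semsearch.py` and `semantic_search.py`.

```python
from pathlib import Path

# Extensions listed in the note above.
SUPPORTED_EXTENSIONS = {
    ".md", ".txt", ".py", ".js", ".html", ".css", ".sql", ".json", ".xml",
}

def find_indexable_files(root):
    """Return supported files under `root`, sorted for a stable order."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )

for path in find_indexable_files("./samples"):
    print(path)
```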
1 change: 1 addition & 0 deletions examples/semantic_search/requirements.txt
@@ -0,0 +1 @@
sentence-transformers
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-1.md
@@ -0,0 +1,3 @@
# Article 1: Deep Learning Neural Networks

Deep learning utilizes artificial neural networks with multiple layers to process and learn from vast amounts of data. These networks automatically discover intricate patterns and representations without manual feature engineering. Convolutional neural networks excel at image recognition tasks, while recurrent neural networks handle sequential data like text and speech. Popular frameworks include TensorFlow, PyTorch, and Keras. Deep learning has revolutionized computer vision, natural language processing, and speech recognition applications.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-10.md
@@ -0,0 +1,3 @@
# Article 10: Zero Trust Security Architecture

Zero trust security operates on the principle of "never trust, always verify," requiring authentication and authorization for every access request regardless of location. This approach assumes breach scenarios and implements continuous verification throughout the network. Key components include identity verification, device compliance checking, least privilege access, and micro-segmentation. Zero trust frameworks help organizations protect against insider threats and advanced persistent attacks.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-11.md
@@ -0,0 +1,3 @@
# Article 11: Incident Response and Recovery

Effective incident response requires predefined procedures for detecting, containing, and recovering from security breaches. Response teams follow structured phases: preparation, identification, containment, eradication, recovery, and lessons learned. Critical activities include forensic analysis, stakeholder communication, system restoration, and process improvement. Regular tabletop exercises and response plan updates ensure organizations can quickly minimize damage and restore normal operations after security incidents.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-12.md
@@ -0,0 +1,3 @@
# Article 12: Machine Learning for Malware Detection

Machine learning enhances malware detection by analyzing file characteristics, behavioral patterns, and network communications to identify threats. Static analysis examines file properties without execution, while dynamic analysis observes runtime behavior in controlled environments. Ensemble methods combining multiple algorithms improve detection accuracy and reduce false positives. AI-powered systems can identify zero-day threats and polymorphic malware that traditional signature-based solutions miss.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-13.md
@@ -0,0 +1,3 @@
# Article 13: Behavioral Analytics for Anomaly Detection

Behavioral analytics leverages machine learning to establish baseline patterns of normal user and system behavior, flagging deviations that may indicate security threats. User and entity behavior analytics (UEBA) systems monitor login patterns, data access, and application usage to detect insider threats and compromised accounts. Machine learning models adapt to changing behavior patterns while maintaining sensitivity to subtle anomalies that human analysts might overlook.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-14.md
@@ -0,0 +1,3 @@
# Article 14: AI-Driven Security Orchestration

Security orchestration platforms integrate multiple security tools and automate incident response workflows using artificial intelligence. These systems correlate alerts from various sources, prioritize threats based on risk assessment, and execute automated remediation actions. Natural language processing helps analyze threat intelligence reports, while machine learning improves decision-making accuracy over time. Orchestration reduces response times and analyst workload while maintaining consistent security procedures.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-15.md
@@ -0,0 +1,3 @@
# Article 15: Advanced Persistent Threats (APTs)

Advanced persistent threats represent sophisticated, long-term cyberattacks typically conducted by nation-states or organized criminal groups. APTs use multiple attack vectors, maintain persistent access, and employ stealth techniques to avoid detection. Common tactics include spear-phishing, zero-day exploits, living-off-the-land techniques, and lateral movement within networks. Defense requires continuous monitoring, threat hunting, and intelligence-driven security strategies to detect and neutralize these patient adversaries.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-16.md
@@ -0,0 +1,3 @@
# Article 16: Social Engineering Attack Vectors

Social engineering exploits human psychology rather than technical vulnerabilities to gain unauthorized access to systems and information. Common techniques include phishing emails, pretexting phone calls, baiting with infected media, and physical tailgating. Attackers research targets through social media and public information to craft convincing scenarios. Defense requires security awareness training, verification procedures, and creating organizational cultures that encourage reporting suspicious communications.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-17.md
@@ -0,0 +1,3 @@
# Article 17: Supply Chain Security Risks

Supply chain attacks target third-party vendors and software dependencies to compromise multiple organizations simultaneously. Attackers may insert malicious code into legitimate software updates, compromise hardware during manufacturing, or exploit trusted vendor relationships. Notable incidents include the SolarWinds and Kaseya attacks, which affected thousands of organizations. Mitigation strategies include vendor risk assessment, software composition analysis, and zero-trust principles for third-party integrations.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-18.md
@@ -0,0 +1,3 @@
# Article 18: Quantum Computing and Cryptography

Quantum computing poses both opportunities and threats for cybersecurity. Quantum computers could break current cryptographic algorithms like RSA and ECC that secure internet communications and data protection. Organizations must prepare for post-quantum cryptography by implementing quantum-resistant algorithms. However, quantum technologies also enable quantum key distribution for theoretically unbreakable communication channels. The transition period requires careful planning and gradual migration strategies.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-19.md
@@ -0,0 +1,3 @@
# Article 19: Edge Computing Security Challenges

Edge computing brings data processing closer to end users and devices, improving performance but creating new security challenges. Distributed edge nodes have limited security controls compared to centralized data centers. Attack surfaces expand across numerous endpoints with varying security capabilities. Key concerns include device authentication, data encryption, secure updates, and centralized security management. Zero-trust architectures and hardware-based security become essential for edge deployments.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-2.md
@@ -0,0 +1,3 @@
# Article 2: Natural Language Processing Fundamentals

Natural language processing enables computers to understand, interpret, and generate human language. Key techniques include tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis. Modern NLP leverages transformer architectures like BERT and GPT models for tasks such as language translation, text summarization, and question answering. Applications span chatbots, voice assistants, content moderation, and automated document analysis across various industries.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-20.md
@@ -0,0 +1,3 @@
# Article 20: IoT Security Vulnerabilities

Internet of Things devices often have weak security controls due to cost constraints and rapid deployment cycles. Common vulnerabilities include default passwords, unencrypted communications, lack of update mechanisms, and insufficient access controls. IoT botnets can launch massive distributed denial-of-service attacks. Security strategies include network segmentation, device lifecycle management, security-by-design principles, and regulatory compliance requirements for IoT manufacturers and deployments.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-3.md
@@ -0,0 +1,3 @@
# Article 3: Computer Vision Applications

Computer vision empowers machines to interpret and analyze visual information from images and videos. Core techniques include object detection, image classification, facial recognition, and motion tracking. Convolutional neural networks form the backbone of modern computer vision systems. Applications include autonomous vehicles, medical imaging diagnosis, quality control in manufacturing, augmented reality, and surveillance systems. Edge computing enables real-time computer vision processing on mobile devices.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-4.md
@@ -0,0 +1,3 @@
# Article 4: Reinforcement Learning Algorithms

Reinforcement learning trains agents to make optimal decisions through trial-and-error interactions with environments. Agents receive rewards or penalties based on their actions, gradually learning policies that maximize cumulative rewards. Q-learning and policy gradient methods are fundamental approaches. Applications include game playing (AlphaGo), robotics control, autonomous driving, recommendation systems, and financial trading algorithms. The exploration-exploitation trade-off remains a central challenge.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-5.md
@@ -0,0 +1,3 @@
# Article 5: Supervised vs Unsupervised Learning

Supervised learning uses labeled training data to predict outcomes for new inputs, including classification and regression tasks. Common algorithms include decision trees, support vector machines, and random forests. Unsupervised learning discovers hidden patterns in unlabeled data through clustering, dimensionality reduction, and association rules. Semi-supervised learning combines both approaches when labeled data is scarce. Each paradigm serves different problem types and data availability scenarios.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-6.md
@@ -0,0 +1,3 @@
# Article 6: AI Ethics and Bias Mitigation

Artificial intelligence systems can perpetuate or amplify human biases present in training data, leading to unfair outcomes across different demographic groups. Bias mitigation strategies include diverse dataset collection, algorithmic fairness constraints, and regular bias auditing. Ethical AI development requires transparency, accountability, and stakeholder involvement. Organizations must establish governance frameworks addressing privacy, consent, and algorithmic decision-making impacts on individuals and society.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-7.md
@@ -0,0 +1,3 @@
# Article 7: Explainable AI and Interpretability

Explainable AI focuses on making machine learning models more transparent and interpretable to human users. Black-box models like deep neural networks often lack interpretability, creating trust and accountability issues. Techniques include feature importance analysis, LIME (Local Interpretable Model-agnostic Explanations), and SHAP (SHapley Additive exPlanations). Interpretability is crucial for high-stakes applications like healthcare, finance, and criminal justice where decisions require justification.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-8.md
@@ -0,0 +1,3 @@
# Article 8: AI Regulation and Compliance

Governments worldwide are developing regulatory frameworks for artificial intelligence deployment and development. The European Union's AI Act categorizes AI systems by risk levels, imposing strict requirements for high-risk applications. Compliance involves documentation, risk assessment, human oversight, and algorithmic auditing. Organizations must navigate evolving regulations while maintaining innovation capabilities. Privacy laws like GDPR also impact AI data processing and automated decision-making systems.
3 changes: 3 additions & 0 deletions examples/semantic_search/samples/sample-9.md
@@ -0,0 +1,3 @@
# Article 9: Threat Detection and Prevention

Cybersecurity threat detection employs various technologies to identify malicious activities before they cause damage. Intrusion detection systems monitor network traffic for suspicious patterns, while endpoint protection software guards individual devices. Behavioral analysis identifies anomalies in user activities that may indicate compromised accounts. Security information and event management (SIEM) platforms aggregate and analyze security logs from multiple sources to provide comprehensive threat visibility.