ModernAI: Awesome Modern Artificial Intelligence

🔥Hot update in progress ...

Large Model Evolutionary Graph

[Figures: LLM evolutionary graph; MLLM (LLaMA-based) evolutionary graph]

Survey

  1. Agent AI: Surveying the Horizons of Multimodal Interaction [arXiv 2401] [paper]
  2. MM-LLMs: Recent Advances in MultiModal Large Language Models [arXiv 2401] [paper]

Large Language Model (LLM)

  1. OLMo: Accelerating the Science of Language Models [arXiv 2402] [paper] [code]
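
A minimal loading sketch for an open LLM such as OLMo via Hugging Face `transformers`; the model id and the need for `trust_remote_code` are assumptions based on the public OLMo release, not part of this list.

```python
# Hedged sketch: load an open LLM checkpoint (assumed to be allenai/OLMo-7B)
# with Hugging Face transformers and generate a short continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # assumed Hub id; older transformers may need the ai2-olmo package
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Language modeling is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```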

Chinese Large Language Model (CLLM)

  1. https://github.com/LinkSoul-AI/Chinese-Llama-2-7b
  2. https://github.com/ymcui/Chinese-LLaMA-Alpaca-2
  3. https://github.com/LlamaFamily/Llama2-Chinese

Large Vision Backbone

  1. AIM: Scalable Pre-training of Large Autoregressive Image Models [arXiv 2401] [paper] [code]

Large Vision Model (LVM)

  1. Sequential Modeling Enables Scalable Learning for Large Vision Models [arXiv 2312] [paper] [code] (💥Visual GPT Time?)

Large Vision-Language Model (VLM)

  1. UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [arXiv 2401] [paper] [code]
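
Since UMG-CLIP builds on CLIP-style image-text alignment, a minimal zero-shot classification sketch with plain CLIP (via `transformers`, not UMG-CLIP's own code) illustrates the underlying mechanism; the checkpoint id is an assumption.

```python
# Hedged sketch: CLIP-style zero-shot scoring of an image against text prompts,
# using the openai/clip-vit-base-patch32 checkpoint as a stand-in.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # replace with a real image
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (1, num_texts) similarity scores
print(logits.softmax(dim=-1))
```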

Vision Foundation Model (VFM)

  1. SAM: Segment Anything Model [ICCV 2023 Best Paper Honorable Mention] [paper] [code] (see the usage sketch after this list)
  2. SSA: Semantic segment anything [github 2023] [paper] [code]
  3. SEEM: Segment Everything Everywhere All at Once [arXiv 2304] [paper] [code]
  4. RAM: Recognize Anything - A Strong Image Tagging Model [arXiv 2306] [paper] [code]
  5. Semantic-SAM: Segment and Recognize Anything at Any Granularity [arXiv 2307] [paper] [code]
  6. UNINEXT: Universal Instance Perception as Object Discovery and Retrieval [CVPR 2023] [paper] [code]
  7. APE: Aligning and Prompting Everything All at Once for Universal Visual Perception [arXiv 2312] [paper] [code]
  8. GLEE: General Object Foundation Model for Images and Videos at Scale [arXiv 2312] [paper] [code]
  9. OMG-Seg: Is One Model Good Enough For All Segmentation? [arXiv 2401] [paper] [code](https://github.com/lxtGH/OMG-Seg)
  10. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [arXiv 2401] [paper] [code](https://github.com/LiheYoung/Depth-Anything)
  11. ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation [arXiv 2401] [paper] [code](https://github.com/Lszcoding/ClipSAM)
  12. PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation [arXiv 2401] [paper] [code](https://github.com/xzz2/pa-sam)
  13. YOLO-World: Real-Time Open-Vocabulary Object Detection [arXiv 2401] [paper] [code](https://github.com/AILab-CVC/YOLO-World)
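
A minimal SAM usage sketch (item 1 above), assuming the official `segment-anything` package and the released ViT-H checkpoint; the image and point prompt are placeholders.

```python
# Hedged sketch: prompt SAM with a single foreground point and get candidate masks.
# Assumes `pip install segment-anything` and the public sam_vit_h_4b8939.pth checkpoint.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # replace with an RGB image (H, W, 3)
predictor.set_image(image)

# One positive point prompt at pixel (x=320, y=240); SAM returns several masks with scores.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)
```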

Multimodal Large Language Model (MLLM) / Large Multimodal Model (LMM)

| Model | Vision | Projector | LLM | OKVQA | GQA | VSR | IconVQA | VizWiz | HM | VQAv2 | SQA<sup>I</sup> | VQA<sup>T</sup> | POPE | MME<sup>P</sup> | MME<sup>C</sup> | MMB | MMB<sup>CN</sup> | SEED<sup>I</sup> | LLaVA<sup>W</sup> | MM-Vet | QBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MiniGPT-v2 | EVA-Clip-g | Linear | LLaMA-2-7B | 56.9<sup>2</sup> | 60.3 | 60.6<sup>2</sup> | 47.7<sup>2</sup> | 32.9 | 58.2<sup>2</sup> | | | | | | | | | | | | |
| MiniGPT-v2-Chat | EVA-Clip-g | Linear | LLaMA-2-7B | 57.8<sup>1</sup> | 60.1 | 62.9<sup>1</sup> | 51.5<sup>1</sup> | 53.6 | 58.8<sup>1</sup> | | | | | | | | | | | | |
| Qwen-VL-Chat | | | Qwen-7B | | 57.5 | | | 38.9 | | 78.2 | 68.2 | 61.5 | | 1487.5 | 360.7<sup>2</sup> | 60.6 | 56.7 | 58.2 | | | |
| LLaVA-1.5 | | | Vicuna-1.5-7B | | 62.0 | | | 50.0 | | 78.5 | 66.8 | 58.2 | 85.9<sup>1</sup> | 1510.7 | 316.1+ | 64.3 | 58.3 | 58.6 | 63.4 | 30.5 | 58.7 |
| LLaVA-1.5 +ShareGPT4V | | | Vicuna-1.5-7B | | | | | 57.2 | | 80.6<sup>2</sup> | 68.4 | | | 1567.4<sup>2</sup> | 376.4<sup>1</sup> | 68.8 | 62.2 | 69.7<sup>1</sup> | 72.6 | 37.6 | 63.4<sup>1</sup> |
| LLaVA-1.5 | | | Vicuna-1.5-13B | | 63.3<sup>1</sup> | | | 53.6 | | 80.0 | 71.6 | 61.3 | 85.9<sup>1</sup> | 1531.3 | 295.4+ | 67.7 | 63.6 | 61.6 | 70.7 | 35.4 | 62.1<sup>2</sup> |
| VILA-7B | | | LLaMA-2-7B | | 62.3 | | | 57.8 | | 79.9 | 68.2 | 64.4 | 85.5<sup>2</sup> | 1533.0 | | 68.9 | 61.7 | 61.1 | 69.7 | 34.9 | |
| VILA-13B | | | LLaMA-2-13B | | 63.3<sup>1</sup> | | | 60.6<sup>2</sup> | | 80.8<sup>1</sup> | 73.7<sup>1</sup> | 66.6<sup>1</sup> | 84.2 | 1570.1<sup>1</sup> | | 70.3<sup>2</sup> | 64.3<sup>2</sup> | 62.8<sup>2</sup> | 73.0<sup>2</sup> | 38.8<sup>2</sup> | |
| VILA-13B +ShareGPT4V | | | LLaMA-2-13B | | 63.2<sup>2</sup> | | | 62.4<sup>1</sup> | | 80.6<sup>2</sup> | 73.1<sup>2</sup> | 65.3<sup>2</sup> | 84.8 | 1556.5 | | 70.8<sup>1</sup> | 65.4<sup>1</sup> | 61.4 | 78.4<sup>1</sup> | 45.7<sup>1</sup> | |
| SPHINX | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-Plus | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-Plus-2K | | | | | | | | | | | | | | | | | | | | | |
| SPHINX-MoE | | | | | | | | | | | | | | | | | | | | | |
| InternVL | | | | | | | | | | | | | | | | | | | | | |
| LLaVA-1.6 | | | | | | | | | | | | | | | | | | | | | |

+ indicates ShareGPT4V's (Chen et al., 2023e) re-implemented test results.
∗ indicates that the training images of the datasets are observed during training.

Paradigm Comparison
  1. LAVIS: A Library for Language-Vision Intelligence [ACL 2023] [paper] [code]
  2. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [ICML 2023] [paper] [code]
  3. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [arXiv 2305] [paper] [code]
  4. MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models [arXiv 2304] [paper] [code]
  5. MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning [github 2310] [paper] [code]
  6. VisualGLM-6B: Chinese and English multimodal conversational language model [ACL 2022] [paper] [code]
  7. Kosmos-2: Grounding Multimodal Large Language Models to the World [arXiv 2306] [paper] [code]
  8. NExT-GPT: Any-to-Any Multimodal LLM [arXiv 2309] [paper] [code]
  9. LLaVA / LLaVA-1.5: Large Language and Vision Assistant [NeurIPS 2023] [paper] [arXiv 2310] [paper] [code] (see the inference sketch after this list)
  10. 🦉mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [arXiv 2304] [paper] [code]
  11. 🦉mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [arXiv 2311] [paper] [code]
  12. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [arXiv 2305] [paper] [code]
  13. 🦅Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic [arXiv 2306] [paper] [code]
  14. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [arXiv 2308] [paper] [code]
  15. LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [arXiv 2309] [paper] [code]
  16. AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model [arXiv 2309] [paper] [code]
  17. InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition [arXiv 2309] [paper] [code]
  18. MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens [arXiv 2310] [paper] [code]
  19. CogVLM: Visual Expert for Large Language Models [github 2310] [paper] [code]
  20. 🐦Woodpecker: Hallucination Correction for Multimodal Large Language Models [arXiv 2310] [paper] [code]
  21. SoM: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [arXiv 2310] [paper] [code]
  22. Ferret: Refer and Ground Anything Any-Where at Any Granularity [arXiv 2310] [paper] [code]
  23. 🦦OtterHD: A High-Resolution Multi-modality Model [arXiv 2311] [paper] [code]
  24. NExT-Chat: An LMM for Chat, Detection and Segmentation [arXiv 2311] [paper] [project]
  25. Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [arXiv 2311] [paper] [code]
  26. InfMLLM: A Unified Framework for Visual-Language Tasks [arXiv 2311] [paper] [code]
  27. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [paper] [code] [dataset]
  28. 🦁LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [arXiv 2311] [paper] [code]
  29. 🐵Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models [arXiv 2311] [paper] [code]
  30. CG-VLM: Contrastive Vision-Language Alignment Makes Efficient Instruction Learner [arXiv 2311] [paper] [code]
  31. 🐲PixelLM: Pixel Reasoning with Large Multimodal Model [arXiv 2312] [paper] [code]
  32. 🐝Honeybee: Locality-enhanced Projector for Multimodal LLM [arXiv 2312] [paper] [code]
  33. VILA: On Pre-training for Visual Language Models [arXiv 2312] [paper] [code]
  34. CogAgent: A Visual Language Model for GUI Agents [arXiv 2312] [paper] [code] (supports 1120×1120 resolution)
  35. PixelLLM: Pixel Aligned Language Models [arXiv 2312] [paper] [code]
  36. 🦅Osprey: Pixel Understanding with Visual Instruction Tuning [arXiv 2312] [paper] [code]
  37. Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [arXiv 2312] [paper] [code]
  38. VistaLLM: Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [arXiv 2312] [paper] [code]
  39. Emu2: Generative Multimodal Models are In-Context Learners [arXiv 2312] [paper] [code]
  40. V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [arXiv 2312] [paper] [code]
  41. BakLLaVA-1: a Mistral 7B base augmented with the LLaVA 1.5 architecture [github 2310] [paper] [code]
  42. LEGO: Language Enhanced Multi-modal Grounding Model [arXiv 2401] [paper] [code]
  43. MMVP: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [arXiv 2401] [paper] [code]
  44. ModaVerse: Efficiently Transforming Modalities with LLMs [arXiv 2401] [paper] [code]
  45. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [arXiv 2401] [paper] [code]
  46. LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs [arXiv 2401] [paper] [code]
  47. 🎓InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models [arXiv 2401] [paper] [code]
  48. MouSi: Poly-Visual-Expert Vision-Language Models [arXiv 2401] [paper] [code]
  49. Yi Vision Language Model [HF 2401]
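
A minimal LLaVA-1.5 inference sketch (item 9 above), using the community `llava-hf` conversion on Hugging Face rather than the original repo's CLI; the model id and prompt template are assumptions tied to that conversion.

```python
# Hedged sketch: single-image question answering with a LLaVA-1.5 checkpoint
# via Hugging Face transformers (assumed llava-hf/llava-1.5-7b-hf conversion).
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed Hub id for the converted weights
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.new("RGB", (336, 336))  # replace with a real image
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"  # LLaVA-1.5 chat format

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```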

Multimodal Small Language Model (MSLM) / Small Multimodal Model (SMM)

  1. Vary-toy: Small Language Model Meets with Reinforced Vision Vocabulary [arXiv 2401] [paper] [code]

Image Generation with MLLM

  1. Generating Images with Multimodal Language Models [NeurIPS 2023] [paper] [code]
  2. DreamLLM: Synergistic Multimodal Comprehension and Creation [arXiv 2309] [paper] [code]
  3. Guiding Instruction-based Image Editing via Multimodal Large Language Models [arXiv 2309] [paper] [code]
  4. KOSMOS-G: Generating Images in Context with Multimodal Large Language Models [arXiv 2310] [paper] [code]
  5. LLMGA: Multimodal Large Language Model based Generation Assistant [arXiv 2311] [paper] [code]

Modern Autonomous Driving (MAD)

End-to-End Solution

  1. UniAD: Planning-oriented Autonomous Driving [CVPR 2023] [paper] [code]
  2. Scene as Occupancy [arXiv 2306] [paper] [code]
  3. FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of Autonomous Driving [arXiv 2308] [paper] [code]
  4. BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning [arXiv 2310] [paper] [code]
  5. UniVision: A Unified Framework for Vision-Centric 3D Perception [arXiv 2401] [paper] [code]

with Large Language Model

  1. Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [arXiv 2307] [paper] [code]
  2. LINGO-1: Exploring Natural Language for Autonomous Driving (Vision-Language-Action Models, VLAMs) [Wayve 2309] [blog]
  3. DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [arXiv 2310] [paper] [code]

Embodied AI (EAI) and Robo Agent

  1. VIMA: General Robot Manipulation with Multimodal Prompts [arXiv 2210] [paper] [code]
  2. PaLM-E: An Embodied Multimodal Language Model [arXiv 2303] [paper] [code]
  3. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [arXiv 2307] [CoRL 2023] [paper] [code]
  4. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [arXiv 2307] [paper] [project]
  5. RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [arXiv 2309] [paper] [code]
  6. MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [arXiv 2401] [paper] [code]

Neural Radiance Fields (NeRF)

  1. EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [arXiv 2311] [paper] [code]

Diffusion Model

  1. ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image [arXiv 2310] [paper] [code]
  2. Vlogger: Make Your Dream A Vlog [arXiv 2401] [paper] [code]
  3. BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models [arXiv 2401] [paper] [code]
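
The entries above build on pretrained text-to-image diffusion backbones; a minimal baseline call with Hugging Face `diffusers` (not the papers' released code) is sketched below, with the Stable Diffusion checkpoint id as an assumption.

```python
# Hedged sketch: plain text-to-image sampling with a pretrained diffusion pipeline,
# i.e. the kind of backbone the papers above personalize or extend.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a vlog-style photo of a mountain lake at sunrise",
    num_inference_steps=30,
).images[0]
image.save("sample.png")
```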

World Model

  1. CWM: Unifying (Machine) Vision via Counterfactual World Modeling [arXiv 2306] [paper] [code]
  2. MILE: Model-Based Imitation Learning for Urban Driving [Wayve 2210] [NeurIPS 2022] [paper] [code] [blog]
  3. GAIA-1: A Generative World Model for Autonomous Driving [Wayve 2310] [arXiv 2309] [paper] [code]
  4. ADriver-I: A General World Model for Autonomous Driving [arXiv 2311] [paper] [code]
  5. OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [arXiv 2311] [paper] [code]
  6. LWM: World Model on Million-Length Video and Language with RingAttention [arXiv 2402] [paper] [code]

Artificial Intelligence Generated Content (AIGC)

Text-to-Image

Text-to-Video

  1. Sora: Video generation models as world simulators [OpenAI 2402] [technical report] (💥Visual GPT Time?)

Text-to-3D

Image-to-3D

Artificial General Intelligence (AGI)

New Method

  1. [Instruction Tuning] FLAN: Finetuned Language Models are Zero-Shot Learners [ICLR 2022] [paper] [code]
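
A minimal sketch of the data transformation FLAN-style instruction tuning relies on: a plain supervised example is rewritten into a natural-language instruction plus target. The template wording and field names below are illustrative, not the paper's exact templates.

```python
# Hedged sketch: wrap a plain NLI example into an instruction/target pair for
# instruction tuning. Field names ("input", "target") are illustrative choices.
def to_instruction_example(premise: str, hypothesis: str, label: str) -> dict:
    prompt = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? Answer yes, no, or maybe."
    )
    return {"input": prompt, "target": label}

print(to_instruction_example(
    "A dog is running in the park.",
    "An animal is outdoors.",
    "yes",
))
```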

New Dataset

  1. DriveLM: Drive on Language [paper] [project]
  2. MagicDrive: Street View Generation with Diverse 3D Geometry Control [arXiv 2310] [paper] [code]
  3. Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper] [project] [blog]
  4. To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning (LVIS-Instruct4V) [arXiv 2311] [paper] [code] [dataset]
  5. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [paper] [code] [dataset]
  6. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [paper] [code] [dataset]

New Vision Backbone

  1. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [arXiv 2401] [paper] [code]
  2. VMamba: Visual State Space Model [arXiv 2401] [paper] [code]
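
Both backbones above replace attention with state-space (Mamba-style) blocks scanned over the patch sequence; the toy linear state-space scan below illustrates the recurrence they build on, omitting the selective, input-dependent parameters and the bidirectional/cross scans of Vision Mamba and VMamba.

```python
# Hedged sketch: a plain linear state-space scan h_t = A h_{t-1} + B x_t, y_t = C h_t,
# applied over a sequence of patch embeddings. The sequential loop is for illustration;
# real Mamba-style layers use selective, hardware-aware parallel scans.
import torch

def ssm_scan(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """x: (batch, length, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)."""
    bsz, length, _ = x.shape
    h = x.new_zeros(bsz, A.shape[0])      # hidden state, (batch, d_state)
    ys = []
    for t in range(length):
        h = h @ A.T + x[:, t] @ B.T       # state update
        ys.append(h @ C.T)                # readout
    return torch.stack(ys, dim=1)         # (batch, length, d_out)

# Tiny usage example on random "patch tokens".
x = torch.randn(2, 16, 8)                 # 2 images, 16 patches, dim 8
A, B, C = 0.9 * torch.eye(4), torch.randn(4, 8), torch.randn(8, 4)
print(ssm_scan(x, A, B, C).shape)         # torch.Size([2, 16, 8])
```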

Benchmark

  1. Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [arXiv 2401] [paper] [code]

Platform and API

  1. SenseNova (商汤日日新): SenseTime's open large-model platform [url]

SOTA Downstream Task

Zero-shot Object Detection: Visual Grounding, Open-set, Open-vocabulary, Open-world
