Machine Learning
Welcome to the home page of the Wikimedia Foundation's Machine Learning team.
Machine learning projects at the Wikimedia Foundation
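Our team develops and maintains machine learning models for Wikimedia's end users, and oversees all of the infrastructure needed for model design, training, and deployment.
Current projects
- Machine learning model cards
- The machine learning model update project
- Lift Wing - machine learning model infrastructure built on KServe and Kubernetes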
For archived projects, see this list.
Contact
Have a question? Want to talk about machine learning with team members or volunteers? Here are the best ways to reach us.
Team chat
Join the public IRC channel #wikimedia-ml on the irc.libera.chat server to discuss machine learning and get involved in the team's work.
Updates
-
- We are continuing to integrate the article-country model into Lift Wing. The article-country model predicts which countries a particular article is relevant to, and it's an extension of the article-topic model, which we have used for years.
- We're trying different approaches to building vLLM (a high-throughput and memory-efficient system designed for serving large language models) and ROCm (the code that allows the CPU to talk to AMD GPUs) with Ubuntu. This is part of the work of making production LLMs on Lift Wing possible.
- We're currently working on configuring the ML Lab servers. These are for model training.
- Updated the rec-api image deployment model. Deployed the reference need model to production.
-
- Following up on a recurring issue reported by the Structured Content team: the MediaDetection API can access the logo-detection endpoint via mwdebug1001.eqiad.wmnet and mwdebug2001.codfw.wmnet, but can't access it on k8s-mwdebug.
- Adding logo-detection documentation to the API portal docs.
- Investigating occasional slow queries on LiftWing when using some RevScoring models
- Continuing the remaining work on the pre-save revertrisk model. This model is designed to provide a vandalism prediction before an edit is saved to Wikipedia (and thus doesn't have a revision ID)
- Work continues on upgrading KServe to 0.13.
- Initializing install config for the GPU hosts in eqiad
-
- Apologies for the late update. I came down with COVID.
-
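- Work on the logo detection model continues. We built an example logo detection model server that processes base64 image objects instead of image URLs, and have asked the Structured Content team for feedback on it.
- Work on the HuggingFace model server is in progress.
- General bug fixes and improvements.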
-
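- Work on the logo detection model continues. This week the whole team discussed whether images should be sent directly to Lift Wing, or whether Lift Wing should instead receive the URL of the image's location and access/download the image through that URL. This is an important question because it affects the size of REST payloads, especially for batch requests.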
- We are still working on the HuggingFace model server GPU issue (i.e. it won't recognize our AMD GPU). There are a number of possibilities as to why, but we want this resolved before we finalize our order for this fiscal year.
- A number of misc bug fixes and improvements.
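The payload trade-off discussed for the logo detection model (sending the image itself versus sending only its URL) can be illustrated with a rough size comparison. This is a minimal sketch, not the Lift Wing API: the JSON field names `image` and `url`, the example URLs, and the 300 kB stand-in image are all assumptions made for illustration.

```python
import base64
import json


def inline_payload(image_bytes: bytes) -> str:
    """Hypothetical request body that carries the image itself, base64-encoded."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image": encoded})


def url_payload(image_url: str) -> str:
    """Hypothetical request body that carries only the image location."""
    return json.dumps({"url": image_url})


def batch_size_bytes(payloads) -> int:
    """Total number of bytes a batch of request bodies would put on the wire."""
    return sum(len(p.encode("utf-8")) for p in payloads)


# Stand-in for a 300 kB image; real image bytes would behave the same way.
fake_image = bytes(300_000)

inline_batch = [inline_payload(fake_image) for _ in range(10)]
url_batch = [url_payload(f"https://example.org/images/{i}.png") for i in range(10)]

print("inline batch bytes:", batch_size_bytes(inline_batch))
print("url batch bytes:   ", batch_size_bytes(url_batch))
```

Base64 encodes every 3 input bytes as 4 output characters, so inline payloads are about a third larger than the raw images, and a batch request multiplies that cost. URL payloads stay small, but they move the download work (and the question of which hosts to trust) onto the model server.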
-
- Our big Istio refactoring is underway (slides)! This refactoring will allow us to move a lot of networking logic out of individual model containers. For example, if the `discovery.wmnet` endpoint (WMF's internal endpoint for APIs) changed today, we'd have to update hundreds of individual model containers and redeploy them. This refactoring removes this need entirely.
- We've been deploying AMD's open source software stack (ROCm) inside each k8s node, but we suspect this has been unnecessary (and actually causing some problems) because PyTorch already has a version of ROCm included in the library. This work is being prioritized because completing it is a requirement for placing the large GPU order we have planned for later in the quarter.
- We are preparing a patch that enables the logo-detection model server to access external URLs using internal k8s endpoints. This is one of the changes we needed to make to deploy the model.
- Continuing to test the HuggingFace model server image on our Lift Wing nodes. This work was paused for the week while the engineer attended the Wikipedia Hackathon in Tallinn.
- Lift Wing caching work has been paused until the Istio refactoring is complete.
-
- Reviewing and testing the big patch for the ORES extension. The ORES extension provides a way to see the probability that a particular edit will be reverted, for all edits on the recent changes page of many wikis. The patch integrates the new revert risk model into the extension so that volunteers can use that new model when hunting down potential vandalism.
- We're still doing some tweaks to the image processing for the logo detection model, specifically restricting the image processing to trusted domains that host Wikimedia Commons images.
- We have a big Istio (Istio is the service mesh for k8s that controls how microservices share data with each other) refactoring proposal under discussion. On Tuesday the team will have a special meeting to discuss the proposed refactoring and decide on the path forward. I'll post the slides next week if people are interested.
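Restricting image processing to trusted domains, as mentioned above for the logo detection model, could look roughly like the following. This is a sketch under stated assumptions rather than the team's actual patch: the allowlist contents (`upload.wikimedia.org`), the HTTPS-only rule, and the function name are illustrative choices.

```python
from urllib.parse import urlparse

# Assumption: an illustrative allowlist; the real list of trusted hosts may differ.
TRUSTED_HOSTS = frozenset({"upload.wikimedia.org"})


def is_trusted_image_url(url: str) -> bool:
    """Return True only for HTTPS URLs whose host is exactly on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    # hostname strips any port and userinfo; it is None when no host is present.
    host = (parsed.hostname or "").lower()
    return host in TRUSTED_HOSTS


if __name__ == "__main__":
    for candidate in (
        "https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg",
        "https://upload.wikimedia.org.evil.example/logo.png",
        "http://upload.wikimedia.org/logo.png",
    ):
        print(candidate, "->", is_trusted_image_url(candidate))
```

Comparing the parsed hostname for an exact match, instead of checking whether the URL string merely contains a trusted name, avoids lookalike hosts such as `upload.wikimedia.org.evil.example`.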
-
- The logo detection model is being moved to the experimental namespace. This will be a moment where we can test the model in a production setting to make sure that it has the performance that we want. This work is being coordinated really closely with the structured content team to make sure it meets their needs.
- The ML and Research Airflow Pipeline Sprint started this week. This is an effort to see how we can use Airflow pipelines and GPUs on the existing Hadoop cluster to train models.
- Work continues on the Cassandra clusters that will be part of the caching solution.
- Work continues on the Hugging Face model server image. This is an effort that we're working on that will allow us to easily host many of the models that are available on Hugging Face onto Lift Wing directly. This is actually a really interesting project because it's an easy way for the community to experiment with the models that they might want to host on Lift Wing and even propose models that they might want to have on Lift Wing.
- We are working with the data center operations team on the procurement of new machines with GPUs. The current status is that we are working with the vendor to resolve an issue around the availability of a particular server configuration and looking at some alternatives.
-
- Chris is on vacation. No update this week.
-
- Big win for the week: our HuggingFace Docker image patch has been reviewed and approved. This Docker image allows us to deploy HuggingFace models quickly onto Lift Wing, in a way that will speed up our development process going forward.
- Continuing to integrate the logo-detection prototype into a KServe custom model server that will be hosted on Lift Wing.
- Working on the revertrisk-multilingual GPU image and ensuring that the RRML model is compatible with torch 2.x (e.g. checking that predictions are still correct, since the model was trained with 1.13).
-
- We are still working on the logo detection model for Wikimedia Commons. The current status is that we have confirmed with the product team working on the feature that the model is returning the expected outputs. The next step is to look at input validation and image size limits. The open question we are discussing with the product team is whether resizing of images should be done inside Lift Wing or prior to the image being sent to Lift Wing. Resizing is important because the logo detection model expects an image of a certain size.
- Work / banging our heads continues on the PyTorch base image. For those following along, we are working with Service Ops to make a reasonably sized Docker image that contains PyTorch and ROCm support. If the base image is too big it becomes a problem for our Docker registry, and we are trying to be good stewards of that common resource. Turns out it is harder than we thought.
- More work is happening on Lift Wing caching. We are still working out how we want Lift Wing (specifically KServe's Istio) to talk to the Cassandra servers.
- A new version of the Language Agnostic Revert Risk model has been deployed to staging and is currently undergoing load testing.
- More work on the HuggingFace model server integration with Lift Wing. Once we crack this we will be able to deploy most models on HuggingFace quickly.
-
- We stood up a Wikimedia community of practice for ML this week. The goal is to provide a space for all the folks around WMF that are working on the technical side of ML to share insights and learn together. Currently there are folks from a number of teams in the community of practice, including ML, Research, Content Translation, and others.
- We are still waiting for our test GPUs (one server with two MI210s) to be installed in the data center. Once we have verified that this configuration works well in our infrastructure (a few days of testing at most), we can continue with the full order.
- I am starting work on a white paper that surveys all the work the Wikimedia-verse is doing around AI. This includes models WMF hosts, advocacy work done by WMF, work by volunteers, etc. If you know some people I should talk with, definitely reach out.
- We are really pushing hard on getting caching deployed. The reason is that caching lets us take full advantage of the CPUs we have now by pre-caching predictions. The end result for users is that a prediction that might take 500ms would take a fraction of that time. The exact current status of the work is that our SRE is trying to get Lift Wing to speak to the Cassandra servers.
- Our SLO dashboards need to be fixed. They are giving some wild numbers that are clearly incorrect. Our team is working with folks to figure it out.
- Work on the Logo Detection model continues. The request to host this model comes from the Structured Content team. The goal is to predict logos in Wikimedia Commons because logos account for a significant chunk of files that receive a deletion request.
- We are continuing to try to load the HuggingFace model server onto Lift Wing. When completed this offers the potential to load a model hosted on HuggingFace into Lift Wing quickly and easily, opening a huge new library of models for folks to use.
-
- We are working on deploying a model for the Structured Content team that detects potentially copyrighted image uploads on Commons, specifically images with logos. (T358676)
- We are continuing to work on hosting the HuggingFace model server on Lift Wing. This would make deploying HuggingFace models super simple.
- We have deployed Dragonfly cache on Lift Wing to help with Docker image sizes.
- Our Cassandra databases for an eventual caching system are in production. There is still more work to do, but it's a good start.
- General updates and bug fixes.
-
- Sorry for the update being one day late. I (Chris) attended the Strategy meeting in NYC and am writing this update from the plane back.
- An issue we are facing is that WMF's Docker registry is set up for smaller Docker images (~2 GB). However, the team's Docker images can get pretty big because of ROCm/PyTorch (~6-8 GB). We are working out how to resolve that. There are a number of strategies we can pursue, from optimizing the image layers better to requesting that the maximum Docker image size limit be increased.
- As a partial solution to the above, we installed Dragonfly, which is a peer-to-peer layer between our Kubernetes cluster and the WMF Docker registry. We will also work on some other improvements.
- We are continuing to work on including HuggingFace's prebuilt model server in Lift Wing. This would mean we could quickly deploy any model on HuggingFace with all the optimizations HuggingFace provides (T357986). This isn't done yet, but it would be really nice to have.
- Fixing a reported bug about inconsistent data types for article quality scores on ptwiki. The error was caused by the mixed schema of the responses returned by ORES (T358953).
- We made our server hardware request for the next fiscal year. The short version is: GPUs.
-
- The GPU order is underway. We are in the process of ordering a series of servers to use for training and inference. Each server will have two MI210 AMD GPUs. Most will be reserved for model inference (specifically, larger models like LLMs), but we will use two servers (4 GPUs) to create a model training environment. This model training environment will start very small and scrappy but will hopefully grow into a place for automated retraining of models and the standardization of model training approaches. As a next step, a single server is on its way to our data center; once it has been tested we will make the full order.
- Work on caching for Lift Wing continues. We are in the process of making a large order of GPUs. However, to optimize our resource use, one of the best strategies available to us is to conduct model inference using our existing CPUs. This is not always possible, for example in cases where the set of possible model inputs is not finite. However, in cases where the possible inputs are finite, we can cache the predictions for those inputs and then serve them to users rapidly with minimal compute used. This is similar to the system that was originally used on ORES.
- The pentesting of Lift Wing continues. The testing is being done by a third party contractor and is examining our vulnerability to malicious code.
- Wikimedia's branding team has come out with some suggestions for the naming of machine learning tools and models. The hope is that our naming will become more systematic and less ad hoc.
- Chris helped organize and attended an event in Bellagio, Italy to craft a research agenda for researchers interested in Wikipedia. That research agenda is available here.
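The caching strategy described above for Lift Wing (precompute predictions when the set of possible inputs is finite, then serve them without re-running the model) can be sketched as follows. This is a toy illustration under stated assumptions: the in-memory dictionary stands in for the Cassandra-backed store, and `PredictionCache`, `toy_model`, and the revision IDs are hypothetical names and values, not part of Lift Wing.

```python
from typing import Callable, Dict, Iterable


class PredictionCache:
    """Toy cache: a dict stands in for the real backing store (an assumption)."""

    def __init__(self, model: Callable[[int], float]) -> None:
        self._model = model
        self._store: Dict[int, float] = {}
        self.model_calls = 0  # counts how often the (expensive) model actually ran

    def _run_model(self, rev_id: int) -> float:
        self.model_calls += 1
        return self._model(rev_id)

    def precache(self, rev_ids: Iterable[int]) -> None:
        """Compute and store predictions ahead of time for a finite set of inputs."""
        for rev_id in rev_ids:
            self._store[rev_id] = self._run_model(rev_id)

    def predict(self, rev_id: int) -> float:
        """Serve from the cache when possible; otherwise compute and remember."""
        if rev_id not in self._store:
            self._store[rev_id] = self._run_model(rev_id)
        return self._store[rev_id]


def toy_model(rev_id: int) -> float:
    """Hypothetical stand-in for a slow CPU-bound model call."""
    return (rev_id % 100) / 100


cache = PredictionCache(toy_model)
cache.precache(range(1000, 1005))   # five known inputs, five model runs
hit = cache.predict(1002)           # served from the cache, no model run
miss = cache.predict(2000)          # unseen input, one extra model run
print(hit, miss, cache.model_calls)
```

The sketch only pays off when inputs repeat or can be enumerated in advance. For open-ended inputs, such as arbitrary text sent to an LLM, every request is effectively a cache miss, which is why that case still needs dedicated inference hardware.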