Machine Learning
Welcome to the main page of the Wikimedia Foundation's Machine Learning Team.
The Wikimedia Foundation and Machine Learning
Our team oversees the infrastructure needed for machine learning and is responsible for the design, training, deployment, and management of machine learning models that serve end users.
Current projects
- Machine learning model cards
- Machine learning model modernization project
- Lift Wing - scalable hosting of machine learning models, built on KServe on Kubernetes infrastructure.
For an archive of past projects, see this list.
Contact
Have a question? Want to talk things over with the Machine Learning team or the volunteer community? Here are the recommended ways to get in touch.
Team chat room
Join the public IRC chat room #wikimedia-ml on irc.libera.chat to discuss machine learning and follow along with the team's work.
Active work board
If there is a specific task you would like to discuss or work on, join the team's public work board on Phabricator. View the work board
Latest news
-
- We are continuing work to integrate the article-country model into Lift Wing. The article-country model is an extension of the article-topic model we have used for years, and predicts which countries a given article is relevant to.
- We are experimenting with various approaches on Ubuntu, aiming to build vllm (a high-throughput, memory-efficient serving system for large language models) and ROCm (the software stack that lets CPU code talk to AMD GPUs). This is part of the work to run production LLMs on Lift Wing.
- We are currently setting up the ML Lab servers, which are intended for model training.
- Updated the image model used by rec-api. Deployed the citation-needed model to production.
-
- Following up on a recurring issue reported by the Structured Content team: the MediaDetection API can access the logo-detection endpoint via mwdebug1001.eqiad.wmnet and mwdebug2001.codfw.wmnet, but cannot access it on k8s-mwdebug.
- Adding logo-detection documentation to the API portal docs.
- Investigating occasional slow queries on LiftWing when using some RevScoring models
- Continuing the remaining work on the pre-save revertrisk model. This model is designed to provide a vandalism prediction before an edit is saved to Wikipedia (and thus doesn't have a revision ID)
- Work continues on upgrading KServe to 0.13.
- Initializing install config for the GPU hosts in eqiad
-
- Apologies on my end for the delay in updates; I got COVID.
-
- Work continues on the Logo Detection model. We made an example logo-detection model-server that processes base64-encoded image objects instead of image URLs and sent it to the Structured Content team for their thoughts.
- Work continues on the HuggingFace model server.
- General improvements and bug fixes.
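As a rough sketch of the base64 approach mentioned above (the payload shape and function name are hypothetical illustrations, not the team's actual model-server code), a predict handler on the server side might decode the inline image like this:

```python
import base64

def decode_image_payload(payload: dict) -> bytes:
    """Decode a base64-encoded image from a JSON request body.

    Assumes a KServe-style payload shape like
    {"instances": [{"image": {"b64": "..."}}]}; the real schema may differ.
    """
    b64_data = payload["instances"][0]["image"]["b64"]
    return base64.b64decode(b64_data, validate=True)

# Usage with a stand-in "image" (real requests would carry PNG/JPEG bytes):
payload = {"instances": [{"image": {"b64": base64.b64encode(b"\x89PNG...").decode()}}]}
print(decode_image_payload(payload))  # b'\x89PNG...'
```

One advantage of inline base64 over URLs is that the model server needs no outbound network access; the tradeoff is request size.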
-
- Work continues on the Logo Detection model. This week the team discussed whether the encoded image should be sent directly to LiftWing, or whether we'd instead receive a URL of the image's location, which LiftWing would then access and download from. This matters because it affects the size of the REST payload, particularly with batched requests.
- We are still working on the HuggingFace model server GPU issue (i.e. it won't recognize our AMD GPU). There are a number of possibilities as to why, but we want this resolved before we finalize our order for this fiscal year.
- Various bug fixes and improvements.
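One concrete factor in the payload-size discussion above: base64 encoding inflates binary data by about a third (4 output bytes per 3 input bytes), so inlining images grows requests quickly, while a URL stays a few dozen bytes regardless of image size. A quick stdlib back-of-the-envelope check:

```python
import base64

def b64_size(n_bytes: int) -> int:
    """Byte length of the base64 encoding of n_bytes of binary data."""
    return len(base64.b64encode(b"\x00" * n_bytes))

one_mib = 1024 * 1024
print(b64_size(one_mib))            # 1398104 bytes for a 1 MiB image
print(b64_size(one_mib) / one_mib)  # ~1.333, i.e. ~33% overhead
```

With batched requests the overhead multiplies, which is why the URL option stays attractive despite requiring LiftWing to fetch the image itself.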
-
- Our big Istio refactoring is underway (slides)! This refactoring will allow us to move a lot of networking logic out of individual model containers. For example, currently if there were a change to the `discovery.wmnet` endpoint (WMF's internal endpoint for APIs), we would have to update hundreds of individual model containers and redeploy them. This refactoring removes that need entirely.
- We've been deploying AMD's open source software stack (ROCm) inside each k8s node, but we suspect this has been unnecessary (and actually causing some problems) because PyTorch already includes a version of ROCm in the library. This work is being prioritized because completing it is a prerequisite for the large GPU order we have planned later in the quarter.
- We are preparing a patch that enables the logo-detection model server to access external URLs using internal k8s endpoints. This is part of the changes we needed to make to deploy the model.
- Continuing to test the HuggingFace model server image on our Lift Wing nodes. This work was paused for the week while the engineer attended the Wikipedia Hackathon in Tallinn.
- Lift Wing caching work has been paused until the Istio refactoring is complete.
-
- Reviewing and testing the big patch for the ORES extension. The ORES extension provides a way to see the probability that a particular edit will be reverted, for all edits on the recent changes page of many wikis. The patch integrates the new revert risk model into the extension so that volunteers can use that new model when hunting down potential vandalism.
- We're still making some tweaks to the image processing for the logo detection model, specifically restricting the image processing to trusted domains that host Wikimedia Commons images.
- We have a big Istio (Istio is the service mesh for k8s that controls how microservices share data with each other) refactoring proposal under discussion. On Tuesday the team will have a special meeting to discuss the proposed refactoring and decide on the path forward. I'll post the slides next week if people are interested.
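As a simplified illustration of how revert probabilities translate into a patrolling view (hypothetical scores and names, not the extension's actual code): edits whose predicted revert probability crosses a patroller-chosen threshold get flagged on the recent changes page.

```python
def flag_risky_edits(scores: dict, threshold: float = 0.8) -> list:
    """Return edit IDs whose predicted revert probability meets the threshold."""
    return [edit_id for edit_id, p in scores.items() if p >= threshold]

# Hypothetical per-edit revert probabilities from the model:
scores = {"rev_101": 0.12, "rev_102": 0.93, "rev_103": 0.85}
print(flag_risky_edits(scores))  # ['rev_102', 'rev_103']
```

In the real extension the threshold is effectively a sensitivity setting: lower values surface more candidate vandalism at the cost of more false positives.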
-
- The logo detection model is being moved to the experimental namespace. This will let us test the model in a production setting to make sure it has the performance we want. This work is being coordinated closely with the Structured Content team to make sure it meets their needs.
- The ML and Research Airflow Pipeline Sprint started this week. This is an effort to see how we can use Airflow pipelines and GPUs on the existing Hadoop cluster to train models.
- Work continues on the Cassandra clusters that will be part of the caching solution.
- Work continues on the Hugging Face model server image. This effort will allow us to easily host many of the models available on Hugging Face directly on Lift Wing. It's a really interesting project because it gives the community an easy way to experiment with models they might want hosted on Lift Wing, and even to propose new ones.
- We are working with the data center operations team on the procurement of new machines with GPUs. The current status is that we are working with the vendor to resolve an issue around the availability of a particular server configuration and looking at some alternatives.
-
- Chris on vacation. No update this week.
-
- Big win for the week: our HuggingFace Docker image patch has been reviewed and approved. This Docker image allows us to deploy HuggingFace models quickly onto LiftWing, which will speed up the development process going forward.
- Continuing to integrate the logo-detection prototype into a KServe custom model-server that will be hosted on LiftWing.
- Work on the revertrisk-multilingual GPU image, ensuring the RRML model is compatible with torch 2.x (e.g. that predictions are correct, since the model was trained with 1.13).
-
- We are still working on the logo detection model for Wikimedia Commons. The current status is that we have confirmed with the product team working on the feature that the model is returning the expected outputs. The next step is to look at input validation and image size limits. The open question we are discussing with the product team is whether resizing of images should be done inside Lift Wing or before the image is sent to Lift Wing. Resizing is important because the logo detection model expects an image of a certain size.
- Work / banging our heads continues on the pytorch base image. For those following along, we are working with Service Ops to make a reasonably sized docker image that contains pytorch and ROCm support. If the base image is too big it becomes a problem for our Docker registry and we are trying to be good stewards of that common resource. Turns out it is harder than we thought.
- More work is happening on Lift Wing caching. We are still working out how we want Lift Wing (specifically KServe's Istio) to talk to the Cassandra servers.
- A new version of the Language Agnostic Revert Risk model has been deployed to staging and is currently doing load testing.
- More work on the HuggingFace model server integration with Lift Wing. Once we crack this we will be able to deploy most models on HuggingFace quickly.
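On the resizing question above: wherever it ends up running, the core computation is small. A stdlib-only sketch of computing aspect-preserving target dimensions (the 224-pixel target is an illustrative assumption, not the model's actual input size):

```python
def resize_dims(width: int, height: int, target: int = 224) -> tuple:
    """Scale (width, height) so the longer side equals `target`,
    preserving aspect ratio and rounding to whole pixels."""
    scale = target / max(width, height)
    return (max(1, round(width * scale)), max(1, round(height * scale)))

print(resize_dims(1024, 768))  # (224, 168)
print(resize_dims(300, 4000))  # (17, 224)
```

The real cost is not this arithmetic but the pixel resampling and the bandwidth of shipping full-size images, which is why it matters which side of the Lift Wing boundary does the work.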
-
- We stood up a Wikimedia community of practice for ML this week. The goal is to provide a space for all the folks around WMF that are working on the technical side of ML to share insights and learn together. Currently there are folks from a number of teams in the community of practice, including ML, Research, Content Translation, and others.
- We are still waiting on procurement and data center installation of a test GPU setup (one server with two MI210s). Once initial testing confirms the configuration works well on our infrastructure (testing should take at most a few days), we expect to proceed with the full order.
- I am starting work on a white paper that surveys all the work happening across the Wikimedia-verse around AI; this includes models WMF hosts, advocacy work done by WMF, work by volunteers, etc. If you know people I should talk with, definitely reach out.
- We are really pushing hard on getting caching deployed, because with caching we can take full advantage of the CPUs we have now by pre-caching predictions. The end result for users is that a prediction that might take 500 ms would take a fraction of that time. The current status is that our SRE is trying to get Lift Wing to talk to the Cassandra servers.
- The SLO dashboard needs fixing: it is returning absurd numbers that are clearly wrong. We are getting some outside help to track down the cause.
- Work on the Logo Detection model continues. The request to host this model comes from the Structured Content team. The goal is to detect logos on Wikimedia Commons, since logo uploads account for a sizable share of the files that frequently attract deletion requests.
- We are continuing to try to load the HuggingFace model server onto Lift Wing. When completed this offers the potential to load a model hosted on HuggingFace into Lift Wing quickly and easily, opening a huge new library of models for folks to use.
-
- We are working on deploying a model for the Structured Content team that detects potentially copyrighted image uploads on Commons, specifically images with logos. (T358676)
- We are continuing to work on hosting HuggingFace model server on Lift Wing. This would make deploying HuggingFace models super simple.
- We have deployed Dragonfly cache on Lift Wing to help with Docker image sizes.
- Our Cassandra databases for an eventual caching system are in production. There's still more work to do, but it's a good start.
- General bug fixes and updates.
-
- Sorry for the update being one day late; I (Chris) attended the Strategy meeting in NYC and am writing this update on the plane back.
- An issue we are facing is that WMF's docker registry is set up for smaller docker images (~2 GB). However, the team's docker images can get pretty big because of ROCm/PyTorch (~6-8 GB). We are working out how to resolve that. There are a number of strategies we can pursue, from better optimizing the image layers to requesting an increase in the maximum docker image size limit.
- As a partial solution, we deployed Dragonfly, which provides a peer-to-peer layer between the Kubernetes cluster and WMF's Docker registry. We plan to work on other improvements as well.
- We are continuing to work on including HuggingFace's prebuilt model server in Lift Wing. This would mean we could quickly deploy any model on HuggingFace with all the optimizations HuggingFace provides. (T357986) It isn't done yet, but it will be a huge help once it lands.
- Fixed a data-type mismatch bug for article quality scores reported on ptwiki. The error was caused by confusion in the response schema returned by ORES. (T358953)
- We requested server hardware for the next fiscal year. In short: GPUs.
-
- The GPU order is in progress. We are ordering a set of servers for both training and inference; each server will have two AMD MI210 GPUs. Most will be reserved for model inference (specifically for large models such as LLMs), and two servers (four GPUs) will be used to create a model training environment. This training environment will start very small and limited, but over time we hope it grows into the home for automated model retraining and a standard approach to model training. The next step is to secure a single server in the data center; once it has been tested, we will move to the full order.
- Caching work for Lift Wing is in progress. We are in the middle of a large GPU order, but to make the most of our resources, one of the best strategies is to do model inference on the CPUs we already have. This isn't possible in every case, for example when the set of possible model inputs is not finite. When the possible inputs are finite, however, we can cache predictions for those inputs and serve them to users quickly with minimal computation. This is similar to the system used in the original ORES days.
- Penetration testing of Lift Wing continues. We have contracted the testing out to a third party to probe for vulnerabilities to malicious code.
- The Wikimedia branding team has made proposals on naming machine learning tools and their models. The goal is names that fit the overall system rather than ad-hoc choices.
- Chris helped organize and facilitate an event in Bellagio, Italy, working on a research agenda for researchers interested in Wikipedia. A summary of that agenda is available here.
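The finite-input caching idea in the Lift Wing update above can be sketched with simple memoization (the model function here is a stand-in, not a real Lift Wing model; in production the shared cache is Cassandra rather than in-process memory, but the principle is the same):

```python
from functools import lru_cache

def expensive_model(revision_id: int) -> float:
    # Stand-in for a real inference call that might take ~500 ms on CPU.
    return (revision_id % 100) / 100.0

@lru_cache(maxsize=None)
def cached_prediction(revision_id: int) -> float:
    """Memoized wrapper: repeat requests for an already-scored input
    are served from the cache with no recomputation."""
    return expensive_model(revision_id)

print(cached_prediction(42))                # 0.42 (computed)
print(cached_prediction(42))                # 0.42 (served from cache)
print(cached_prediction.cache_info().hits)  # 1
```

Because the input space (revision IDs already scored) is finite, the cache hit rate climbs over time and most user-facing requests never touch the model at all.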