Slurm

From In The Wings

Latest revision as of 09:06, 21 February 2024

SLURM is a job scheduling and resource management package used on larger clusters.

Handy Commands

Listing jobs that have run on specific nodes

[jakers@adm2 hipergator]$ sacct --nodelist=c39a-s39 --starttime=2024-02-15T13:30
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
23813188_7      helixer        gpu jeremybra+          4  NODE_FAIL      0:0
23813188_7.+      batch            jeremybra+          4  CANCELLED
23813188_8      helixer        gpu jeremybra+          4  NODE_FAIL      0:0
23813188_8.+      batch            jeremybra+          4  CANCELLED
23813188_9      helixer        gpu jeremybra+          4  NODE_FAIL      0:0
23813188_9.+      batch            jeremybra+          4  CANCELLED
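
The default column widths truncate values (note the trailing "+"), which makes this output awkward to post-process. As a minimal sketch, assuming the goal is to pull out only the jobs that died with NODE_FAIL, the output can be requested in pipe-delimited form and filtered with awk. The helper name node_fail_jobs is made up for this example; --parsable2, --noheader and --format are standard sacct options.

```shell
# Print the IDs of jobs whose state is NODE_FAIL.
# Reads pipe-delimited "JobID|State" records on stdin, as produced by
#   sacct --parsable2 --noheader --format=jobid,state
# (node_fail_jobs is a hypothetical helper name, not part of Slurm.)
node_fail_jobs() {
    awk -F'|' '$2 == "NODE_FAIL" { print $1 }'
}

# On a cluster, using the node and start time from the example above:
#   sacct --nodelist=c39a-s39 --starttime=2024-02-15T13:30 \
#         --parsable2 --noheader --format=jobid,state | node_fail_jobs
```

Because the filter matches the State column exactly, the ".batch" step lines (state CANCELLED) are dropped and only the parent array tasks are listed.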

Listing jobs when you know the job ID

[jakers@adm2 comsol]$ sacct -j 23170593 --format=jobid,jobname,nodelist
JobID           JobName        NodeList
------------ ---------- ---------------
23170593        t5_roct      c1007a-s29
23170593.ba+      batch      c1007a-s29
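
The step ID above is cut off as "23170593.ba+" because it does not fit the default column width. sacct accepts a width after each field name in the form field%NUMBER. The sketch below builds such a --format value; wide_format is a hypothetical helper name and the width of 30 is an arbitrary choice for illustration.

```shell
# Build a --format value that gives every field the same explicit width,
# e.g. "jobid%30,jobname%30".
# Usage: wide_format WIDTH FIELD...
# (wide_format is a hypothetical helper name, not part of Slurm.)
wide_format() {
    width=$1
    shift
    out=""
    for field in "$@"; do
        # Append ",field%width", omitting the comma before the first field.
        out="${out:+$out,}${field}%${width}"
    done
    printf '%s\n' "$out"
}

# On a cluster, using the job ID from the example above:
#   sacct -j 23170593 --format="$(wide_format 30 jobid jobname nodelist)"
```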

Fields Available

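Any of the following fields can be passed to --format. Running sacct --helpformat prints the list for the installed Slurm version: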

                 Account             AdminComment        AllocCPUS           AllocNodes
                 AllocTRES           AssocID             AveCPU              AveCPUFreq
                 AveDiskRead         AveDiskWrite        AvePages            AveRSS
                 AveVMSize           BlockID             Cluster             Comment
                 Constraints         Container           ConsumedEnergy      ConsumedEnergyRaw
                 CPUTime             CPUTimeRAW          DBIndex             DerivedExitCode
                 Elapsed             ElapsedRaw          Eligible            End
                 ExitCode            Flags               GID                 Group
                 JobID               JobIDRaw            JobName             Layout
                 MaxDiskRead         MaxDiskReadNode     MaxDiskReadTask     MaxDiskWrite
                 MaxDiskWriteNode    MaxDiskWriteTask    MaxPages            MaxPagesNode
                 MaxPagesTask        MaxRSS              MaxRSSNode          MaxRSSTask
                 MaxVMSize           MaxVMSizeNode       MaxVMSizeTask       McsLabel
                 MinCPU              MinCPUNode          MinCPUTask          NCPUS
                 NNodes              NodeList            NTasks              Priority
                 Partition           QOS                 QOSRAW              Reason
                 ReqCPUFreq          ReqCPUFreqMin       ReqCPUFreqMax       ReqCPUFreqGov
                 ReqCPUS             ReqMem              ReqNodes            ReqTRES
                 Reservation         ReservationId       Reserved            ResvCPU
                 ResvCPURAW          Start               State               Submit
                 SubmitLine          Suspended           SystemCPU           SystemComment
                 Timelimit           TimelimitRaw        TotalCPU            TRESUsageInAve
                 TRESUsageInMax      TRESUsageInMaxNode  TRESUsageInMaxTask  TRESUsageInMin
                 TRESUsageInMinNode  TRESUsageInMinTask  TRESUsageInTot      TRESUsageOutAve
                 TRESUsageOutMax     TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin
                 TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot     UID
                 User                UserCPU             WCKey               WCKeyID
                 WorkDir
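
The set of fields varies between Slurm versions, so a script may want to confirm that a field exists before using it in --format. The sketch below checks a name against a field list read from stdin, ignoring case, since the examples above pass lowercase names such as jobid. The helper name has_field is made up for this example.

```shell
# Succeed (exit status 0) if the field named in $1 appears in the
# whitespace-separated field list read from stdin; comparison ignores case.
# (has_field is a hypothetical helper name, not part of Slurm.)
has_field() {
    awk -v want="$1" '
        { for (i = 1; i <= NF; i++) if (tolower($i) == tolower(want)) found = 1 }
        END { exit !found }
    '
}

# On a cluster:
#   sacct --helpformat | has_field MaxRSS && echo "MaxRSS is available"
```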